This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Reliability and Performance of Four Large Language Models in Orthodontic Knowledge Assessment
Citations: 0
Authors: 4
Year: 2025
Abstract
Artificial intelligence-based large language models (LLMs) are gaining prominence as educational tools. This study evaluated the accuracy and reliability of four popular, publicly available LLMs (ChatGPT 4.0, ChatGPT 4o, Google Gemini, and Microsoft CoPilot) in answering orthodontic questions from the National Board of Dental Examiners examinations. Each model was tested across three trials to assess response consistency. Reliability was analyzed using Cohen's and Fleiss' kappa. Among the four models tested, Microsoft CoPilot demonstrated the highest reliability, while ChatGPT 4.0 had the highest accuracy. Variability across trials suggests that AI-generated responses remain inconsistent, which limits the standalone applicability of LLMs in orthodontic education. Older models at times outperformed newer ones; model updates do not necessarily lead to improved reliability. Although AI models may have potential as supplementary study aids, their accuracy and stability require further refinement before they are deployed in educational contexts.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,245 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,102 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,468 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,429 citations