This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluating large language models using national endodontic specialty examination questions: are they ready for real-world dentistry?
Citations: 10 · Authors: 2 · Year: 2025
Abstract
BACKGROUND: Large Language Models (LLMs) are artificial intelligence (AI) systems that simulate human language processing through deep learning techniques and neural networks. They are increasingly utilized for clinical decision support, student training, and enhancing educational processes. However, the reliability of AI models, especially in answering various types of questions, remains a point of debate. Standard multiple-choice questions (MCQs) involve selecting one correct answer from five options, whereas combination-type MCQs (C-MCQs) require identifying all correct statements among several alternatives. This study aims to evaluate and compare the performance of various LLMs in answering MCQs and C-MCQs in endodontics.

METHODS: A total of 151 endodontic questions were identified through a comprehensive review of publicly available Dentistry Specialty Exams in Turkey conducted since 2012. The questions were presented to eight LLMs (ChatGPT-4o, ChatGPT-4, Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 2.0 Flash, Copilot, Deepseek-V3, and Qwen2.5-Max) in Turkish. Accuracy rates for both MCQs and C-MCQs were statistically analyzed using SPSS v23 (p < 0.05).

RESULTS: ChatGPT-4o achieved the highest overall accuracy rate (81.5%), while Gemini 1.5 Flash had the lowest (57%). In standard MCQs, ChatGPT-4o significantly outperformed the other models (p < 0.001), but in C-MCQs, no significant difference was observed between the models (p = 0.179). Across all models, accuracy rates for C-MCQs were significantly lower than for MCQs (p < 0.05). Deepseek-V3 maintained a more balanced performance across question types than the other models.

CONCLUSIONS: LLMs show promising potential as educational tools in endodontics. However, their accuracy varies by question type and model. They can support student learning and clinical decision-making but cannot yet be considered a fully reliable standalone source in endodontics.
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,687 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,591 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,114 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,867 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations