This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Comparative Evaluation of Responses from ChatGPT-5, Gemini 2.5 Flash, Grok 4, and Claude Sonnet-4 Chatbots to Questions About Endodontic Iatrogenic Events
Citations: 3
Authors: 3
Year: 2025
Abstract
<b>Background:</b> The aim of this study was to compare four recently introduced LLMs (ChatGPT-5, Grok 4, Gemini 2.5 Flash, and Claude Sonnet-4). Experienced endodontists evaluated the accuracy, completeness, and readability of the responses given to open-ended questions about iatrogenic events in endodontics. <b>Methods:</b> Twenty-five open-ended questions related to iatrogenic events in endodontics were prepared. The responses of the four LLMs were evaluated by two specialist endodontists using a Likert scale for accuracy and completeness, and the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Simplified Measure of Gobbledygook (SMOG), and Coleman-Liau Index (CLI) for readability. <b>Results:</b> The accuracy score of ChatGPT-5's responses to open-ended questions (4.56 ± 0.65) was significantly higher than those of Gemini 2.5 Flash (3.64 ± 0.95) and Claude Sonnet-4 (3.44 ± 1.19) (<i>p</i> = 0.009 and <i>p</i> = 0.002, respectively). Similarly, the completeness score of ChatGPT-5 (2.88 ± 0.33) was higher than those of Claude Sonnet-4, Gemini 2.5 Flash, and Grok 4 (<i>p</i> < 0.001, <i>p</i> = 0.002, and <i>p</i> = 0.007, respectively). In terms of readability measures, ChatGPT-5 and Gemini 2.5 Flash achieved better FRESs than Claude Sonnet-4 (<i>p</i> = 0.003 and <i>p</i> < 0.001, respectively). Conversely, FKGL scores were higher for Claude Sonnet-4 and Grok 4 compared to ChatGPT-5 (<i>p</i> < 0.001 and <i>p</i> = 0.008, respectively). Correlation analyses revealed a strong positive association (r<sub>s</sub> = 0.77; <i>p</i> < 0.001) between accuracy and completeness, a weak negative correlation (r<sub>s</sub> = -0.19; <i>p</i> = 0.047) between completeness and FKGL, and a strong negative correlation (r<sub>s</sub> = -0.88; <i>p</i> < 0.001) between FKGL and FRES.
Additionally, ChatGPT-5 demonstrated lower GFI and CLI scores than the other models, while its SMOG scores were lower than those of Gemini 2.5 Flash and Grok 4 (<i>p</i> = 0.001 and <i>p</i> < 0.001, respectively). <b>Conclusions:</b> Although differences were observed between the LLMs in terms of the accuracy and completeness of their responses, ChatGPT-5 showed the best performance. Even when responses score highly for accuracy (excellent) and completeness (comprehensive), it must not be forgotten that incorrect information can lead to serious outcomes in healthcare. Therefore, the readability of responses is of critical importance, and when selecting a model, readability should be evaluated together with content quality.
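The readability indices used in the Methods are standard published formulas computed from simple text statistics. As an illustration (not taken from the article, which does not specify its scoring tool), the widely used definitions of FRES, FKGL, and GFI can be sketched as follows:

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease Score (FRES): higher values mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level (FKGL): approximate U.S. school grade needed."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog Index (GFI): complex words have three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))

# Example counts (hypothetical): 100 words, 5 sentences, 150 syllables,
# 15 complex words.
print(flesch_reading_ease(100, 5, 150))   # 59.635
print(flesch_kincaid_grade(100, 5, 150))  # ~9.91
print(gunning_fog(100, 5, 15))            # 14.0
```

Because FRES falls as sentence and word length grow while FKGL rises with them, the two indices move in opposite directions by construction, which is consistent with the strong negative FKGL-FRES correlation reported in the Results.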
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,287 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,140 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,534 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,450 citations