OpenAlex · Updated hourly · Last updated: 22.03.2026, 18:48

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Comparative Evaluation of Responses from ChatGPT-5, Gemini 2.5 Flash, Grok 4, and Claude Sonnet-4 Chatbots to Questions About Endodontic Iatrogenic Events

2025 · 3 citations · Healthcare · Open Access
Open full text at publisher

Citations: 3 · Authors: 3 · Year: 2025

Abstract

<b>Background:</b> The aim of this study was to compare four recently introduced LLMs (ChatGPT-5, Grok 4, Gemini 2.5 Flash, and Claude Sonnet-4). Experienced endodontists evaluated the accuracy, completeness, and readability of the responses given to open-ended questions about iatrogenic events in endodontics. <b>Methods:</b> Twenty-five open-ended questions related to iatrogenic events in endodontics were prepared. The responses of the four LLMs were evaluated by two specialist endodontists using a Likert scale for accuracy and completeness, and the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Simplified Measure of Gobbledygook (SMOG), and Coleman-Liau Index (CLI) for readability. <b>Results:</b> The accuracy score of ChatGPT-5's responses to open-ended questions (4.56 ± 0.65) was significantly higher than those of Gemini 2.5 Flash (3.64 ± 0.95) and Claude Sonnet-4 (3.44 ± 1.19) (<i>p</i> = 0.009 and <i>p</i> = 0.002, respectively). Similarly, the completeness score of ChatGPT-5 (2.88 ± 0.33) was higher than those of Claude Sonnet-4, Gemini 2.5 Flash, and Grok 4 (<i>p</i> < 0.001, <i>p</i> = 0.002, and <i>p</i> = 0.007, respectively). In terms of readability measures, ChatGPT-5 and Gemini 2.5 Flash achieved better FRESs than Claude Sonnet-4 (<i>p</i> = 0.003 and <i>p</i> < 0.001, respectively). Conversely, FKGL scores were higher for Claude Sonnet-4 and Grok 4 compared to ChatGPT-5 (<i>p</i> < 0.001 and <i>p</i> = 0.008, respectively). Correlation analyses revealed a strong positive association (r<sub>s</sub> = 0.77; <i>p</i> < 0.001) between accuracy and completeness, a weak negative correlation (r<sub>s</sub> = -0.19; <i>p</i> = 0.047) between completeness and FKGL, and a strong negative correlation between FKGL and FRES (r<sub>s</sub> = -0.88; <i>p</i> < 0.001). Additionally, ChatGPT-5 demonstrated lower GFI and CLI scores than the other models, while its SMOG scores were lower than those of Gemini 2.5 Flash and Grok 4 (<i>p</i> = 0.001 and <i>p</i> < 0.001, respectively). <b>Conclusions:</b> Although differences were observed between the LLMs in the accuracy and completeness of their responses, ChatGPT-5 showed the best performance. Even when responses score highly on accuracy (excellent) and completeness (comprehensive), it must not be forgotten that incorrect information can lead to serious outcomes in healthcare services. The readability of responses is therefore of critical importance, and readability should be evaluated together with content quality when selecting a model.
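Two of the readability indices named in the abstract, FRES and FKGL, are standard published formulas over word, sentence, and syllable counts. The sketch below implements them for illustration; the study's own counting tool is not specified, and the input counts here are hypothetical.

```python
# Standard Flesch Reading Ease (FRES) and Flesch-Kincaid Grade Level (FKGL)
# formulas. Inputs are raw counts from the text being scored; how syllables
# are counted varies by tool, so counts here are supplied by the caller.

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Higher score = easier to read (0-100 scale in practice)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Approximate U.S. school grade level; lower = easier."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Hypothetical chatbot response: 100 words, 5 sentences, 150 syllables.
print(flesch_reading_ease(100, 5, 150))   # ~59.6, "fairly difficult" band
print(flesch_kincaid_grade(100, 5, 150))  # ~9.9, about 10th-grade level
```

Note the inverse relationship built into the formulas: longer sentences and more syllables per word push FRES down and FKGL up, which is consistent with the strong negative FKGL-FRES correlation the study reports.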

Topics

Artificial Intelligence in Healthcare and Education · AI in Service Interactions · COVID-19 diagnosis using AI