OpenAlex · Updated hourly · Last updated: Mar 17, 2026, 19:35

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Comparative evaluation of the responses from ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.1 chatbots to patient inquiries about endodontic treatment in terms of accuracy, understandability, and readability

2025 · 1 citation · 5 authors · International Dental Research · Open Access


Abstract

Aim: To compare three large language models (LLMs), ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.1, in terms of the accuracy, understandability, and readability of their answers to frequently asked endodontic questions.

Methodology: Thirty open-ended frequently asked questions were generated using the AlsoAsked and AnswerThePublic websites. Two experienced endodontists scored the accuracy of the responses on a 5-point Likert scale. Understandability was assessed with the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P). Readability was assessed using the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Simple Measure of Gobbledygook (SMOG), and Coleman-Liau Index (CLI). Group comparisons were performed with ANOVA or the Kruskal-Wallis test, followed by post-hoc Dunn-Bonferroni tests.

Results: Inter-rater agreement was excellent (accuracy ICC: 0.908–0.917; reliability ICC: 0.992–0.995; all p<0.001). The models differed significantly in accuracy (p<0.001): DeepSeek-V3.1 (4.63±0.81) scored highest and performed significantly better than ChatGPT-5 (3.93±0.79) and Gemini 2.5 Flash (3.67±0.76); there was no significant difference between ChatGPT-5 and Gemini 2.5 Flash (p>0.05). Understandability (PEMAT-P) scores were similar (p=0.683), and all models scored above 70% (ChatGPT-5, 77.46%; Gemini 2.5 Flash, 76.04%; DeepSeek-V3.1, 77.57%). Readability metrics differed: DeepSeek-V3.1 scored higher than ChatGPT-5 on FRES (p=0.044); Gemini 2.5 Flash scored higher than DeepSeek-V3.1 on FKGL (p=0.001); on GFI, Gemini 2.5 Flash scored higher than both ChatGPT-5 (p=0.036) and DeepSeek-V3.1 (p<0.001); on SMOG, Gemini 2.5 Flash scored higher than DeepSeek-V3.1 (p=0.003); and on CLI, ChatGPT-5 scored higher than DeepSeek-V3.1 (p=0.004). No significant correlation was found between readability and understandability (p>0.05).

Conclusion: DeepSeek-V3.1 outperformed ChatGPT-5 and Gemini 2.5 Flash in accuracy. While all models scored similarly above the PEMAT-P understandability threshold (70%), readability metrics differed significantly, and no model consistently reached the recommended 6th–8th grade reading level.

How to cite this article: Taşyürek M, Adıgüzel Ö, Gündoğar M, Goncharuk-Khomyn M, Ortaç H. Comparative evaluation of the responses from ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.1 chatbots to patient inquiries about endodontic treatment in terms of accuracy, understandability, and readability. Int Dent Res. 2025;15(3) (Advanced Online). https://doi.org/10.5577/intdentres.662
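For readers unfamiliar with the readability indices above: all five are closed-form formulas over sentence, word, syllable, and letter counts. The Python sketch below is an illustration of those standard formulas only, not the study's actual scoring pipeline; in particular, the vowel-group syllable counter is a rough heuristic, whereas published analyses typically rely on validated calculators.

import math
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: one syllable per vowel group (real tools use dictionaries).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    letters = sum(len(w) for w in words)

    asl = n / sentences        # average sentence length (words per sentence)
    asw = syllables / n        # average syllables per word
    L = 100 * letters / n      # letters per 100 words
    S = 100 * sentences / n    # sentences per 100 words

    return {
        "FRES": 206.835 - 1.015 * asl - 84.6 * asw,   # higher = easier
        "FKGL": 0.39 * asl + 11.8 * asw - 15.59,      # US grade level
        "GFI":  0.4 * (asl + 100 * complex_words / n),
        # SMOG is defined for 30-sentence samples; this scales shorter text.
        "SMOG": 1.0430 * math.sqrt(complex_words * 30 / sentences) + 3.1291,
        "CLI":  0.0588 * L - 0.296 * S - 15.8,
    }

sample = ("Root canal treatment removes the infected pulp. "
          "The tooth is then cleaned, shaped, and sealed.")
print({k: round(v, 2) for k, v in readability(sample).items()})

Note that the indices point in opposite directions: a higher FRES means easier text, while higher FKGL, GFI, SMOG, and CLI values mean harder text, so "scored higher" in the results above does not uniformly mean "more readable".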
