OpenAlex · Updated hourly · Last updated: Mar 17, 2026, 19:35

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Comparative evaluation of the responses from ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.1 chatbots to patient inquiries about endodontic treatment in terms of accuracy, understandability, and readability

2025 · 1 citation · 5 authors · International Dental Research · Open Access


Abstract

Aim: To compare three large language models (LLMs), ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.1, in terms of the accuracy, understandability, and readability of their answers to frequently asked endodontic questions.

Methodology: Thirty open-ended frequently asked questions were generated using the AlsoAsked and AnswerThePublic websites. Two experienced endodontists scored the accuracy of the responses on a 5-point Likert scale. Understandability was assessed with the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P). Readability was assessed using the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Simple Measure of Gobbledygook (SMOG), and Coleman-Liau Index (CLI). Group comparisons were performed with ANOVA or the Kruskal-Wallis test, followed by post-hoc Dunn-Bonferroni tests.

Results: Inter-rater agreement was excellent (accuracy ICC: 0.908–0.917; reliability ICC: 0.992–0.995; all p<0.001). The models differed significantly in accuracy (p<0.001): DeepSeek-V3.1 (4.63±0.81) scored highest and performed significantly better than ChatGPT-5 (3.93±0.79) and Gemini 2.5 Flash (3.67±0.76); there was no significant difference between ChatGPT-5 and Gemini 2.5 Flash (p>0.05). Understandability (PEMAT-P) scores were similar (p=0.683), and all models scored above 70% (ChatGPT-5, 77.46%; Gemini 2.5 Flash, 76.04%; DeepSeek-V3.1, 77.57%). Readability metrics differed: DeepSeek-V3.1 scored higher than ChatGPT-5 on FRES (p=0.044); Gemini 2.5 Flash scored higher than DeepSeek-V3.1 on FKGL (p=0.001); on GFI, Gemini 2.5 Flash scored higher than both ChatGPT-5 (p=0.036) and DeepSeek-V3.1 (p<0.001); on SMOG, Gemini 2.5 Flash scored higher than DeepSeek-V3.1 (p=0.003); and on CLI, ChatGPT-5 scored higher than DeepSeek-V3.1 (p=0.004). No significant correlation was found between readability and understandability (p>0.05).

Conclusion: DeepSeek-V3.1 outperformed ChatGPT-5 and Gemini 2.5 Flash in accuracy. While all models scored similarly above the PEMAT-P understandability threshold (70%), readability metrics differed significantly, and no model consistently reached the recommended 6th–8th grade reading level.

How to cite this article: Taşyürek M, Adıgüzel Ö, Gündoğar M, Goncharuk-Khomyn M, Ortaç H. Comparative evaluation of the responses from ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.1 chatbots to patient inquiries about endodontic treatment in terms of accuracy, understandability, and readability. Int Dent Res. 2025;15(3) (Advanced Online). https://doi.org/10.5577/intdentres.662
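For readers unfamiliar with the readability indices above: all five are closed-form formulas over sentence, word, syllable, and letter counts. The Python sketch below is an illustration of those standard formulas only, not the study's actual scoring pipeline; in particular, the vowel-group syllable counter is a rough heuristic, whereas published analyses typically rely on validated calculators.

import math
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: one syllable per vowel group (real tools use dictionaries).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    letters = sum(len(w) for w in words)

    asl = n / sentences        # average sentence length (words per sentence)
    asw = syllables / n        # average syllables per word
    L = 100 * letters / n      # letters per 100 words
    S = 100 * sentences / n    # sentences per 100 words

    return {
        "FRES": 206.835 - 1.015 * asl - 84.6 * asw,   # higher = easier
        "FKGL": 0.39 * asl + 11.8 * asw - 15.59,      # US grade level
        "GFI":  0.4 * (asl + 100 * complex_words / n),
        # SMOG is defined for 30-sentence samples; this scales shorter text.
        "SMOG": 1.0430 * math.sqrt(complex_words * 30 / sentences) + 3.1291,
        "CLI":  0.0588 * L - 0.296 * S - 15.8,
    }

sample = ("Root canal treatment removes the infected pulp. "
          "The tooth is then cleaned, shaped, and sealed.")
print({k: round(v, 2) for k, v in readability(sample).items()})

Note that the indices point in opposite directions: a higher FRES means easier text, while higher FKGL, GFI, SMOG, and CLI values mean harder text, so "scored higher" in the results above does not uniformly mean "more readable".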
