This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Evaluating Large Language Models performance in Endodontics: A clinical experimental study
Citations: 0
Authors: 8
Year: 2026
Abstract
This study aims to evaluate the diagnostic accuracy, consistency, and diagnostic success rates of eight different AI-based chatbots in Endodontics. This cross-sectional study evaluated the diagnostic accuracy of eight diverse AI models, selected for architectural/developer heterogeneity and clinical relevance, using 12 validated fictitious endodontic cases aligned with AAE guidelines; ethical approval was waived as no human data were used. STROBE guidelines were followed to ensure methodological rigor. Standardized prompts ensured uniformity, with three independent executions per case to assess consistency. Responses were anonymized and evaluated by blinded, calibrated reviewers, and statistical analysis included Kruskal-Wallis, Dunn's tests, Fleiss' Kappa, and chi-square to compare diagnostic/treatment accuracy and intramodel agreement. The analysis revealed significant variation in diagnostic accuracy among AI models (p < 0.001), with ChatGPT o1 (97%), Claude (97%), and DeepSeek (90.9%) outperforming Gemini (54.5%). Treatment recommendations showed uniformly high accuracy (97–100%, p = 0.537). Multivariate regression confirmed ChatGPT o1 (OR = 32.7) and Claude (OR = 30.5) as superior, though complex diagnoses (e.g., acute apical abscess, asymptomatic irreversible pulpitis) reduced accuracy (OR = 0.01–0.3, p < 0.05). Stratified analysis identified model-specific vulnerabilities: Gemini failed in reversible pulpitis (0/3, p = 0.001) and chronic apical abscess (0/3, p = 0.001), while ChatGPT o1 struggled with acute apical abscess (0/3, p < 0.001). Overall agreement was 93%, with high intraclass reliability (ICC > 0.85) for top models versus Gemini (ICC = 0.65). Fleiss' Kappa indicated only moderate agreement (κ = 0.28–0.45) in ambiguous cases, emphasizing heterogeneous reliability. In conclusion, seven AI chatbots demonstrated high accuracy in endodontic cases and can be considered helpful tools to complement clinical practice.
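The abstract names the statistical tests used to compare model accuracy and run-to-run consistency. As a rough orientation only, the sketch below shows how such a comparison could be set up in Python with SciPy and statsmodels; it is not the authors' analysis code, and the model names, case counts, and scores are synthetic stand-ins.

```python
# Illustrative sketch (not the study's code): Kruskal-Wallis across chatbots on
# synthetic per-response correctness, plus Fleiss' kappa for repeated-run
# agreement of one model. All data below are made up for demonstration.
import numpy as np
from scipy.stats import kruskal
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)

# Hypothetical scores: 12 cases x 3 runs = 36 responses per model,
# coded 1 = correct diagnosis, 0 = incorrect.
scores = {
    "ChatGPT o1": rng.binomial(1, 0.97, 36),
    "Claude":     rng.binomial(1, 0.97, 36),
    "DeepSeek":   rng.binomial(1, 0.91, 36),
    "Gemini":     rng.binomial(1, 0.55, 36),
}

# Kruskal-Wallis: do accuracy distributions differ across the models?
h_stat, p_value = kruskal(*scores.values())
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.4f}")

# Fleiss' kappa for one model's consistency: 12 cases, 3 independent runs,
# each run producing one of 4 hypothetical diagnostic categories.
runs = rng.integers(0, 4, size=(12, 3))   # rows = cases, columns = runs
counts, _ = aggregate_raters(runs)        # case-by-category count table
print(f"Fleiss' kappa = {fleiss_kappa(counts):.2f}")
```

In the study itself, the correctness ratings come from blinded, calibrated reviewers scoring the anonymized chatbot responses, not from simulated values.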
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,260 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,116 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,493 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,438 citations