OpenAlex · Updated hourly · Last updated: 08.05.2026, 19:40

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Performance of Large Language Models in Oral Health Consultations and the Consistency of the ‘AI-as-a-Judge’ Framework

2026 · 0 citations · International Dental Journal · Open Access
Open full text at publisher

Citations: 0 · Authors: 3 · Year: 2026

Abstract

OBJECTIVE: To evaluate the performance of large language models (LLMs) in responding to oral health consultations and to examine the consistency between the AI-as-a-Judge evaluation framework and human expert ratings. METHODS: Nine oral health questions were selected from the World Dental Federation (FDI) official website and posed to 6 models: GPT-5.0, Gemini-3.0, DeepSeek-V3, Qwen3-Max, Kimi-K2 and Doubao-1.8-Pro. Responses were independently scored by 2 clinicians and 3 AI judges. RESULTS: Significant performance differences were observed among the 6 models, with DeepSeek-V3 and Doubao-1.8-Pro achieving the best results. Inter-rater consistency among human experts was good (ICC = 0.860), while consistency among AI judges was low (ICC = 0.538). Human-AI consistency was extremely low (ICC = 0.215) and AI judges exhibited a significantly stricter scoring tendency. CONCLUSION: Leading domestic LLMs have attained competitive performance in oral health consultations. However, the current 'AI-as-a-Judge' framework demonstrates significant inconsistency and bias compared to human experts, suggesting that automated AI evaluation systems are not yet a reliable substitute for human expert review in clinical contexts.
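The abstract reports rater agreement as intraclass correlation coefficients (ICC = 0.860 for human experts, 0.538 among AI judges, 0.215 human-AI) without specifying which ICC form was used. As an illustration only, here is a minimal plain-Python sketch of ICC(2,1) (two-way random effects, absolute agreement, single rater), a common choice for this kind of rating design; the function name and data layout are assumptions, not taken from the paper:

```python
def icc2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: list of rows, one row per rated target (e.g. one model response),
    one column per rater. Computed from the two-way ANOVA decomposition.
    """
    n = len(scores)        # number of targets
    k = len(scores[0])     # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]

    # Partition total sum of squares into rows (targets), columns (raters), error.
    ss_total = sum((scores[i][j] - grand) ** 2
                   for i in range(n) for j in range(k))
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    ms_r = ss_rows / (n - 1)              # between-target mean square
    ms_c = ss_cols / (k - 1)              # between-rater mean square
    ms_e = ss_err / ((n - 1) * (k - 1))   # residual mean square

    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
```

Because ICC(2,1) measures absolute agreement, a rater who is systematically stricter (as the AI judges were here) lowers the coefficient even when rankings agree: `icc2_1([[1, 1], [2, 2], [3, 3]])` returns 1.0, while shifting the second rater up by one point, `icc2_1([[1, 2], [2, 3], [3, 4]])`, drops it to about 0.67.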

Topics

Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling