This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Performance of Large Language Models in Oral Health Consultations and the Consistency of the ‘AI-as-a-Judge’ Framework
Citations: 0
Authors: 3
Year: 2026
Abstract
OBJECTIVE: To evaluate the performance of large language models (LLMs) in responding to oral health consultations and to examine the consistency between the AI-as-a-Judge evaluation framework and human expert ratings. METHODS: Nine oral health questions were selected from the World Dental Federation (FDI) official website and posed to 6 models: GPT-5.0, Gemini-3.0, DeepSeek-V3, Qwen3-Max, Kimi-K2 and Doubao-1.8-Pro. Responses were independently scored by 2 clinicians and 3 AI judges. RESULTS: Significant performance differences were observed among the 6 models, with DeepSeek-V3 and Doubao-1.8-Pro achieving the best results. Inter-rater consistency among human experts was good (ICC = 0.860), while consistency among AI judges was low (ICC = 0.538). Human-AI consistency was extremely low (ICC = 0.215) and AI judges exhibited a significantly stricter scoring tendency. CONCLUSION: Leading domestic LLMs have attained competitive performance in oral health consultations. However, the current 'AI-as-a-Judge' framework demonstrates significant inconsistency and bias compared to human experts, suggesting that automated AI evaluation systems are not yet a reliable substitute for human expert review in clinical contexts.
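The abstract reports intraclass correlation coefficients (ICC) but does not say which ICC form was used. As a rough illustration only, the sketch below assumes ICC(2,1) (two-way random effects, absolute agreement, single rater) and uses a hypothetical helper name, `icc2_1`. The example ratings are invented for demonstration and are not data from the study.

```python
def icc2_1(ratings):
    """ICC(2,1) for a table of ratings: rows = rated items, columns = raters.

    Assumption: two-way random effects, absolute agreement, single rater.
    The source abstract does not specify the ICC form actually used.
    """
    n = len(ratings)       # number of rated items (e.g. model responses)
    k = len(ratings[0])    # number of raters (e.g. clinicians or AI judges)
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table with >= 2 rows and >= 2 columns")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    # Mean squares from a two-way ANOVA without replication.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    denom = ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    if denom == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_rows - ms_err) / denom


# Invented example: two raters who rank items identically, but the second
# rater scores every item two points differently from the first.
identical = [[1, 1], [2, 2], [3, 3]]
offset = [[1, 3], [2, 4], [3, 5]]
print(icc2_1(identical))
print(icc2_1(offset))
```

Under this absolute-agreement form, a constant scoring offset between raters lowers the ICC even when their rankings match perfectly. That is one way a systematically different scoring tendency, such as the one reported for the AI judges, could depress human-AI consistency. A consistency-type ICC would not penalize such an offset.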
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,611 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,504 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,025 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,835 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations