Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.

Prompt-dependent performance of multimodal AI model in oral diagnosis: a comprehensive analysis of accuracy, narrative quality, calibration, and latency versus human experts

2025·13 Zitationen·Scientific ReportsOpen Access

Volltext beim Verlag öffnen

Zitationen

Autoren

2025

Jahr

Abstract

Prompt design is a critical yet underexplored factor influencing the diagnostic performance of large language models (LLMs). Gemini Pro 2.5 shows promise in multimodal reasoning, but no prior study has systematically compared prompt structures in oral datasets against expert benchmarks. This study aimed to evaluate the diagnostic performance of a multimodal LLM (Gemini Pro 2.5) under different prompting strategies compared with oral medicine experts using prospective, histopathology-verified clinical vignettes. In a prospective, paired diagnostic accuracy study, Gemini pro 2.5 (a multimodal LLM) was evaluated under three prompting strategies: Direct (P-1), Chain-of-Thought (P-2), and Self-Reflection (P-3) on 300 oral lesion cases with histopathologic confirmation. Each prompt was applied to identical inputs and compared against diagnoses from board-certified oral medicine specialists. Accuracy, rubric-based narrative quality, probability calibration, and computational efficiency were assessed under STARD-AI guidelines. Human experts achieved the highest Top-1 accuracy (61%), but Chain-of-Thought prompting (P-2) led AI performance in Top-3 accuracy (82%) and produced the highest explanation quality (mean rubric score 8.49/10). No AI prompt matched human performance in low-difficulty cases. P-2 also showed the best calibration (Brier score 0.238) compared to P-1 and P-3. Resource-wise, Direct prompting was fastest, but longer outputs modestly improved Top-3 recall. Mixed-effects modeling confirmed that AI performance varied significantly by prompt structure, highlighting context-specific trade-offs. Prompt structure significantly affects the diagnostic performance and interpretability of AI-generated differentials in oral lesion diagnosis. While expert clinicians remain superior in straightforward cases, structured prompting, particularly Chain-of-Thought, may enhance AI reliability in complex diagnostic scenarios. These findings support the integration of prompt engineering into AI-assisted diagnostic tools to augment clinical decision-making in oral medicine.

Autoren

Institutionen

Themen

Artificial Intelligence in Healthcare and EducationRadiomics and Machine Learning in Medical ImagingClinical Reasoning and Diagnostic Skills

Volltext beim Verlag öffnen

Prompt-dependent performance of multimodal AI model in oral diagnosis: a comprehensive analysis of accuracy, narrative quality, calibration, and latency versus human experts

Abstract

Ähnliche Arbeiten

Autoren

Institutionen

Themen