This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Consistency-Based Confidence in Multimodal Large Language Models on Radiology Cases: Comparison with Self-Report (Preprint)
0
Citations
4
Authors
2025
Year
Abstract
<sec> <title>BACKGROUND</title> Large language models (LLMs) require specialized methodologies to quantify model confidence for safe deployment in healthcare systems; however, established methods for confidence assessment are lacking. </sec> <sec> <title>OBJECTIVE</title> To evaluate output consistency as a confidence metric for multimodal LLMs interpreting radiology cases, and to compare it with self-reported confidence. </sec> <sec> <title>METHODS</title> From a total of 311 quizzes on the Korean Society of Ultrasound in Medicine digital platform, we selected 75 multiple-choice cases. Six multimodal LLMs were evaluated: three reasoning-focused models (o1, Claude-3.7-Sonnet, Gemini-2.5-Pro) and three general models (GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro). Temperature was fixed at 1.0. Two confidence metrics were assessed: (i) self-reported confidence, elicited with prompts that requested a direct confidence percentage alongside the answer, and (ii) consistency-based metrics derived from 20 repeated interpretations per case, including relative entropy (R_H), calculated as 1 - H/log₂k (H = Shannon entropy, k = number of repetitions), and majority vote proportion. Receiver operating characteristic (ROC) analysis for discrimination and Spearman correlation (r) between accuracy and each confidence metric were conducted. Additionally, model calibration was assessed using the Expected Calibration Error (ECE). </sec> <sec> <title>RESULTS</title> Consistency-based metrics demonstrated significant correlations with diagnostic accuracy for Claude-3.7-Sonnet (percentage, r=0.314; R_H, r=0.310), Gemini-2.5-Pro (percentage, r=0.354; R_H, r=0.347), and GPT-4o (percentage, r=0.321; R_H, r=0.318). ROC analysis revealed that consistency-based metrics outperformed self-reported confidence in discriminative ability, with area under the curve values of 0.663 (95% CI: 0.545–0.768) for Claude-3.7-Sonnet, 0.694 (95% CI: 0.577–0.795) for Gemini-2.5-Pro, and 0.671 (95% CI: 0.533–0.775) for GPT-4o.
For consistency-based metrics, regular ECE (10-bin) ranged from 0.313 to 0.485, while optimal ECE ranged from 0.276 to 0.478 across varying bin configurations. </sec> <sec> <title>CONCLUSIONS</title> For multimodal LLMs applied to radiology cases, consistency-based metrics provide a more dependable indicator of diagnostic confidence than self-reported confidence. </sec>
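The two consistency-based metrics and the ECE described in the abstract can be sketched as follows. This is an illustrative implementation, not the authors' code: function names and inputs are assumed, and R_H follows the abstract's stated formula 1 - H/log₂k with k = number of repetitions.

```python
import math
from collections import Counter

def consistency_confidence(answers):
    """Consistency-based confidence from repeated LLM answers to one case.

    Returns (R_H, majority_vote_proportion), where
    R_H = 1 - H / log2(k), H is the Shannon entropy of the answer
    distribution, and k is the number of repetitions (per the abstract;
    e.g. k = 20 repeated interpretations per case).
    """
    k = len(answers)
    counts = Counter(answers)
    h = -sum((c / k) * math.log2(c / k) for c in counts.values())
    r_h = 1 - h / math.log2(k)
    majority = max(counts.values()) / k
    return r_h, majority

def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error: bin cases by confidence, then average
    |mean confidence - accuracy| per bin, weighted by bin size."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        in_bin = [i for i, c in enumerate(confidences)
                  if lo < c <= hi or (b == 0 and c == 0.0)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        acc = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += len(in_bin) / n * abs(avg_conf - acc)
    return ece
```

For example, 18 identical answers out of 20 repetitions yield a majority-vote proportion of 0.9 and an R_H close to 0.89, while perfectly consistent answers give both metrics a value of 1.0.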
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,493 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,377 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,835 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,555 citations