This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Across Generations, Sizes, and Types, Large Language Models Poorly Report Self-Confidence in Gastroenterology Clinical Reasoning Tasks
Citations: 1
Authors: 8
Year: 2025
Abstract
This study evaluated confidence calibration across 48 large language models (LLMs) using 300 gastroenterology board-exam-style questions. Regardless of response accuracy, all models demonstrated poor certainty estimation. Even the best-calibrated systems (o1 preview, GPT-4o, Claude-3.5-Sonnet) showed substantial overconfidence (Brier scores 0.15-0.2, AUROC ~0.6). Most concerning, models maintained high certainty regardless of question difficulty or their actual knowledge limitations. This metacognitive deficiency poses significant challenges for safe clinical implementation of current LLMs in gastroenterology.
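For readers unfamiliar with the two calibration metrics named in the abstract, the following is a minimal sketch of how a Brier score and an AUROC can be computed from self-reported confidences and correctness labels. It is not the authors' evaluation code. The function names `brier_score` and `auroc` and the four-question example data are assumptions made purely for illustration.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence (0..1) and outcome (1 = correct, 0 = wrong)."""
    if len(confidences) != len(correct) or len(correct) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)


def auroc(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Probability that a correct answer carries higher confidence than an incorrect one.

    Computed by pairwise comparison; tied confidences count as half a win.
    """
    pos = [c for c, y in zip(confidences, correct) if y == 1]
    neg = [c for c, y in zip(confidences, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data (not from the study): a model that reports 90% confidence
# on every question but answers only half of them correctly.
example_confidences = [0.9, 0.9, 0.9, 0.9]
example_correct = [1, 0, 1, 0]

example_brier = brier_score(example_confidences, example_correct)
example_auroc = auroc(example_confidences, example_correct)
```

On this made-up data the Brier score is about 0.41 and the AUROC is 0.5: uniformly high confidence is penalized on every wrong answer, and it carries no information about which answers are correct. Lower Brier scores are better (0 is perfect), and an AUROC of 0.5 corresponds to chance-level discrimination.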