This is an overview page with metadata for this scientific work. The full article is available from the publisher.
From findings to next steps: Evaluating LLMs in abdominal ultrasound-driven clinical management
Citations: 0 · Authors: 8 · Year: 2026
Abstract
To determine whether state-of-the-art large language models (LLMs) can translate standardized abdominal ultrasound (US) findings into guideline-concordant management recommendations, we assembled 200 synthetic abdominal US cases authored by radiologists using a fixed template; 30 cases were used for prompt development and optimization, and the remaining 170 cases were used to test LLM performance. Two expert abdominal radiologists independently established the reference standard (no further imaging, US follow-up, or additional imaging), specifying follow-up intervals (months) or the preferred modality (CT, MRI, CEUS, PET). Discrepancies were resolved by consensus. Seven LLMs (GPT-5, GPT-4o, GPT-4o mini, Gemini 2.5 Pro, Gemini 2.5 Flash, Claude 4 Sonnet, DeepSeek-V3) were evaluated using a zero-shot prompt. On the 170-case test set, each model generated five independent inferences per case; the majority output defined the representative recommendation. Performance was summarized with accuracy, macro/weighted F1, Cohen’s κ, and Krippendorff’s α. Two radiologists qualitatively rated justification quality on a 5-point Likert scale. According to the radiologists, 35/170 (20.6 %) cases required no further imaging, 21/170 (12.4 %) US follow-up, and 114/170 (67.1 %) additional imaging. For management category, LLM accuracy ranged from 98.8 % (GPT-5) to 82.9 % (DeepSeek-V3); multi-rater reliability across experts and LLMs yielded α = 0.828. For modality selection, accuracy ranged from 77.9 % (GPT-5) to 60.9 % (Gemini 2.5 Flash), with most errors between CT and MRI. Follow-up interval performance was perfect for GPT-5 and GPT-4o. GPT-5 achieved the highest proportion of high-quality justifications (scores 4–5: 93.5 %). Widely accessible LLMs demonstrated strong potential in converting abdominal US reports into guideline-concordant next-step recommendations. GPT-5 delivered the best overall performance.
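The evaluation protocol described above (five inferences per case reduced to one representative recommendation by majority vote, then scored with accuracy and Cohen's κ) can be sketched in stdlib-only Python. This is a minimal illustration, not the authors' code; function names and the label strings are assumptions.

```python
from collections import Counter

def majority_vote(runs):
    """Pick the most frequent recommendation among repeated inferences."""
    return Counter(runs).most_common(1)[0][0]

def accuracy(pred, ref):
    """Fraction of cases where the model matches the reference standard."""
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)

def cohens_kappa(pred, ref):
    """Chance-corrected agreement between two raters (e.g. model vs. expert)."""
    n = len(ref)
    labels = set(pred) | set(ref)
    p_observed = sum(p == r for p, r in zip(pred, ref)) / n
    p_expected = sum((pred.count(l) / n) * (ref.count(l) / n) for l in labels)
    return (p_observed - p_expected) / (1 - p_expected)

# Five independent inferences for one hypothetical case:
runs = ["additional imaging", "additional imaging", "US follow-up",
        "additional imaging", "no further imaging"]
representative = majority_vote(runs)  # "additional imaging"
```

Majority voting damps single-run sampling noise; κ then reports agreement above what label prevalence alone would produce, which matters here because the three management categories are imbalanced (67.1 % of cases required additional imaging).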
• GPT-5 had the highest management classification accuracy (98.8 %) among 170 synthetic abdominal US cases.
• GPT-5 showed the highest modality agreement with experts (κ = 0.664), comparable to expert inter-observer agreement (κ = 0.731).
• GPT-5 justifications were high quality: 93.5 % scored 4–5 on a 5-point Likert scale.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,245 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,102 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,468 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,429 citations