This is an overview page with metadata for this scientific work. The full article is available from the publisher.
From findings to next steps: Evaluating LLMs in abdominal ultrasound-driven clinical management
Citations: 0 · Authors: 8 · Year: 2026
Abstract
To determine whether state-of-the-art large language models (LLMs) can translate standardized abdominal ultrasound (US) findings into guideline-concordant management recommendations, we assembled 200 synthetic abdominal US cases authored by radiologists using a fixed template; 30 cases were used for prompt development and optimization, and the remaining 170 cases were used to test LLM performance. Two expert abdominal radiologists independently established the reference standard (no further imaging, US follow-up, or additional imaging), specifying follow-up intervals (months) or the preferred modality (CT, MRI, CEUS, PET). Discrepancies were resolved by consensus. Seven LLMs (GPT-5, GPT-4o, GPT-4o mini, Gemini 2.5 Pro, Gemini 2.5 Flash, Claude 4 Sonnet, DeepSeek-V3) were evaluated using a zero-shot prompt. On the 170-case test set, each model generated five independent inferences per case; the majority output defined the representative recommendation. Performance was summarized with accuracy, macro/weighted F1, Cohen’s κ, and Krippendorff’s α. Two radiologists qualitatively rated justification quality on a 5-point Likert scale. According to the radiologists, 35/170 (20.6 %) cases required no further imaging, 21/170 (12.4 %) US follow-up, and 114/170 (67.1 %) additional imaging. For management category, LLM accuracy ranged from 98.8 % (GPT-5) to 82.9 % (DeepSeek-V3); multi-rater reliability across experts and LLMs yielded α = 0.828. For modality selection, accuracy ranged from 77.9 % (GPT-5) to 60.9 % (Gemini 2.5 Flash), with most errors between CT and MRI. Follow-up interval performance was perfect for GPT-5 and GPT-4o. GPT-5 achieved the highest proportion of high-quality justifications (scores 4–5: 93.5 %). Widely accessible LLMs demonstrated strong potential in converting abdominal US reports into guideline-concordant next-step recommendations. GPT-5 delivered the best overall performance.
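The evaluation protocol described above (five inferences per case reduced to one representative recommendation by majority vote, then scored with accuracy and Cohen's κ) can be sketched in stdlib-only Python. This is a minimal illustration, not the authors' code; function names and the label strings are assumptions.

```python
from collections import Counter

def majority_vote(runs):
    """Pick the most frequent recommendation among repeated inferences."""
    return Counter(runs).most_common(1)[0][0]

def accuracy(pred, ref):
    """Fraction of cases where the model matches the reference standard."""
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)

def cohens_kappa(pred, ref):
    """Chance-corrected agreement between two raters (e.g. model vs. expert)."""
    n = len(ref)
    labels = set(pred) | set(ref)
    p_observed = sum(p == r for p, r in zip(pred, ref)) / n
    p_expected = sum((pred.count(l) / n) * (ref.count(l) / n) for l in labels)
    return (p_observed - p_expected) / (1 - p_expected)

# Five independent inferences for one hypothetical case:
runs = ["additional imaging", "additional imaging", "US follow-up",
        "additional imaging", "no further imaging"]
representative = majority_vote(runs)  # "additional imaging"
```

Majority voting damps single-run sampling noise; κ then reports agreement above what label prevalence alone would produce, which matters here because the three management categories are imbalanced (67.1 % of cases required additional imaging).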
• GPT-5 had the highest management classification accuracy (98.8 %) among 170 synthetic abdominal US cases.
• GPT-5 showed the highest modality agreement with experts (κ = 0.664), comparable to expert inter-observer agreement (κ = 0.731).
• GPT-5 justifications were high quality: 93.5 % scored 4–5 on a 5-point Likert scale.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,245 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,102 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,468 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,429 citations