OpenAlex · Updated hourly · Last updated: 14 Mar 2026, 00:48

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

From Radiology Findings to Artificial Intelligence (AI) Powered Impressions: A Retrospective Study on the Comparative Performance of Recent Large Language Models

2025 · 0 citations · Intelligent Medicine · Open Access

0 citations · 7 authors · 2025

Abstract

Large language models (LLMs), a breakthrough in artificial intelligence, can be leveraged to automatically generate the impression section of radiology reports, a task that usually requires time, effort, and training. Our objective was to evaluate the performance of five recent LLMs (GPT-4, GPT-4o mini, Gemini 1.5 – Pro, Gemini 1.5 – Flash, and Llama 3.1) on impression generation. In this retrospective study, 100 radiology reports were sampled from the publicly available “BioNLP 2023 report summarization” dataset (collected between 2001 and 2016; the training subset of 59,320 reports was used for sampling), sourced from PhysioNet: 20 reports from each of five groups defined by the character count of the findings section (0–400, 400–800, 800–1,200, 1,200–2,000, and 2,000–8,000). Each of the five LLMs was then zero-shot prompted to generate an impression from the findings of each sampled report. Generated impressions were evaluated (a) subjectively for coherence, comprehensiveness, conciseness, and medical harmfulness by two radiology fellows and a large reasoning model (LRM), Gemini 2.5 – Pro, and (b) objectively with a composite accuracy metric (ROUGE-1, BLEU, and cosine similarity) against the original impressions written by human experts. The LLMs were ranked by percentage agreement of the subjective scores and by the composite scores. Statistical tests (the Friedman test with post-hoc Nemenyi tests) were used to assess inter-model differences. The top-ranked models were Gemini 1.5 – Pro, GPT-4, and Gemini 1.5 – Flash. Performance varied across models for both human and LRM raters (Friedman test: human P < 1.82 × 10⁻⁶; LRM P < 9.10 × 10⁻⁴⁰). Composite accuracy scores were significantly higher for the top three models (0.69, 0.68, 0.68) than for the others (0.65; Nemenyi P < 1.11 × 10⁻¹⁶). The LRM aligned closely with the human raters (2.15% complete disagreement) and identified all impressions the humans rated inaccurate.
Gemini 1.5 – Pro outperformed GPT-4 on coherence, comprehensiveness, and medical harmfulness at lower cost. Human and LRM evaluations were generally consistent, though the LRM was more conservative.
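The abstract describes a composite accuracy metric built from ROUGE-1, BLEU, and cosine similarity. The paper's exact formulation and weighting are not given here, so the sketch below is illustrative only: it uses simplified, self-contained versions of the three sub-metrics (ROUGE-1 F1, unigram BLEU with a brevity penalty, and bag-of-words cosine similarity) and an assumed unweighted mean as the aggregation; the function names are hypothetical, not taken from the paper.

```python
# Minimal sketch of a composite accuracy metric in the spirit of the abstract:
# mean of ROUGE-1 F1, simplified unigram BLEU, and cosine similarity between a
# generated impression and the reference (human expert) impression.
# All names and the unweighted-mean aggregation are assumptions, not the
# paper's published method.
import math
from collections import Counter

def _tokens(text):
    return text.lower().split()

def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    c, r = Counter(_tokens(candidate)), Counter(_tokens(reference))
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def bleu1(candidate, reference):
    """Simplified BLEU: clipped unigram precision with a brevity penalty."""
    c, r = _tokens(candidate), _tokens(reference)
    if not c:
        return 0.0
    overlap = sum((Counter(c) & Counter(r)).values())
    precision = overlap / len(c)
    brevity = 1.0 if len(c) >= len(r) else math.exp(1 - len(r) / len(c))
    return brevity * precision

def cosine_sim(candidate, reference):
    """Cosine similarity of bag-of-words count vectors."""
    c, r = Counter(_tokens(candidate)), Counter(_tokens(reference))
    dot = sum(c[t] * r[t] for t in c)
    norm = (math.sqrt(sum(v * v for v in c.values()))
            * math.sqrt(sum(v * v for v in r.values())))
    return dot / norm if norm else 0.0

def composite_score(candidate, reference):
    """Unweighted mean of the three sub-metrics (assumed aggregation)."""
    return (rouge1_f1(candidate, reference)
            + bleu1(candidate, reference)
            + cosine_sim(candidate, reference)) / 3

reference = "no acute intracranial abnormality"
generated = "no acute intracranial abnormality identified"
print(round(composite_score(generated, reference), 3))
```

In practice one would use standard implementations (e.g. `scipy.stats.friedmanchisquare` for the Friedman test, and library versions of ROUGE/BLEU with sentence embeddings for cosine similarity) rather than these toy token-overlap versions; the sketch only shows the shape of the per-report scoring that feeds the model ranking.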
