This is an overview page with metadata for this scientific publication. The full article is available from the publisher.
From Radiology Findings to Artificial Intelligence (AI) Powered Impressions: A Retrospective Study on the Comparative Performance of Recent Large Language Models
Citations: 0
Authors: 7
Year: 2025
Abstract
Large language models (LLMs), a revolutionary breakthrough in Artificial Intelligence, can be leveraged to automatically generate the impressions for radiology reports, a task that usually requires time, effort, and training. Our objective was to evaluate the performance of five recent LLMs (GPT-4, GPT-4o mini, Gemini 1.5 – Pro, Gemini 1.5 – Flash, and Llama 3.1) for impression generation. In this retrospective study, 100 radiology reports were sampled (20 from each of the report groups 0–400, 400–800, 800–1,200, 1,200–2,000, and 2,000–8,000, based on the character count of the findings section) from the publicly available "BioNLP 2023 report summarization" dataset (collected between 2001–2016; the training subset of size 59,320 was considered for sampling), sourced from PhysioNet. Each of the five LLMs was then zero-shot prompted to generate impressions from the findings in the sample. Generated impressions were evaluated: (a) subjectively for coherence, comprehensiveness, conciseness, and medical harmfulness by two radiology fellows and a large reasoning model (LRM), Gemini 2.5 – Pro, and (b) objectively using a composite accuracy metric (ROUGE-1, BLEU, and Cosine Similarity) against the original human expert-generated impressions. The LLMs were ranked according to the percentage agreement ranking of subjective scores and composite scores. Statistical tests (Friedman test and post-hoc Nemenyi test) were used to assess inter-model differences. The top-ranked models were Gemini 1.5 – Pro, GPT-4, and Gemini 1.5 – Flash. Performance varied across models for both human and LRM raters (Friedman test: Human P < 1.82 × 10⁻⁶; LRM P < 9.10 × 10⁻⁴⁰). Composite accuracy scores were significantly higher for the top three models (0.69, 0.68, 0.68) versus the others (0.65; Nemenyi P < 1.11 × 10⁻¹⁶). The LRM aligned closely with human raters (2.15% complete disagreement) and identified all human-rated inaccurate impressions.
Gemini 1.5 – Pro outperformed GPT-4 in terms of coherence, comprehensiveness, and medical harmfulness, at lower cost. Human and LRM evaluations were generally consistent, though the LRM was more conservative.
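The abstract describes an objective composite accuracy metric combining ROUGE-1, BLEU, and Cosine Similarity against reference impressions. The paper does not specify the exact implementation or weighting, so the following is a minimal sketch assuming whitespace tokenization, ROUGE-1 F1, unigram BLEU with a brevity penalty, term-frequency cosine similarity, and an unweighted mean; all function names are illustrative.

```python
from collections import Counter
import math


def _tokens(text: str) -> list[str]:
    """Naive whitespace tokenization (an assumption; the paper's tokenizer is unspecified)."""
    return text.lower().split()


def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram overlap precision and recall."""
    ref, cand = Counter(_tokens(reference)), Counter(_tokens(candidate))
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def bleu1(reference: str, candidate: str) -> float:
    """Unigram BLEU: clipped precision times a brevity penalty for short candidates."""
    ref, cand = Counter(_tokens(reference)), Counter(_tokens(candidate))
    clipped = sum((ref & cand).values())
    if clipped == 0:
        return 0.0
    precision = clipped / sum(cand.values())
    bp = min(1.0, math.exp(1 - sum(ref.values()) / sum(cand.values())))
    return bp * precision


def cosine_sim(reference: str, candidate: str) -> float:
    """Cosine similarity between term-frequency vectors of the two texts."""
    ref, cand = Counter(_tokens(reference)), Counter(_tokens(candidate))
    dot = sum(ref[t] * cand[t] for t in ref)
    norm = (math.sqrt(sum(v * v for v in ref.values()))
            * math.sqrt(sum(v * v for v in cand.values())))
    return dot / norm if norm else 0.0


def composite_score(reference: str, candidate: str) -> float:
    """Unweighted mean of the three metrics (the weighting is an assumption)."""
    return (rouge1_f1(reference, candidate)
            + bleu1(reference, candidate)
            + cosine_sim(reference, candidate)) / 3
```

Scores fall in [0, 1]: an impression identical to the reference scores 1.0 on all three components, and one sharing no unigrams scores 0.0.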
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,231 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,084 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,444 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,423 citations