This is an overview page with metadata for this scientific work. The full article is available from the publisher.
A systematic evaluation of open-source large language models for automated extraction of cardiac MRI parameters from unstructured reports
Citations: 0
Authors: 8
Year: 2025
Abstract
Background: Cardiac magnetic resonance imaging (CMR) parameters are often stored in heterogeneous, unstructured clinical reports. Manual adjudication of these parameters is time-consuming and can require domain expertise. Recent open-source large language models (LLMs) have demonstrated impressive performance on language-task benchmarks. Moreover, they are cost-effective and readily adaptable to data privacy requirements and custom use cases. To date, no study has systematically evaluated the performance of state-of-the-art (SOTA) open-source LLMs in extracting cardiac parameters from real-world CMR reports.

Purpose: To investigate and compare the ability of several SOTA open-source LLMs to automatically and accurately extract key cardiac parameters from unstructured CMR reports.

Methods: We retrospectively collected 1108 CMR reports from a single academic institution. Seven open-source LLMs varying in parameter count (2 to 9 billion) and pretraining corpus (general vs. biomedical text) were evaluated: Gemma 2-2B, Gemma 2-9B, Llama 3.2-3B, Llama 3.1-8B, Qwen 2.5-7B, BioMistral-7B, and Meditron 3-8B. We chose LLMs with fewer than 10B parameters to align with the computational capacity of GPUs commonly used in research settings. Each model was prompted to extract the following 11 CMR parameters: cardiac output (CO), cardiac index (CI), left and right ventricular (LV/RV) ejection fraction (LVEF, RVEF), LV/RV end-systolic volume index (LVESVI, RVESVI), LV/RV stroke volume index (LVSVI, RVSVI), LV late gadolinium enhancement (LV LGE), LV LGE type (ischemic, non-ischemic, mixed), and T2 positivity. Results were pooled across three runs to account for LLM stochasticity. Concordance between model outputs and human expert-adjudicated values was computed.

Results: Gemma 2-9B achieved the highest concordance with human annotation on 6 of the 11 CMR parameters: CI (100%), CO (99%), LVEF (97%), RVEF (99%), LVSVI (97%), and RVSVI (99%) (Figure 1A). Qwen 2.5-7B performed best for LVESVI (98%), RVESVI (99%), and T2 positivity (95%). T2 positivity exhibited the highest average concordance across models (92.4%), while LV LGE type had the lowest (76.3%). Larger-parameter models (Gemma 2-9B, Llama 3.1-8B) consistently outperformed their smaller-parameter counterparts (Gemma 2-2B, Llama 3.2-3B) (Figure 1A). Surprisingly, the two medical-domain models, BioMistral-7B and Meditron 3-8B, generally performed worse than the non-medical LLMs. Indeed, BioMistral-7B was the worst-performing LLM for all 11 CMR parameters (Figure 1B). This suggests that medical (domain-specific) pretraining may negatively affect adjudication performance.

Conclusion: Open-source LLMs show promise for automated and accurate extraction of cardiac parameters from unstructured CMR reports. Larger-parameter, general-pretrained LLMs, rather than LLMs trained on biomedical data, provide more accurate adjudications.

Fig. 1. LLM CMR adjudication performance
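The abstract reports concordance pooled across three runs but does not say how concordance was defined or how the pooling was done. The sketch below shows one plausible reading, not the authors' method: concordance is taken as percent agreement between extracted and expert values, counted over every run-report pair. The function names (`values_match`, `pooled_concordance`), the `tolerance` parameter, and the example values are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence, Union

Value = Optional[Union[float, int, str, bool]]


def values_match(predicted: Value, reference: Value, tolerance: float = 0.0) -> bool:
    """Return True when an extracted value agrees with the expert value.

    Assumption: a missing value (None) only matches another missing value,
    numbers match within `tolerance`, and anything else is compared as
    case-insensitive text (e.g. "ischemic" vs "Ischemic").
    """
    if predicted is None or reference is None:
        return predicted is None and reference is None
    if isinstance(predicted, (int, float)) and isinstance(reference, (int, float)):
        return abs(predicted - reference) <= tolerance
    return str(predicted).strip().lower() == str(reference).strip().lower()


def pooled_concordance(
    runs: Sequence[Sequence[Value]],
    reference: Sequence[Value],
    tolerance: float = 0.0,
) -> float:
    """Percent agreement for one parameter, pooled over all runs and reports.

    Assumption: pooling means counting every (run, report) pair once,
    which is only one of several reasonable pooling schemes.
    """
    matches = 0
    total = 0
    for run in runs:
        if len(run) != len(reference):
            raise ValueError("each run must cover the same reports as the reference")
        for predicted, expert in zip(run, reference):
            matches += values_match(predicted, expert, tolerance)
            total += 1
    if total == 0:
        raise ValueError("no values to compare")
    return 100.0 * matches / total


# Made-up LVEF values for four reports (None = not stated in the report).
expert_lvef = [55, 60, None, 35]
model_runs = [
    [55, 60, None, 35],    # run 1: all four agree
    [55, 61, None, 35],    # run 2: one numeric disagreement
    [55, 60, None, None],  # run 3: one value missed by the model
]
lvef_concordance = pooled_concordance(model_runs, expert_lvef)
```

Under these assumptions, 10 of the 12 run-report pairs agree. An alternative scheme, such as taking a majority vote across runs before comparing with the expert value, would be equally consistent with the abstract's wording.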