OpenAlex · Updated hourly · Last updated: 25.03.2026, 16:30

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Generative Large Language Models Trained for Detecting Errors in Radiology Reports

2025 · 11 citations · 11 authors · Radiology
Open the full text at the publisher

Abstract

Background: Large language models (LLMs) offer promising solutions, yet their application in medical proofreading, particularly in detecting errors within radiology reports, remains underexplored.

Purpose: To develop and evaluate generative LLMs for detecting errors in radiology reports during medical proofreading.

Materials and Methods: In this retrospective study, a dataset was constructed with two parts. The first part included 1656 synthetic chest radiology reports generated by GPT-4 (OpenAI) using specified prompts: 828 error-free synthetic reports and 828 reports containing errors. The second part included 614 reports: 307 error-free reports from 2011 to 2016 from the MIMIC chest radiograph (MIMIC-CXR) database and 307 corresponding synthetic reports with errors generated by GPT-4 on the basis of these MIMIC-CXR reports and specified prompts. All errors were categorized into four types: negation, left/right, interval change, and transcription errors. Several models, including Llama-3 (Meta AI), GPT-4, and BiomedBERT, were then refined using zero-shot prompting, few-shot prompting, or fine-tuning strategies. Finally, the performance of these models was evaluated on the constructed dataset using F1 scores, 95% CIs, and paired-sample t tests, with the prediction results further assessed by radiologists.

Results: Using zero-shot prompting, the fine-tuned Llama-3-70B-Instruct model achieved the best performance, with the following F1 scores: 0.769 (95% CI: 0.757, 0.771) for negation errors, 0.772 (95% CI: 0.762, 0.780) for left/right errors, 0.750 (95% CI: 0.736, 0.763) for interval change errors, 0.828 (95% CI: 0.822, 0.832) for transcription errors, and 0.780 overall. In the real-world evaluation phase, two radiologists reviewed 200 randomly selected reports output by the model (50 for each error type). Of these, 99 were confirmed by both radiologists to contain errors detected by the models, and 163 were confirmed by at least one radiologist to contain model-detected errors.

Conclusion: Generative LLMs, fine-tuned on synthetic and MIMIC-CXR radiology reports, greatly enhanced error detection in radiology reports.

© RSNA, 2025. Supplemental material is available for this article. See also the editorial by Marrocchio and Sverzellati in this issue.
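The abstract reports per-error-type F1 scores together with 95% CIs. The paper's exact resampling procedure is not described here, so the following is only a minimal sketch of how such an interval is commonly obtained: a binary F1 over report-level labels (1 = report contains an error) combined with a percentile bootstrap. All function names and parameters are illustrative, not the authors' code.

```python
import random

def f1_score(y_true, y_pred):
    """Binary F1 = 2*TP / (2*TP + FP + FN) over report-level labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom

def bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for F1: resample reports with replacement,
    recompute F1 on each resample, and take the alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

A usage example: with ground truth `[1, 1, 0, 0]` and predictions `[1, 0, 0, 1]` there is one true positive, one false positive, and one false negative, so `f1_score` returns 0.5; `bootstrap_ci` then brackets that point estimate with a resampling-based interval, analogous in spirit to the CIs quoted in the Results.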


Topics

Radiology practices and education · Artificial Intelligence in Healthcare and Education · Radiomics and Machine Learning in Medical Imaging