This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
An Open-Source Large Language Model (LLM) Demonstrates Potential for Labeling Features From Radiology Reports
Citations: 0
Authors: 6
Year: 2025
Abstract
RATIONALE: Accurate text interpretation using traditional natural language processing (NLP) approaches has historically been a roadblock to incorporating text-based findings into clinical decision support tools. Advances in NLP, most notably large language models (LLMs), have unlocked the ability to identify clinically relevant results in unstructured text reports. Here we apply the open-source Llama-3 LLM to chest radiograph reports and evaluate its accuracy at labeling radiology text reports for features relevant to building pneumonia severity clinical decision support.

METHODS: Chest radiograph reports were obtained from patients with an ICD-10 primary diagnosis of pneumonia present on admission (J9-J18) who were admitted to a single center in New York City from January 2017 to September 2019. Llama-3 was run locally using the Ollama Python package to evaluate a random selection of 200 radiology reports. All prompts began with initial text setting the context for the LLM; these initial instructions were created with trial-and-error tuning. A Python script prompted Llama-3 with three questions for each radiology report: "Does the chest x-ray show a consolidation?"; "How many sides of the lungs show evidence of pneumonia?"; and "Does the chest x-ray show evidence of an atypical pneumonia?" The LLM was instructed to answer "Yes" or "No" for the first prompt; "No sides," "One side," or "Two sides" for the second prompt; and "Yes," "No," or "Maybe" for the third prompt. To evaluate the performance of the LLM, two clinicians reviewed the radiology reports and answered the same questions. Percent agreement and Cohen's Kappa were calculated between the clinicians. Any disagreements were arbitrated by a third clinician, and the resulting clinician labels served as the reference for comparison to Llama-3.
RESULTS: The percent agreement and Cohen's Kappa for the initial two clinician reviewers and between Llama-3 and the clinician reference are shown in Table 1.

CONCLUSIONS: Clinicians demonstrated moderate to good agreement at labeling features from radiology reports, while Llama-3 demonstrated moderate to poor agreement, depending on the feature. Radiology reports carry some ambiguity, and even clinicians could not achieve perfect agreement labeling features. In the future, further tuning of the LLM prompt and clarifying the questions may improve performance, and forthcoming model releases will also likely improve performance. Llama-3 has potential for labeling features from radiology reports for a variety of uses, including clinical decision support and predictive modeling.
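The two agreement statistics reported in Table 1 are standard and can be computed directly from two raters' label lists. A minimal sketch (the function names are illustrative, not from the paper):

```python
from collections import Counter


def percent_agreement(a: list[str], b: list[str]) -> float:
    """Fraction of items on which the two raters gave the same label."""
    assert len(a) == len(b) and a
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(a) == len(b) and a
    n = len(a)
    p_o = percent_agreement(a, b)                 # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_e = sum(ca[l] * cb[l] for l in labels) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0
```

For example, with labels `["Yes", "Yes", "No", "No"]` and `["Yes", "No", "No", "No"]`, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5.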
Related Works
New response evaluation criteria in solid tumours: Revised RECIST guideline (version 1.1)
2008 · 28,817 citations
TNM Classification of Malignant Tumours
1987 · 16,123 citations
A survey on deep learning in medical image analysis
2017 · 13,514 citations
Reduced Lung-Cancer Mortality with Low-Dose Computed Tomographic Screening
2011 · 10,745 citations
The American Joint Committee on Cancer: the 7th Edition of the AJCC Cancer Staging Manual and the Future of TNM
2010 · 9,103 citations