This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Data Extraction from Oncology Imaging Reports by Large Language Models: A Comparative Accuracy Study
Citations: 0
Authors: 14
Year: 2025
Abstract
Importance: Manual data extraction from clinical text is resource intensive. Locally hosted large language models (LLMs) may offer a privacy-preserving solution, but their performance on non-English data remains unclear.

Objective: To investigate whether the classification accuracy of locally hosted LLMs is non-inferior to human accuracy when determining metastasis status and treatment response from German radiology reports.

Design: In this retrospective comparative accuracy study, five locally hosted LLMs (llama3.3:70b, mistral-small:24b, qwq:32b, qwen3:32b, and gpt-oss:120b) were compared against humans. To calculate accuracy, a ground truth was established via duplicate human extraction and adjudication of discrepancies by a senior oncologist. Both the initial human extraction and the LLM outputs were compared against this ground truth.

Setting: The study was conducted at a tertiary referral hospital in Switzerland; data processing and analyses took place inside the hospital network.

Participants: 400 randomly sampled radiology reports (CT, MRI, PET) from adult cancer patients, generated between January 2023 and May 2025.

Exposures: Automated classification of metastasis status and treatment response by LLMs using a standardized prompt pipeline, compared with manual human review.

Main Outcomes and Measures: Primary outcomes were non-inferiority (5 percentage point [pp] margin) of LLM classification accuracy compared with human accuracy for metastasis status (presence/absence by anatomical site) and treatment response categories. Secondary outcomes included accuracy for primary tumor diagnosis, radiological absence of tumor, and extraction time per report.

Results: The analysis included 400 reports from 317 patients (mean age 63 years; 32% women). On the test set (n=300), human accuracy for metastasis status was 98.4% (95% CI, 98.0%–98.8%). All LLMs were non-inferior; gpt-oss:120b performed best (97.6% accuracy; difference, −0.8 pp [90% CI, −1.3 to −0.3 pp]). For response to treatment, human accuracy was 86.0% (95% CI, 83.2%–88.8%). All LLMs were inferior; the most accurate model, gpt-oss:120b, achieved 78.3% (difference, −7.7 pp [90% CI, −11.6 to −3.8 pp]). Mean human time per report was 120 seconds vs 11–63 seconds for LLMs.

Conclusions and Relevance: In this study, LLMs were non-inferior to humans for classification of metastasis status but inferior for assessment of response to treatment. gpt-oss:120b was the most accurate of the tested LLMs.

Study Registration: OSF: 45PVQ

Key Points

Question: Can locally hosted large language models (LLMs) match human performance when extracting sites of metastases and response to treatment from radiology reports of cancer patients?

Findings: In this preregistered, single-center study of 300 German radiology reports, all evaluated LLMs were non-inferior to humans in extracting the presence or absence of metastasis by organ site, but inferior in classifying response to treatment.

Meaning: LLMs can be suitable for classification of metastasis status, whereas more caution is warranted for more complex tasks where additional clinical reasoning may be required.
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,303 cit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,155 cit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,555 cit.
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 cit.
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,453 cit.