This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Data Extraction from Oncology Imaging Reports by Large Language Models: A Comparative Accuracy Study
Citations: 0
Authors: 14
Year: 2025
Abstract
Importance: Manual data extraction from clinical text is resource intensive. Locally hosted large language models (LLMs) may offer a privacy-preserving solution, but their performance on non-English data remains unclear.

Objective: To investigate whether the classification accuracy of locally hosted LLMs is non-inferior to human accuracy when determining metastasis status and treatment response from German radiology reports.

Design: In this retrospective comparative accuracy study, five locally hosted LLMs (llama3.3:70b, mistral-small:24b, qwq:32b, qwen3:32b, and gpt-oss:120b) were compared against humans. To calculate accuracy, a ground truth was established via duplicate human extraction and adjudication of discrepancies by a senior oncologist. Both the initial human extraction and the LLM outputs were compared against this ground truth.

Setting: The study was conducted at a tertiary referral hospital in Switzerland; data processing and analyses took place inside the hospital network.

Participants: 400 randomly sampled radiology reports (CT, MRI, PET) from adult cancer patients, generated between January 2023 and May 2025.

Exposures: Automated classification of metastasis status and treatment response by LLMs using a standardized prompt pipeline, compared with manual human review.

Main Outcomes and Measures: Primary outcomes were non-inferiority (5 percentage point [pp] margin) of LLM classification accuracy compared with human accuracy for metastasis status (presence/absence by anatomical site) and treatment response categories. Secondary outcomes included accuracy for primary tumor diagnosis, radiological absence of tumor, and extraction time per report.

Results: The analysis included 400 reports from 317 patients (mean age 63 years; 32% women). On the test set (n=300), human accuracy for metastasis status was 98.4% (95% CI, 98.0%–98.8%). All LLMs were non-inferior; gpt-oss:120b performed best (97.6% accuracy; difference, −0.8 pp [90% CI, −1.3 to −0.3 pp]). For response to treatment, human accuracy was 86.0% (95% CI, 83.2%–88.8%). All LLMs were inferior; the most accurate model, gpt-oss:120b, achieved 78.3% (difference, −7.7 pp [90% CI, −11.6 to −3.8 pp]). Mean human time per report was 120 seconds vs 11–63 seconds for LLMs.

Conclusions and Relevance: In this study, LLMs were non-inferior to humans for classification of metastasis status but inferior for assessment of response to treatment. gpt-oss:120b was the most accurate of the tested LLMs.

Study Registration: OSF: 45PVQ

Key Points

Question: Can locally hosted large language models (LLMs) match human performance when extracting sites of metastases and response to treatment from radiology reports of cancer patients?

Findings: In this preregistered, single-center study of 300 German radiology reports, all evaluated LLMs were non-inferior to humans in extracting the presence or absence of metastasis by organ site, but inferior in classifying response to treatment.

Meaning: LLMs can be suitable for classification of metastasis status, whereas more caution is warranted for more complex tasks where additional clinical reasoning may be required.
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,303 cit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,155 cit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,555 cit.
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 cit.
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,453 cit.