This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Large language models in radiologic numerical tasks: A thorough evaluation and error analysis
Citations: 0
Authors: 6
Year: 2025
Abstract
Purpose: To investigate the performance of large language models (LLMs) in radiology numerical tasks and to perform a comprehensive error analysis.

Materials and Methods: We defined six tasks: three extraction tasks, namely (1) the minimum T-score from a DEXA report, (2) the maximum common bile duct (CBD) diameter from an ultrasound report, and (3) the maximum lung nodule size from a CT report; and three judgement tasks, namely (1) the presence of a highly hypermetabolic region on a PET report, (2) whether a patient is osteoporotic based on a DEXA report, and (3) whether a patient has a dilated CBD based on an ultrasound report. Reports were extracted from the MIMIC III database and our institution's database, and the ground truths were extracted manually. The models evaluated were Llama 3.1 8b, DeepSeek R1 distilled Llama 8b, OpenAI o1-mini, and OpenAI GPT-5-mini. We manually reviewed all incorrect outputs and performed a comprehensive error analysis.

Results: In the extraction tasks, Llama showed relatively variable accuracy across tasks (ranging from 86% to 98.7%), while the other models performed consistently well (accuracies >95%). In the judgement tasks, the lowest accuracies of Llama, DeepSeek, o1-mini, and GPT-5-mini were 62.0%, 91.7%, 91.7%, and 99.0%, respectively; o1-mini and GPT-5-mini reached 100% accuracy in detecting osteoporosis. We found no mathematical errors in the outputs of o1-mini or GPT-5-mini. An answer-only output format significantly reduced the performance of Llama and DeepSeek, but not of o1-mini or GPT-5-mini.

Conclusion: True reasoning models perform consistently well in radiology numerical tasks and show no mathematical errors. Simpler models without true reasoning may also achieve acceptable performance, depending on the task.