This is an overview page with metadata for this scientific work. The full article is available from the publisher.
SAT-227 Evaluation of Large Language Models in Bone Age Assessment: Keep Your Radiologist
Citations: 0
Authors: 1
Year: 2025
Abstract
Abstract Disclosure: J. Tarkoff: None.

Introduction: The application of artificial intelligence (AI) in medicine has garnered significant attention, with numerous potential use cases. Recently, Elon Musk endorsed uploading medical images to Grok, his AI platform, suggesting its potential role in diagnostic imaging for both patients and clinicians. This study evaluates the performance of four large language models (LLMs)—Chat GPT-4o, Claude 3.5 Sonnet, O1-reasoning, and Grok 2—against a radiologist and an endocrinologist in assessing bone age radiographs.

Methods: Radiographs from consecutive patients presenting with growth or pubertal concerns between December 2024 and January 2025 were included. Bone age was independently assessed by an endocrinologist, a radiologist, and four LLMs. A standardized prompt was used for the LLMs, specifying patient sex and requesting bone age estimation. If an LLM's initial prediction deviated by >3 years from the endocrinologist's reading, it was prompted to reassess. Absolute errors were calculated for each method. Statistical analysis included paired t-tests comparing LLMs to the radiologist and ANOVA to assess differences across all methods.

Results: A total of 10 patients (mean age: 8.9 years; 50% female) were evaluated. The mean absolute errors (MAE) compared to the endocrinologist were as follows: Radiology: 0.38 years; Chat GPT-4o: 1.25 years; Grok 2: 1.40 years; O1-reasoning: 1.45 years; Claude 3.5 Sonnet: 2.18 years. All LLMs performed significantly worse than Radiology (p < 0.001). Significant differences were also observed among the methods (Radiology, Chat GPT-4o, Claude 3.5 Sonnet, O1-reasoning, Grok 2) (p < 0.001). While Chat GPT-4o exhibited the best performance among LLMs, it remained inferior to human experts.

Conclusions: Despite recent advancements in AI, LLMs currently lack the accuracy required for clinical bone age assessment compared to radiologists or endocrinologists. These findings underscore the need for further refinement of AI models before their safe integration into clinical workflows or patient use. Although platforms like Grok show promise, their application should remain exploratory until substantial improvements are achieved.

Presentation: Saturday, July 12, 2025
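The analysis described in the Methods section (absolute error against the endocrinologist's reference reading, a paired t-test per LLM versus the radiologist, and a one-way ANOVA across methods) can be sketched with SciPy. The arrays below are hypothetical illustrative values, not the study's data, and the variable names are my own:

```python
import numpy as np
from scipy import stats

def mae(pred, ref):
    """Mean absolute error of a method's readings against the reference standard."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(ref))))

# Hypothetical bone-age readings in years (n = 5 for brevity; NOT the study data).
endo  = np.array([8.0, 9.5, 7.0, 10.0, 8.5])   # endocrinologist (reference)
radio = np.array([8.2, 9.3, 7.4,  9.8, 8.9])   # radiologist
llm   = np.array([9.5, 8.0, 8.6, 11.2, 7.3])   # one LLM's estimates

# Per-patient absolute errors against the reference reading.
err_radio = np.abs(radio - endo)
err_llm   = np.abs(llm - endo)

# Paired t-test: does the LLM's absolute error differ from the radiologist's
# on the same patients?
t_stat, p_paired = stats.ttest_rel(err_llm, err_radio)

# One-way ANOVA across the methods' absolute-error distributions
# (the study included all four LLMs plus the radiologist here).
f_stat, p_anova = stats.f_oneway(err_radio, err_llm)

print(f"MAE radiologist: {mae(radio, endo):.2f} y")
print(f"MAE LLM:         {mae(llm, endo):.2f} y")
print(f"paired t-test p = {p_paired:.3f}, ANOVA p = {p_anova:.3f}")
```

With the study's n = 10 and the reported MAE gap (0.38 y vs 1.25–2.18 y), this design yields the p < 0.001 differences reported in the Results.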
Similar works
New response evaluation criteria in solid tumours: Revised RECIST guideline (version 1.1)
2008 · 28,906 citations
TNM Classification of Malignant Tumours
1987 · 16,123 citations
A survey on deep learning in medical image analysis
2017 · 13,591 citations
Reduced Lung-Cancer Mortality with Low-Dose Computed Tomographic Screening
2011 · 10,770 citations
The American Joint Committee on Cancer: the 7th Edition of the AJCC Cancer Staging Manual and the Future of TNM
2010 · 9,110 citations