This is an overview page with metadata for this scientific work. The full article is available from the publisher.
SAT-227 Evaluation of Large Language Models in Bone Age Assessment: Keep Your Radiologist
Citations: 0
Authors: 1
Year: 2025
Abstract
Abstract Disclosure: J. Tarkoff: None.

Introduction: The application of artificial intelligence (AI) in medicine has garnered significant attention, with numerous potential use cases. Recently, Elon Musk endorsed uploading medical images to Grok, his AI platform, suggesting its potential role in diagnostic imaging for both patients and clinicians. This study evaluates the performance of four large language models (LLMs)—Chat GPT-4o, Claude 3.5 Sonnet, O1-reasoning, and Grok 2—against a radiologist and an endocrinologist in assessing bone age radiographs.

Methods: Radiographs from consecutive patients presenting with growth or pubertal concerns between December 2024 and January 2025 were included. Bone age was independently assessed by an endocrinologist, a radiologist, and four LLMs. A standardized prompt was used for the LLMs, specifying patient sex and requesting bone age estimation. If an LLM's initial prediction deviated by >3 years from the endocrinologist's reading, it was prompted to reassess. Absolute errors were calculated for each method. Statistical analysis included paired t-tests comparing LLMs to the radiologist and ANOVA to assess differences across all methods.

Results: A total of 10 patients (mean age: 8.9 years; 50% female) were evaluated. The mean absolute errors (MAE) compared to the endocrinologist were as follows: Radiology: 0.38 years; Chat GPT-4o: 1.25 years; Grok 2: 1.40 years; O1-reasoning: 1.45 years; Claude 3.5 Sonnet: 2.18 years. All LLMs performed significantly worse than Radiology (p < 0.001). Significant differences were also observed among the methods (Radiology, Chat GPT-4o, Claude 3.5 Sonnet, O1-reasoning, Grok 2) (p < 0.001). While Chat GPT-4o exhibited the best performance among LLMs, it remained inferior to human experts.

Conclusions: Despite recent advancements in AI, LLMs currently lack the accuracy required for clinical bone age assessment compared to radiologists or endocrinologists. These findings underscore the need for further refinement of AI models before their safe integration into clinical workflows or patient use. Although platforms like Grok show promise, their application should remain exploratory until substantial improvements are achieved.

Presentation: Saturday, July 12, 2025
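The analysis described in the Methods section (absolute error against the endocrinologist's reference reading, a paired t-test per LLM versus the radiologist, and a one-way ANOVA across methods) can be sketched with SciPy. The arrays below are hypothetical illustrative values, not the study's data, and the variable names are my own:

```python
import numpy as np
from scipy import stats

def mae(pred, ref):
    """Mean absolute error of a method's readings against the reference standard."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(ref))))

# Hypothetical bone-age readings in years (n = 5 for brevity; NOT the study data).
endo  = np.array([8.0, 9.5, 7.0, 10.0, 8.5])   # endocrinologist (reference)
radio = np.array([8.2, 9.3, 7.4,  9.8, 8.9])   # radiologist
llm   = np.array([9.5, 8.0, 8.6, 11.2, 7.3])   # one LLM's estimates

# Per-patient absolute errors against the reference reading.
err_radio = np.abs(radio - endo)
err_llm   = np.abs(llm - endo)

# Paired t-test: does the LLM's absolute error differ from the radiologist's
# on the same patients?
t_stat, p_paired = stats.ttest_rel(err_llm, err_radio)

# One-way ANOVA across the methods' absolute-error distributions
# (the study included all four LLMs plus the radiologist here).
f_stat, p_anova = stats.f_oneway(err_radio, err_llm)

print(f"MAE radiologist: {mae(radio, endo):.2f} y")
print(f"MAE LLM:         {mae(llm, endo):.2f} y")
print(f"paired t-test p = {p_paired:.3f}, ANOVA p = {p_anova:.3f}")
```

With the study's n = 10 and the reported MAE gap (0.38 y vs 1.25–2.18 y), this design yields the p < 0.001 differences reported in the Results.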
Similar works
New response evaluation criteria in solid tumours: Revised RECIST guideline (version 1.1)
2008 · 28,906 citations
TNM Classification of Malignant Tumours
1987 · 16,123 citations
A survey on deep learning in medical image analysis
2017 · 13,591 citations
Reduced Lung-Cancer Mortality with Low-Dose Computed Tomographic Screening
2011 · 10,770 citations
The American Joint Committee on Cancer: the 7th Edition of the AJCC Cancer Staging Manual and the Future of TNM
2010 · 9,110 citations