This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Multimodal Large Language Model for Fracture Detection in Emergency Orthopedic Trauma: A Diagnostic Accuracy Study
Citations: 0
Authors: 5
Year: 2026
Abstract
<b>Background</b>: Rapid and accurate fracture detection is critical in emergency departments (EDs), where high patient volume and time pressure increase the risk of diagnostic error, particularly in radiographic interpretation. Multimodal large language models (LLMs) with image-recognition capability have recently emerged as general-purpose tools for clinical decision support, but their diagnostic performance within routine ED imaging workflows in orthopedic trauma remains unclear.
<b>Methods</b>: In this retrospective diagnostic accuracy study, we included 1136 consecutive patients referred from the ED to orthopedics between 1 January and 1 June 2025 at a single tertiary center. Given the single-center, retrospective design, the findings should be interpreted as hypothesis-generating and may not be fully generalizable to other institutions. Emergency radiographs and clinical data were processed by a multimodal LLM (2025 version) via an official API using a standardized, deterministic prompt. The model's outputs ("Fracture present", "No fracture", or "Uncertain") were compared with final diagnoses established by blinded orthopedic specialists, which served as the reference standard. Diagnostic agreement was analyzed using Cohen's kappa (κ), sensitivity, specificity, accuracy, and 95% confidence intervals (CIs). False-negative (FN) cases were defined as instances where the LLM reported "no acute fracture" but the specialist identified a fracture. The evaluated system is a general-purpose multimodal LLM and was not trained specifically on orthopedic radiographs.
<b>Results</b>: Overall, the LLM showed good diagnostic agreement with orthopedic specialists, with concordant results in 808 of 1136 patients (71.1%; 95% CI: 68.4-73.7%; κ = 0.634). The model achieved balanced performance, with a sensitivity of 76.9% and a specificity of 66.8%. The highest agreement was observed in knee trauma (91.7%), followed by wrist (78.8%) and hand (69.6%). False-negative cases accounted for 184 patients (16.2% of the total cohort), representing 32.4% of all LLM-negative assessments. Most FN fractures were non-displaced (82.6%), and 17.4% of FN cases required surgical treatment. Ankle and foot regions showed the highest FN rates (30.4% and 17.4%, respectively), reflecting the anatomical and radiographic complexity of these areas. Positive predictive value (PPV) and negative predictive value (NPV) were 69.4% and 74.5%, respectively, with likelihood ratios indicating moderate shifts in post-test probability.
<b>Conclusions</b>: In an ED-to-orthopedics consultation cohort reflecting routine clinical workflow, a multimodal LLM demonstrated moderate-to-good diagnostic agreement with orthopedic specialists, broadly within the range reported in prior fracture-detection AI studies. These comparisons are indirect, however, because model architectures, training strategies, datasets, and endpoints differ across studies. The model's limited ability to detect non-displaced fractures, especially in anatomically complex regions such as the ankle and foot, carries direct patient safety implications and confirms that specialist review remains indispensable. At present, such models may be explored as hypothesis-generating triage or decision-support tools, with mandatory specialist confirmation, rather than as standalone diagnostic systems. Prospective, multi-center studies using high-resolution imaging and anatomically optimized algorithms are needed before routine clinical adoption in emergency care.
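The headline figures in the abstract can be sanity-checked with a few lines of Python. The sketch below is illustrative only: the abstract does not say which interval method the authors used, so the Wilson score interval is an assumption, and the function names `wilson_interval` and `likelihood_ratios` are hypothetical helpers, not part of the study. The inputs (808 of 1136 concordant cases, sensitivity 76.9%, specificity 66.8%) are taken from the abstract.

```python
from statistics import NormalDist


def wilson_interval(successes: int, total: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed method, not confirmed by the source)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * (p * (1 - p) / total + z**2 / (4 * total**2)) ** 0.5 / denom
    return centre - half_width, centre + half_width


def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Positive and negative likelihood ratios from sensitivity and specificity."""
    if not (0 < specificity <= 1 and 0 <= sensitivity <= 1) or specificity == 1:
        raise ValueError("sensitivity must be in [0, 1] and specificity in (0, 1)")
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


# Values reported in the abstract.
ci_low, ci_high = wilson_interval(808, 1136)
lr_pos, lr_neg = likelihood_ratios(0.769, 0.668)

print(f"Agreement 95% CI: {ci_low:.1%} to {ci_high:.1%}")
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

Under these assumptions the interval comes out at roughly 68.4% to 73.7%, which matches the reported bounds and suggests that the CI describes the overall agreement proportion rather than κ. The likelihood ratios come out at about 2.3 (positive) and 0.35 (negative), consistent with the abstract's description of moderate shifts in post-test probability.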
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,245 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,102 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,468 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,429 citations