OpenAlex · Updated hourly · Last updated: 16.04.2026, 01:10

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Large Language Models in Spine Surgery: A Narrative Review of Performance Paradox and Clinical Integration Challenges

2026 · 0 citations · Journal of Korean Neurosurgical Society · Open Access

Citations: 0 · Authors: 6 · Year: 2026

Abstract

This review aims to provide a narrative synthesis of the performance paradox in large language model (LLM) applications for spine surgery, examining the disparity between technical metrics and clinical utility.

A narrative review was conducted of literature from January 2023 to February 2026 indexed in PubMed, EMBASE, and Google Scholar. Studies evaluating LLM applications in spine surgery were included, with emphasis on newer models (GPT-4o, GPT-5, Claude, Gemini variants, DeepSeek). Studies were analyzed thematically across three domains: clinical documentation, patient communication, and surgical decision-making.

Analysis of 42 studies revealed a consistent pattern across applications. LLMs performed strongly in structured documentation tasks, including CPT coding (AUROC 0.87) and surgical classification (91% accuracy), and improved the readability of patient-facing materials. In patient communication, LLM-generated responses achieved high satisfaction rates but demonstrated limited emotional intelligence. In contrast, decision-making performance was more variable: in small vignette-based comparisons, LLMs showed lower raw accuracy than attending spine surgeons in complex scenarios, and guideline concordance ranged from 33% to 88% across models. Emerging evidence with next-generation models suggests incremental gains, but procedure-level agreement remains limited (κ = 0.415 and 0.587 in minimally invasive spine surgery triage). Image-based tasks such as Cobb angle measurement remain particularly challenging: none of the tested models met the ≤10° clinical threshold.

LLMs show near-term utility in standardized text-based tasks, but current evidence does not support their autonomous use in complex clinical decision-making or image-based spinal assessment. Staged implementation with mandatory human oversight remains necessary.

Topics: Medical Imaging and Analysis; Artificial Intelligence in Healthcare and Education; Machine Learning in Healthcare