This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Exploratory study of large language models in surgical decision-making for lumbar disc herniation: a multicenter analysis based on multisource clinical information

2026 · 0 citations · BMC Medical Informatics and Decision Making · Open Access

Abstract

To explore the performance of large language models (LLMs) in surgical decision-making for lumbar disc herniation (LDH), and to evaluate the impact of radiology report text and manually summarized clinical information on model decision outputs.

A total of 48 LDH cases from multiple centers were included. Four mainstream LLMs (GPT-5, Gemini 2.5 Pro, DeepSeek-R1, and Grok-4) were used to perform a binary classification task (surgical vs. conservative treatment). Two input scenarios were designed: Group A used radiology report text only, while Group B incorporated additional manually summarized clinical information based on the same reports. Primary performance metrics included sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and F1 score. Cohen’s kappa was reported as a supplementary measure of agreement. Decision confidence was further analyzed using stratified analysis.

Using radiology report text alone, GPT-5 demonstrated relatively strong diagnostic performance, with a sensitivity of 0.92, specificity of 0.58, and accuracy of 0.75. After incorporating clinical information, its accuracy increased to 0.85, with improvements observed in specificity, PPV, NPV, and F1 score. Gemini and Grok also showed performance improvement following the addition of clinical information, whereas DeepSeek-R1 exhibited minimal change across input scenarios. McNemar’s test indicated that only Gemini showed a statistically significant difference between the two groups (P = 0.013). Confidence analysis showed that the inclusion of clinical information increased the coverage of high-confidence predictions in most models; however, the alignment between high-confidence outputs and actual clinical decisions varied across models.
This exploratory study suggests that adding clinical information, such as symptoms, disease duration, and prior treatment, to radiology report text may help some LLMs produce outputs that are more consistent with actual clinical decisions in LDH. However, the findings are limited by the small sample size, the quality of the input data, and the complexity of real clinical decision-making. Further validation in larger studies with more complete information is still needed.
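
The metrics named in the abstract can all be derived from a 2x2 confusion matrix, and McNemar's test from the counts of discordant case pairs. The following is a minimal sketch, not the authors' code: the function names are hypothetical, the exact (binomial) form of McNemar's test is an assumption because the abstract does not say which variant was used, and the example counts are illustrative values chosen only because they round to the sensitivity, specificity, and accuracy reported for GPT-5 with radiology text alone. The abstract does not state the actual split between surgical and conservative cases.

```python
from math import comb


def binary_metrics(tp, fp, fn, tn):
    """Metrics from a 2x2 confusion matrix (positive class = surgical treatment)."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # Chance agreement for Cohen's kappa, from the row and column marginals.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": accuracy,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "kappa": (accuracy - p_chance) / (1 - p_chance),
    }


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: cases classified correctly with input A only
    c: cases classified correctly with input B only
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial tail probability under H0 (discordant pairs split 50/50).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Illustrative counts only (assumed 24 surgical and 24 conservative cases).
example = binary_metrics(tp=22, fp=10, fn=2, tn=14)
for name, value in example.items():
    print(f"{name}: {value:.2f}")

# Hypothetical discordant-pair counts, unrelated to the reported P value.
print(f"McNemar exact p: {mcnemar_exact(b=1, c=9):.4f}")
```

With only 48 cases, each misclassified case shifts accuracy by about 0.02, which is one reason the abstract treats the results as exploratory.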

Topics

Medical Imaging and Analysis · Spine and Intervertebral Disc Pathology · Artificial Intelligence in Healthcare and Education
