This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Exploratory study of large language models in surgical decision-making for lumbar disc herniation: a multicenter analysis based on multisource clinical information
Citations: 0
Authors: 8
Year: 2026
Abstract
To explore the performance of large language models (LLMs) in surgical decision-making for lumbar disc herniation (LDH), and to evaluate the impact of radiology report text and manually summarized clinical information on model decision outputs.

A total of 48 LDH cases from multiple centers were included. Four mainstream LLMs (GPT-5, Gemini 2.5 Pro, DeepSeek-R1, and Grok-4) were used to perform a binary classification task (surgical vs. conservative treatment). Two input scenarios were designed: Group A used radiology report text only, while Group B incorporated additional manually summarized clinical information based on the same reports. Primary performance metrics included sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and F1 score. Cohen’s kappa was reported as a supplementary measure of agreement. Decision confidence was further analyzed using stratified analysis.

Using radiology report text alone, GPT-5 demonstrated relatively strong diagnostic performance, with a sensitivity of 0.92, specificity of 0.58, and accuracy of 0.75. After incorporating clinical information, its accuracy increased to 0.85, with improvements observed in specificity, PPV, NPV, and F1 score. Gemini and Grok also showed performance improvement following the addition of clinical information, whereas DeepSeek-R1 exhibited minimal change across input scenarios. McNemar’s test indicated that only Gemini showed a statistically significant difference between the two groups (P = 0.013). Confidence analysis showed that the inclusion of clinical information increased the coverage of high-confidence predictions in most models; however, the alignment between high-confidence outputs and actual clinical decisions varied across models.
This exploratory study suggests that adding clinical information, such as symptoms, disease duration, and prior treatment, to radiology report text may help some LLMs produce outputs that are more consistent with actual clinical decisions in LDH. However, the findings are limited by the small sample size, the quality of the input data, and the complexity of real clinical decision-making. Further validation in larger studies with more complete information is still needed.
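The metrics named in the abstract can be sketched from a 2x2 confusion matrix as follows. This is a minimal illustration, not the authors' analysis code. The function names `binary_metrics` and `mcnemar_exact` are hypothetical, and the counts in the usage note are assumed values chosen only because they round to the reported GPT-5 Group A figures; the abstract does not give the actual confusion matrix or the surgical/conservative split. The abstract also does not say which variant of McNemar's test was used, so the exact binomial form shown here is an assumption.

```python
from math import comb


def binary_metrics(tp, fp, fn, tn):
    """Metrics for a binary task where 'positive' means surgical treatment.

    Assumes every denominator is non-zero (both classes present and both
    labels predicted at least once).
    """
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # Chance agreement for Cohen's kappa, from the row and column marginals.
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": accuracy,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "kappa": (accuracy - p_e) / (1 - p_e),
    }


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value for paired predictions.

    b: cases correct under input A but wrong under input B
    c: cases wrong under input A but correct under input B
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical counts (assumed 24 surgical / 24 conservative cases).
    for name, value in binary_metrics(tp=22, fp=10, fn=2, tn=14).items():
        print(f"{name}: {value:.2f}")
    # Hypothetical discordant-pair counts, for illustration only.
    print(f"McNemar exact p: {mcnemar_exact(1, 9):.3f}")
```

Under the assumed counts, sensitivity is 22/24, specificity is 14/24, and accuracy is 36/48. McNemar's test uses only the discordant pairs, which is why two input scenarios can differ in accuracy without the difference reaching significance in a 48-case sample.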
Similar works
A survey on deep learning in medical image analysis
2017 · 13,880 citations
nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation
2020 · 8,032 citations
Calculation of average PSNR differences between RD-curves
2001 · 4,093 citations
Magnetic Resonance Classification of Lumbar Intervertebral Disc Degeneration
2001 · 3,927 citations
Vertebral fracture assessment using a semiquantitative technique
1993 · 3,627 citations
Authors
Institutions
- Hunan University of Traditional Chinese Medicine (CN)
- Luoyang Orthopedic-Traumatological Hospital of Henan Province (CN)
- First Affiliated Hospital of Hunan University of Traditional Chinese Medicine (CN)
- Second Affiliated Hospital of Hunan University of Traditional Chinese Medicine (CN)
- First Affiliated Hospital of Henan University (CN)
- Henan University of Traditional Chinese Medicine (CN)
- Artificial Intelligence in Medicine (Canada) (CA)