This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Benchmarking large language models against human experts in rehabilitation medicine: a multidimensional evaluation
Citations: 0 · Authors: 17 · Year: 2026
Abstract
Rehabilitation medicine faces a significant challenge: rising demand for services coupled with a shortage of specialized professionals. Large Language Models (LLMs) show promise for enhancing clinical efficiency, but their evaluation has been largely limited to simulated scenarios, lacking direct performance comparisons with human experts on complex, real-world clinical tasks. Our objective was to systematically benchmark five state-of-the-art LLMs against senior physiatrists in formulating comprehensive rehabilitation plans for authentic clinical cases, evaluating their utility as clinical decision support tools. We conducted a rigorous, blinded evaluation using 48 authentic cases across six subspecialties. Plans generated by five LLMs (Grok-4, Gemini-2.5-pro, ChatGPT-5-2025-08-07, Deepseek-r1-0528, and Claude-opus-4-20250514) were compared with expert-authored plans. A panel of 6 senior physiatrists evaluated the plans using a multi-dimensional framework covering four key domains: Clinical Applicability and Safety (primary safety endpoint), Scientific Rigor, Individualization, and Clarity. To address the data's hierarchical structure, we employed linear mixed-effects models (LMMs) with random intercepts for cases and raters, and fixed effects for models and language. Pairwise comparisons were adjusted using the Holm-Bonferroni correction. Quantitative analysis revealed that Grok-4 (mean 4.31) and Gemini-2.5-pro (mean 4.14) significantly outperformed the human benchmark derived from standardized expert solutions (mean 3.56; P < 0.001). Notably, the open-source Deepseek-r1 (mean 3.69) also achieved a statistically significant advantage over the experts (P < 0.001). Conversely, human experts scored numerically higher than Claude-opus-4 (mean 3.50), though this difference was not statistically significant (P = 0.099). Qualitative analysis further highlighted human experts' distinct strengths in strategic pathway design and humanistic care.
Top-tier LLMs demonstrate capability in generating high-quality, evidence-based plans, positioning them as effective “executors” for drafting preliminary regimens. We propose a human-AI collaboration paradigm where experts function as “strategists,” focusing on optimization and humanistic care to elevate rehabilitation service quality.
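The Holm-Bonferroni step-down adjustment mentioned in the abstract can be sketched as follows. This is a minimal illustration of the standard procedure, not the authors' analysis code, and the raw p-values in the example are hypothetical:

```python
def holm_bonferroni(pvals):
    """Holm step-down adjustment for multiple comparisons.

    Sort p-values ascending; multiply the k-th smallest (rank k, 0-indexed)
    by (m - k); enforce monotonicity over the sorted sequence; cap at 1.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

# Hypothetical raw p-values from four pairwise model-vs-expert contrasts:
print([round(p, 4) for p in holm_bonferroni([0.01, 0.04, 0.03, 0.005])])
# → [0.03, 0.06, 0.06, 0.02]
```

Because Holm is uniformly more powerful than plain Bonferroni while still controlling the family-wise error rate, it is a common choice for the kind of pairwise model comparisons reported here.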
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,339 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,211 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,614 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,478 citations
Authors
Institutions
- First Affiliated Hospital of University of South China (CN)
- University of South China (CN)
- Hong Kong Polytechnic University (HK)
- Nanjing Medical University (CN)
- Zhujiang Hospital (CN)
- Southern Medical University (CN)
- Sichuan University (CN)
- West China Hospital of Sichuan University (CN)
- First Affiliated Hospital of Zhengzhou University (CN)
- Suzhou Vocational Health College (CN)
- University of Electronic Science and Technology of China (CN)
- Shanghai Jiao Tong University (CN)