OpenAlex · Updated hourly · Last updated: 31.03.2026, 15:38

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Benchmarking large language models against human experts in rehabilitation medicine: a multidimensional evaluation

2026 · 0 citations · Journal of NeuroEngineering and Rehabilitation · Open Access
Open full text at publisher

0 citations · 17 authors · Year: 2026

Abstract

Rehabilitation medicine faces a significant challenge: rising demand for services coupled with a shortage of specialized professionals. Large Language Models (LLMs) show promise for enhancing clinical efficiency, but their evaluation has been largely limited to simulated scenarios, lacking direct performance comparisons with human experts on complex, real-world clinical tasks. We aimed to systematically benchmark five state-of-the-art LLMs against senior physiatrists in formulating comprehensive rehabilitation plans for authentic clinical cases, evaluating their utility as clinical decision support tools.

We conducted a rigorous, blinded evaluation using 48 authentic cases across six subspecialties. Plans generated by five LLMs (Grok-4, Gemini-2.5-pro, ChatGPT-5-2025-08-07, Deepseek-r1-0528, and Claude-opus-4-20250514) were compared with expert-authored plans. A panel of six senior physiatrists evaluated the plans using a multi-dimensional framework covering four key domains: Clinical Applicability and Safety (the primary safety endpoint), Scientific Rigor, Individualization, and Clarity. To address the hierarchical structure of the data, we employed linear mixed-effects models (LMMs) with random intercepts for cases and raters, and fixed effects for model and language. Pairwise comparisons were adjusted using the Holm-Bonferroni correction.

Quantitative analysis revealed that Grok-4 (mean 4.31) and Gemini-2.5-pro (mean 4.14) significantly outperformed the human benchmark derived from standardized expert solutions (mean 3.56; P < 0.001). Notably, the open-source Deepseek-r1 (mean 3.69) also achieved a statistically significant advantage over the experts (P < 0.001). Conversely, the human experts scored numerically higher than Claude-opus-4 (mean 3.50), though this difference was not statistically significant (P = 0.099). Qualitative analysis further highlighted the human experts’ distinct strengths in strategic pathway design and humanistic care.
Top-tier LLMs demonstrate the capability to generate high-quality, evidence-based plans, positioning them as effective “executors” for drafting preliminary regimens. We propose a human-AI collaboration paradigm in which experts function as “strategists,” focusing on optimization and humanistic care to elevate the quality of rehabilitation services.
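The analysis described above, a linear mixed-effects model with random intercepts followed by Holm-Bonferroni adjustment of pairwise comparisons, can be sketched in Python with statsmodels. This is an illustrative sketch on synthetic scores, not the authors' code: the data, the three-model subset, and the single random intercept for cases are assumptions (the paper's full model also includes a rater intercept and a language fixed effect).

```python
# Illustrative sketch (synthetic data, not the study's analysis code):
# an LMM with a random intercept per case, then Holm-Bonferroni correction
# of the pairwise model-vs-expert p-values.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
base_means = {"Grok-4": 4.3, "Gemini-2.5-pro": 4.1, "Expert": 3.6}  # assumed

rows = []
for case in range(48):                       # 48 cases, as in the study
    case_effect = rng.normal(0, 0.3)         # shared case-level intercept
    for model, base in base_means.items():
        rows.append({"case": case, "model": model,
                     "score": base + case_effect + rng.normal(0, 0.4)})
df = pd.DataFrame(rows)

# Fixed effect for model (expert plans as the reference level),
# random intercept grouped by case. A full analysis would add a crossed
# rater intercept, which statsmodels supports via variance components.
fit = smf.mixedlm("score ~ C(model, Treatment('Expert'))",
                  df, groups=df["case"]).fit()

# Holm-Bonferroni adjustment of the model-vs-expert contrasts.
pvals = fit.pvalues.filter(like="C(model").values
reject, p_adj, _, _ = multipletests(pvals, method="holm")
print(dict(zip(fit.pvalues.filter(like="C(model").index, p_adj)))
```

Grouping by case lets the model absorb case difficulty before comparing plan sources; the Holm step then controls the family-wise error rate across the pairwise contrasts, as in the paper.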
