This is an overview page with metadata for this scientific publication. The full article is available from the publisher.
Large language models in sports injury care: a comparative expert evaluation of GPT-4o and GPT-5
Citations: 0
Authors: 6
Year: 2026
Abstract
Large language models (LLMs) have shown increasing relevance in clinically supervised decision-support frameworks; however, their performance in orthopedic sports injury scenarios remains unclear. This study aimed to comparatively evaluate the diagnostic, treatment, and rehabilitation recommendations generated by GPT-4o and GPT-5 using standardized clinical scenarios assessed by orthopedic specialists. Fifteen sports injury–based clinical scenarios were developed and validated by orthopedic specialists with subspecialty expertise in sports traumatology. Each scenario was scored for clinical realism, adequacy of physical examination findings, and adequacy of radiological information using a 7-point Likert scale adapted from AGREE II domains. Both GPT-4o and GPT-5 were prompted using standardized zero-shot instructions, with each scenario submitted three times to assess internal consistency. Two blinded orthopedic specialists evaluated content-level consistency, and five independent orthopedic specialists scored the expert-rated clinical adequacy of AI-generated responses on a 0–5 scale. Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC) and Cohen’s kappa. Specialists rated the clinical scenarios favorably, with 69–72% agreement across domains and ICC values indicating good reliability for clinical realism (ICC = 0.725) and moderate reliability for physical examination (ICC = 0.634) and radiological adequacy (ICC = 0.512). GPT-4o produced consistent outputs in 93.3% of cases, with one scenario showing clinically relevant inconsistency (κ = 0.82). Comparative expert evaluation demonstrated significantly higher scores for GPT-5 (median = 4.60) than GPT-4o (median = 4.00) (p = 0.007). Inter-rater reliability for AI response scoring was high for both models (ICC = 0.888 for GPT-4o; ICC = 0.895 for GPT-5). 
GPT-4o and GPT-5 generated responses with generally high expert-rated clinical adequacy and strong consistency in standardized sports injury–related clinical scenarios, with GPT-5 achieving higher scores in expert evaluations. By providing a structured, specialty-specific expert assessment under controlled conditions, this study adds comparative insight into how contemporary large language models are perceived in orthopedic sports injury contexts, without implying objective diagnostic accuracy or autonomous clinical decision-making.
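The abstract reports Cohen's kappa (κ = 0.82) as the agreement measure between the two blinded specialists who rated content-level consistency. As a minimal illustration of how that statistic is defined, here is a pure-Python sketch of Cohen's kappa; the rating values below are invented for demonstration and are not the study's data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    assert len(rater_a) == len(rater_b) and rater_a, "need paired ratings"
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions,
    # summed over all categories.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical consistency ratings for four scenarios (labels invented):
a = ["consistent", "consistent", "inconsistent", "inconsistent"]
b = ["consistent", "consistent", "inconsistent", "consistent"]
print(cohens_kappa(a, b))  # → 0.5
```

Note that kappa discounts chance agreement, which is why it is preferred over raw percent agreement when rating categories are imbalanced, as in the 93.3% consistent cases reported here.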
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,391 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,257 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,685 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,501 citations