This is an overview page with metadata for this scientific publication. The full article is available from the publisher.
Large language models in sports injury care: a comparative expert evaluation of GPT-4o and GPT-5
Citations: 0
Authors: 6
Year: 2026
Abstract
Large language models (LLMs) have shown increasing relevance in clinically supervised decision-support frameworks; however, their performance in orthopedic sports injury scenarios remains unclear. This study aimed to comparatively evaluate the diagnostic, treatment, and rehabilitation recommendations generated by GPT-4o and GPT-5 using standardized clinical scenarios assessed by orthopedic specialists. Fifteen sports injury–based clinical scenarios were developed and validated by orthopedic specialists with subspecialty expertise in sports traumatology. Each scenario was scored for clinical realism, adequacy of physical examination findings, and adequacy of radiological information using a 7-point Likert scale adapted from AGREE II domains. Both GPT-4o and GPT-5 were prompted using standardized zero-shot instructions, with each scenario submitted three times to assess internal consistency. Two blinded orthopedic specialists evaluated content-level consistency, and five independent orthopedic specialists scored the expert-rated clinical adequacy of AI-generated responses on a 0–5 scale. Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC) and Cohen’s kappa. Specialists rated the clinical scenarios favorably, with 69–72% agreement across domains and ICC values indicating good reliability for clinical realism (ICC = 0.725) and moderate reliability for physical examination (ICC = 0.634) and radiological adequacy (ICC = 0.512). GPT-4o produced consistent outputs in 93.3% of cases, with one scenario showing clinically relevant inconsistency (κ = 0.82). Comparative expert evaluation demonstrated significantly higher scores for GPT-5 (median = 4.60) than GPT-4o (median = 4.00) (p = 0.007). Inter-rater reliability for AI response scoring was high for both models (ICC = 0.888 for GPT-4o; ICC = 0.895 for GPT-5). 
GPT-4o and GPT-5 generated responses with generally high expert-rated clinical adequacy and strong consistency in standardized sports injury–related clinical scenarios, with GPT-5 achieving higher scores in expert evaluations. By providing a structured, specialty-specific expert assessment under controlled conditions, this study adds comparative insight into how contemporary large language models are perceived in orthopedic sports injury contexts, without implying objective diagnostic accuracy or autonomous clinical decision-making.
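The abstract reports Cohen's kappa (κ = 0.82) as the agreement measure between the two blinded specialists who rated content-level consistency. As a minimal illustration of how that statistic is defined, here is a pure-Python sketch of Cohen's kappa; the rating values below are invented for demonstration and are not the study's data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    assert len(rater_a) == len(rater_b) and rater_a, "need paired ratings"
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions,
    # summed over all categories.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical consistency ratings for four scenarios (labels invented):
a = ["consistent", "consistent", "inconsistent", "inconsistent"]
b = ["consistent", "consistent", "inconsistent", "consistent"]
print(cohens_kappa(a, b))  # → 0.5
```

Note that kappa discounts chance agreement, which is why it is preferred over raw percent agreement when rating categories are imbalanced, as in the 93.3% consistent cases reported here.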
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,391 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,257 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,685 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,501 citations