OpenAlex · Updated hourly · Last updated: 11.04.2026, 23:03

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Can Artificial Intelligence Align with Evidence? Performance of ChatGPT-4o in Knee Osteoarthritis Surgical Guidelines

2026 · 0 citations · The Journal of Knee Surgery
Open full text at the publisher

0 citations · 6 authors · 2026

Abstract

Artificial intelligence large language models (LLMs) such as ChatGPT are increasingly used in clinical settings, yet their reliability in reproducing evidence-based recommendations remains uncertain. This study aimed to evaluate the performance of ChatGPT-4o in addressing clinical practice guideline (CPG) recommendations for the surgical management of knee osteoarthritis and total knee arthroplasty (TKA). An observational cross-sectional study was conducted in September 2025. Twenty recommendations from the most recent American Academy of Orthopaedic Surgeons CPG on TKA were translated into structured clinical questions and submitted to ChatGPT-4o. Each query was entered three times in independent sessions to evaluate textual consistency. Two independent reviewers with expertise in musculoskeletal physiotherapy and orthopedics appraised the chatbot's answers, classifying them according to the CPG framework ("should do," "could do," "do not do," "uncertain"). Agreement between reviewers and alignment with CPG recommendations were assessed using Cohen's and Fleiss' kappa coefficients. ChatGPT-4o achieved an overall concordance of 60% with the CPG recommendations, representing fair agreement (κ = 0.392, p = 0.005). Internal text consistency across repeated trials was low, with several responses showing unacceptable similarity levels (<50%). Inter-rater reliability ranged from moderate to perfect (κ = 0.547-0.946). Although ChatGPT-4o provided clinically acceptable answers in several domains, discrepancies persisted, particularly in recommendations regarding functional outcomes and rehabilitation strategies. ChatGPT-4o demonstrated moderate accuracy and heterogeneous reliability when reproducing CPG recommendations for TKA. While the model may serve as a supportive tool for education and patient communication, its variability and incomplete adherence to guidelines highlight the need for cautious integration and professional oversight in clinical decision-making.
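The inter-rater statistic the study relies on, Cohen's kappa, corrects raw agreement for the agreement two raters would reach by chance given their label marginals: κ = (p_o − p_e) / (1 − p_e). A minimal sketch of that computation follows; the reviewer labels are illustrative placeholders, not the study's data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    proportion of agreement and p_e is the chance agreement
    expected from each rater's label frequencies.
    """
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items labelled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if p_e == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical classifications against the CPG framework
# ("should do" / "could do" / "do not do"), for illustration only:
reviewer_1 = ["should do", "should do", "could do", "do not do"]
reviewer_2 = ["should do", "should do", "do not do", "do not do"]
print(round(cohens_kappa(reviewer_1, reviewer_2), 3))  # → 0.6
```

The study's reported κ = 0.392 between ChatGPT-4o and the guideline falls in the "fair" band of the conventional Landis-Koch interpretation, which is why 60% raw concordance still counts as only modest alignment once chance agreement is discounted.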


Topics

Artificial Intelligence in Healthcare and Education · Social Media in Health Education · Clinical Reasoning and Diagnostic Skills