This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Accuracy Is Not Enough: Reasoning and Reference Reliability in Orthopaedic Large Language Model (LLM) Applications

2026 · 1 citation · Cureus · Open Access

Abstract

BACKGROUND: Large language models (LLMs) now achieve performance comparable to senior postgraduate trainees on orthopaedic examinations and are increasingly trusted by clinicians to provide explanations for educational and decision-support purposes. However, correct answers do not necessarily indicate sound reasoning or reliable referencing. Current evaluations in this field emphasise accuracy alone, ignoring the quality and evidentiary reliability of the reasoning process.

AIM: This study aimed to systematically evaluate the relationship between answer accuracy, reasoning quality, and reference reliability in the latest generation of LLMs applied to a standardised postgraduate orthopaedic examination.

METHODS: The 2024 Orthopaedic In-Training Examination (OITE; 203 questions) was administered to GPT-5 (OpenAI, San Francisco, CA, USA). The model was prompted to provide one answer, a brief rationale, and one supporting reference per question. Accuracy and percentile were recorded relative to official American Academy of Orthopaedic Surgeons (AAOS) data. A structured subsample of 88 responses (44 correct, 44 incorrect) underwent detailed validation of referencing and reasoning. GPT-5's reasoning was compared against the official AAOS answer explanation for each question. Reasoning quality was scored using a three-point ordinal scale. References were categorised as fabricated, misrepresented, or accurate. Hallucination rates and reasoning scores were compared between correct and incorrect answers.

RESULTS: GPT-5 achieved 78.3% accuracy (159/203), exceeding the OITE pass threshold (67%) and the mean postgraduate year-5 (PGY-5) resident score (73%). This is, to our knowledge, the highest accuracy reported among peer-reviewed studies to date. In the subset of 88 responses, hallucinations occurred in 33% overall and were significantly more frequent in incorrect answers (50%) than in correct answers (15.9%; p=0.001). Reasoning among correct answers was consistently high (median 2.0, IQR 0.0), with 95.5% scoring maximum points, indicating reasoning entirely concordant with the official AAOS explanations. Image-based questions showed lower accuracy (44.7%) compared with text-based questions (54%), though this difference was not statistically significant (p=0.52).

CONCLUSIONS: GPT-5 appears to exceed previously reported LLM performance on the OITE and achieved accuracy higher than published mean scores for senior trainees, but demonstrated poor reference reliability, with one in three answers citing fabricated or misrepresented evidence. Even correct answers frequently relied on flawed or unverifiable sources. Evaluation of LLMs in medical education should incorporate systematic reasoning and evidence validation, not accuracy alone.
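The headline figures in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and rests on assumptions: the hallucination counts (22 of 44 incorrect answers, 7 of 44 correct answers) are back-calculated from the reported percentages rather than taken from the paper, and the abstract does not say which statistical test produced p=0.001, so an uncorrected Pearson chi-squared test is used here purely as a plausibility check. The helper name `chi2_2x2` is hypothetical.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for the
    2x2 table [[a, b], [c, d]]. Returns (statistic, p_value)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With one degree of freedom, the chi-squared survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Overall accuracy as reported: 159 correct out of 203 questions.
accuracy_pct = round(159 / 203 * 100, 1)

# ASSUMPTION: counts reconstructed from the reported percentages
# (50% and 15.9% of 44 responses each); they are not stated in the abstract.
n_per_group = 44
halluc_incorrect = round(0.50 * n_per_group)
halluc_correct = round(0.159 * n_per_group)

# Pooled hallucination rate across the 88 validated responses.
overall_pct = (halluc_incorrect + halluc_correct) / (2 * n_per_group) * 100

# ASSUMPTION: the paper's actual test is unknown; this is only a check
# that the reconstructed counts are compatible with a small p-value.
stat, p_value = chi2_2x2(
    halluc_incorrect, n_per_group - halluc_incorrect,
    halluc_correct, n_per_group - halluc_correct,
)

print(f"accuracy: {accuracy_pct}%")
print(f"hallucinations: {halluc_incorrect}/44 incorrect, {halluc_correct}/44 correct")
print(f"overall hallucination rate: {overall_pct:.1f}%")
print(f"chi-squared: {stat:.2f}, p: {p_value:.4f}")
```

Under these assumptions the reconstructed counts reproduce the reported 78.3% accuracy and the roughly one-in-three overall hallucination rate. A different test, such as Fisher's exact test or a continuity-corrected chi-squared, would give a somewhat different p-value.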

Topics

Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Machine Learning in Healthcare
