This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Accuracy Is Not Enough: Reasoning and Reference Reliability in Orthopaedic Large Language Model (LLM) Applications
1
Citations
2
Authors
2026
Year
Abstract
BACKGROUND: Large language models (LLMs) now achieve performance comparable to senior postgraduate trainees on orthopaedic examinations and are increasingly trusted by clinicians to provide explanations for educational and decision-support purposes. However, correct answers do not necessarily indicate sound reasoning or reliable referencing. Current evaluations in this field emphasise accuracy alone, ignoring the quality and evidentiary reliability of the reasoning process.
AIM: This study aimed to systematically evaluate the relationship between answer accuracy, reasoning quality, and reference reliability in the latest generation of LLMs applied to a standardised postgraduate orthopaedic examination.
METHODS: The 2024 Orthopaedic In-Training Examination (OITE; 203 questions) was administered to GPT-5 (OpenAI, San Francisco, CA, USA). The model was prompted to provide one answer, a brief rationale, and one supporting reference per question. Accuracy and percentile were recorded relative to official American Academy of Orthopaedic Surgeons (AAOS) data. A structured subsample of 88 responses (44 correct, 44 incorrect) underwent detailed validation of referencing and reasoning. GPT-5's reasoning was compared against official AAOS answer explanations for each question. Reasoning quality was scored using a three-point ordinal scale. References were categorised as fabricated, misrepresented, or accurate. Hallucination rates and reasoning scores were compared between correct and incorrect answers.
RESULTS: GPT-5 achieved 78.3% accuracy (159/203), exceeding the OITE pass threshold (67%) and the mean postgraduate year-5 (PGY-5) resident score (73%), the highest accuracy reported, to our knowledge, among peer-reviewed studies to date. In the subset of 88 responses, hallucinations occurred in 33% overall, significantly higher in incorrect (50%) than in correct answers (15.9%; p=0.001). Reasoning among correct answers was consistently high (median 2.0, IQR 0.0), with 95.5% scoring maximum points, indicating reasoning entirely concordant with the reasoning provided by AAOS. Image-based questions showed lower accuracy (44.7%) compared with text-based questions (54%), though this difference was not statistically significant (p=0.52).
CONCLUSIONS: GPT-5 appears to exceed previously reported LLM performance on the OITE and achieved accuracy higher than published mean scores for senior trainees, but demonstrated poor reference reliability, with one in three answers citing fabricated or misrepresented evidence. Even correct answers frequently relied on flawed or unverifiable sources. Evaluation of LLMs in medical education should incorporate systematic reasoning and evidence validation, not accuracy alone.
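The headline figures in the abstract can be cross-checked with a short calculation. The following is a minimal sketch, not the authors' analysis. The counts of 22 and 7 hallucinated responses are back-calculated from the reported 50% and 15.9% of 44 responses per group. The abstract does not say which statistical test produced p=0.001, so the two-sided Fisher exact test used here is an assumption. The function name `fisher_exact_two_sided` is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in row 1, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Figures taken from the abstract.
accuracy = 159 / 203          # overall accuracy on the 203 questions
group_size = 44               # 44 correct and 44 incorrect responses sampled

# Assumption: counts back-calculated from the reported percentages.
halluc_incorrect = 22         # 50% of 44
halluc_correct = 7            # 15.9% of 44

overall_rate = (halluc_incorrect + halluc_correct) / (2 * group_size)

# Assumption: Fisher exact test; the abstract does not name the test used.
p_value = fisher_exact_two_sided(
    halluc_incorrect, group_size - halluc_incorrect,
    halluc_correct, group_size - halluc_correct,
)

print(f"accuracy: {accuracy:.1%}")
print(f"overall hallucination rate: {overall_rate:.1%}")
print(f"Fisher exact p-value: {p_value:.4f}")
```

The printed p-value can be compared with the reported p=0.001. Because the original test is not named, exact agreement should not be expected.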
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,707 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,613 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,159 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,875 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations