This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Performance of GPT-5 and Gemini 2.5 Pro on the Orthopaedic In-Training Examination
Citations: 0
Authors: 8
Year: 2026
Abstract
Background: Previous studies evaluating large language models (LLMs) on the Orthopaedic In-Training Examination (OITE) have primarily focused on earlier-generation models and single-pass accuracy. These investigations did not assess newer multimodal systems such as GPT-5 and Gemini 2.5 Pro, nor did they examine the reasoning quality underlying model responses or the consistency of outputs across repeated trials. As LLMs are increasingly used as educational tools, a more comprehensive evaluation framework is needed to assess not only correctness but also reliability and explanatory validity on specialty-specific, image-rich examinations.

Methods: We conducted a controlled, parallel evaluation of GPT-5 and Gemini 2.5 Pro using 412 OITE-style questions from the 2023-2024 examination cycle, obtained via an institutional AAOS ResStudy subscription. Primary outcomes included overall and subspecialty-specific accuracy. Secondary analyses evaluated explanatory quality, error-pattern classification, response consistency across repeated trials, and performance stratified by imaging burden. Paired accuracy was compared using McNemar's exact test.

Results: Gemini 2.5 Pro demonstrated higher overall accuracy than GPT-5 on the 2023-2024 OITE question set (81.1% vs 76.0%), with both models exceeding published PGY-5 resident benchmarks. Accuracy declined significantly on image-containing questions (74.2% for Gemini 2.5 Pro vs 71.6% for GPT-5). Subspecialty performance varied widely, with accuracy ranging from 42.9% to 94.1% for GPT-5 and from 57.1% to 95.8% for Gemini 2.5 Pro; both models performed poorest on Hand and Wrist questions. Among incorrect responses, faulty reasoning accounted for 52.5% of GPT-5 errors, whereas stem misinterpretation was the predominant error type for Gemini 2.5 Pro (43.6%). Incorrect or partially correct explanations accompanied 45.4% of GPT-5 and 41.7% of Gemini 2.5 Pro responses. Consistency testing showed high reproducibility (fully consistent responses: 88% for GPT-5 and 84% for Gemini 2.5 Pro), with all inconsistent outputs occurring on image-containing questions.

Conclusions: GPT-5 and Gemini 2.5 Pro demonstrate strong performance on recent OITE content, exceeding prior LLM benchmarks. However, persistent limitations in multimodal reasoning, explanatory reliability, and response consistency indicate that high accuracy alone does not ensure dependable clinical reasoning, underscoring the need for cautious educational use.
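The Methods state that paired accuracy was compared with McNemar's exact test. As a minimal illustrative sketch (not from the paper), the Python snippet below shows how such a paired comparison could be computed with statsmodels; the per-question correctness data are simulated placeholders, not the study's actual results.

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_questions = 412  # size of the 2023-2024 OITE-style question set

# Hypothetical per-question correctness (True = answered correctly);
# a real analysis would use each model's graded responses instead.
gpt5_correct = rng.random(n_questions) < 0.760
gemini_correct = rng.random(n_questions) < 0.811

# 2x2 paired-outcome table; McNemar's test uses only the discordant
# cells (questions where one model is right and the other is wrong).
table = np.array([
    [np.sum(gpt5_correct & gemini_correct), np.sum(gpt5_correct & ~gemini_correct)],
    [np.sum(~gpt5_correct & gemini_correct), np.sum(~gpt5_correct & ~gemini_correct)],
])

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"discordant pairs: {table[0, 1]} vs {table[1, 0]}, p = {result.pvalue:.4f}")

McNemar's test is the appropriate choice here because both models answered the same 412 questions, so their correctness indicators are paired observations rather than independent samples.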
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,652 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,567 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,083 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,856 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations