OpenAlex · Updated hourly · Last updated: April 2, 2026, 06:09

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

A Comparative Evaluation of GPT-4 Turbo and Gemini-Pro in Medical Licensing Exams: Enhancing Artificial Intelligence's Role in Medical Education

2026 · 0 citations · Cureus · Open Access
Open full text at the publisher

0 Citations · 8 Authors · Year: 2026

Abstract

Background and objective: Large language models (LLMs) are increasingly being explored as adjuncts to medical education; however, comparative data on the performance and error patterns of newer models on standardized licensing-style questions remain limited. This study evaluated two advanced LLMs, Gemini-Pro and GPT-4 Turbo, on National Board of Medical Examiners (NBME) Step 1-style multiple-choice questions to assess accuracy, reasoning quality, and common failure modes relevant to exam preparation and clinical reasoning training.

Methods: A total of 112 NBME Step 1 questions were collected; seven image- or table-dependent items were excluded, yielding 105 text-only questions. Prompts were standardized to include the clinical stem, query, and answer choices, and were submitted via Python API to Gemini-Pro and GPT-4 Turbo. Outputs were independently adjudicated by two third-year medical students and one board-certified physician using binary accuracy scoring and a structured evaluation of reasoning features (logical reasoning, internal information use, and external knowledge application). Incorrect responses were categorized as logical, informational, or statistical errors. Comparative analyses included raw accuracy calculations and chi-square testing of reasoning-feature distributions.

Results: GPT-4 Turbo achieved 90.99% accuracy on the January 2024 NBME Step 1 question set, substantially outperforming Gemini-Pro (54.46%). GPT-4 Turbo made fewer errors overall, with lower logical (16%), informational (4%), and statistical (4%) error rates than the older baselines reported in the study (e.g., 42% logical errors for GPT-3.5). GPT-4 Turbo incorporated external information in 76% of correct responses versus 25% for Gemini-Pro, and the differences in performance metrics between the two models were statistically significant (p < 0.05).

Conclusions: GPT-4 Turbo markedly outperformed Gemini-Pro on text-based NBME Step 1 questions, showing higher accuracy, more consistent reasoning, and fewer logical and informational failures. These findings support GPT-4 Turbo's potential as a high-yield supplementary tool for Step 1-style learning and feedback, while underscoring the need for continued refinement and for cautious, supervised integration of LLMs into medical education, given persistent (though reduced) error rates.
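The Methods describe standardized prompts (clinical stem, query, and answer choices) submitted to both models via Python API. The study's actual code is not published here; the following is a minimal sketch of what such a harness could look like, assuming the standard OpenAI and Google Generative AI Python SDKs. The model names, prompt template, and helper functions are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical harness for submitting one standardized NBME-style question
# to both models. Prompt wording and helpers are illustrative only.
from openai import OpenAI
import google.generativeai as genai

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
genai.configure(api_key="YOUR_GOOGLE_API_KEY")  # placeholder key
gemini = genai.GenerativeModel("gemini-pro")

PROMPT_TEMPLATE = (
    "Clinical stem:\n{stem}\n\n"
    "Question: {query}\n\n"
    "Answer choices:\n{choices}\n\n"
    "Select the single best answer and explain your reasoning."
)

def ask_gpt4_turbo(prompt: str) -> str:
    """Submit one standardized question to GPT-4 Turbo."""
    response = openai_client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def ask_gemini_pro(prompt: str) -> str:
    """Submit the same standardized question to Gemini-Pro."""
    return gemini.generate_content(prompt).text
```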
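The comparative analysis rests on chi-square testing of reasoning-feature distributions. Below is a hedged sketch of that test's mechanics using SciPy's chi2_contingency; the 2×2 counts are hypothetical placeholders chosen only to echo the reported external-information rates (76% vs. 25% of correct responses), not the study's actual data.

```python
# Illustrative chi-square test on a 2x2 contingency table.
# Counts are placeholders, NOT the study's data.
from scipy.stats import chi2_contingency

# Rows: GPT-4 Turbo, Gemini-Pro
# Columns: used external information, did not
table = [[76, 24],
         [25, 75]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4g}, dof = {dof}")
```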

Similar works