OpenAlex · Updated hourly · Last updated: 2026-03-20, 16:22

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Evaluating the Performance of Large Language Models on a Neurology Board-Style Examination

2023 · 7 citations · 3 authors · Open Access
Open full text at the publisher


Abstract

Background and Objectives: Recent advances in large language models (LLMs) such as GPT-3.5 and GPT-4 have shown impressive potential across a wide array of applications, including healthcare. While GPT-3.5 and GPT-4 have shown heterogeneous results on specialized medical board examinations, their performance on neurology board exams remains unexplored.

Methods: An exploratory, prospective study was conducted between May 17 and May 31, 2023. The evaluation used a question bank approved by the American Board of Psychiatry and Neurology, designed as part of a self-assessment program. Questions were presented in a single-best-answer, multiple-choice format. The question-bank results were validated with a small cohort of questions from the European Board for Neurology. All questions were categorized as lower-order (recall, understanding) or higher-order (apply, analyze, synthesize) questions. The performance of GPT-3.5 and GPT-4 was assessed with respect to overall performance, question type, and topic. In addition, the confidence level of responses and the reproducibility of correctly and incorrectly answered questions were evaluated. Univariable analysis was carried out. A chi-squared test with Bonferroni correction was used to determine performance differences based on question characteristics. To differentiate characteristics of correctly and incorrectly answered questions, a t-SNE analysis of the high-dimensional question representations was performed.

Results: In May 2023, GPT-3.5 correctly answered 66.8% of 1,956 questions, whereas GPT-4 performed at a higher level, correctly answering 85% of questions, consistent with near-passing to passing performance on the neurology board exam. GPT-4 surpassed both GPT-3.5 and question-bank users (mean human user score: 73.8%). An analysis of twenty-six question categories showed that GPT-4 outperformed human users on Behavioral, Cognitive, and Psych-related questions and outperformed GPT-3.5 in six categories. Both models performed better on lower-order than on higher-order questions according to Bloom's taxonomy for learning and assessment (GPT-4: 790 of 893 (88.5%) vs. 872 of 1063 (82%); GPT-3.5: 639 of 893 (71.6%) vs. 667 of 1063 (62.7%)), with GPT-4 outperforming GPT-3.5 on both question types. Both models consistently used confident language, even when incorrect (GPT-4: 292 of 294 incorrect answers, 99.3%; GPT-3.5: 650 of 650 incorrect answers, 100%). Reproducible answers (defined as more than 75% identical output across 50 independent queries) were associated with a higher percentage of correct answers (GPT-3.5: 66 of 88 (75%); GPT-4: 78 of 96 (81.3%)) than inconsistent answers (GPT-3.5: 5 of 13 (38.5%); GPT-4: 1 of 4 (25%)). Lastly, the high-dimensional embedding analysis of correctly and incorrectly answered questions revealed no clear separation into distinct clusters.

Discussion: Despite the absence of neurology-specific training, GPT-4 demonstrated commendable performance, whereas GPT-3.5 performed slightly below the average human question-bank user. Higher-order cognitive tasks proved more challenging for both GPT-4 and GPT-3.5. Nonetheless, GPT-4's performance was equivalent to a passing grade on a specialized neurology board exam. These findings suggest that, with further refinement, LLMs like GPT-4 could play a pivotal role in clinical neurology and healthcare applications in general.
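
To make the lower-order vs. higher-order comparison concrete, the sketch below runs a chi-squared test on the counts reported in the abstract, with a simple Bonferroni adjustment across the two models. The 2×2 contingency layout and the choice of two comparisons are assumptions for illustration; the paper's exact test setup is not given on this page.

```python
# Sketch: chi-squared test of accuracy by question order (Bloom's taxonomy),
# using the counts reported in the abstract. The contingency layout is an
# assumption, not the authors' exact analysis.
from scipy.stats import chi2_contingency

# rows: lower-order, higher-order; columns: correct, incorrect
counts = {
    "GPT-4":   [[790, 893 - 790], [872, 1063 - 872]],
    "GPT-3.5": [[639, 893 - 639], [667, 1063 - 667]],
}

alpha = 0.05
n_comparisons = len(counts)  # Bonferroni: divide alpha by the number of tests
for model, table in counts.items():
    chi2, p, dof, _ = chi2_contingency(table)
    significant = p < alpha / n_comparisons
    print(f"{model}: chi2={chi2:.2f}, p={p:.4f}, "
          f"significant after Bonferroni: {significant}")
```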
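
The reproducibility criterion (more than 75% identical output across 50 independent queries) can likewise be written as a short helper. This is a minimal sketch assuming each query returns a single answer choice as a string; it is not the authors' implementation.

```python
from collections import Counter

def is_reproducible(answers: list[str], threshold: float = 0.75) -> bool:
    """True if the modal answer occurs in more than `threshold` of the queries.

    Mirrors the abstract's definition: >75% identical output across
    50 independent queries. Assumes each query yields one answer choice.
    """
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / len(answers) > threshold

# Example: 40 of 50 queries return "B" -> 80% > 75% -> reproducible
print(is_reproducible(["B"] * 40 + ["C"] * 10))  # True
```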

Similar works