This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluating the Performance of Large Language Models on a Neurology Board-Style Examination
Citations: 7
Authors: 3
Year: 2023
Abstract
Background and Objectives
Recent advancements in large language models (LLMs) such as GPT-3.5 and GPT-4 have shown impressive potential in a wide array of applications, including healthcare. While GPT-3.5 and GPT-4 showed heterogeneous results across specialized medical board examinations, their performance on neurology board exams remains unexplored.
Methods
An exploratory, prospective study was conducted between May 17 and May 31, 2023. The evaluation used a question bank approved by the American Board of Psychiatry and Neurology, designed as part of a self-assessment program. Questions were presented in a single-best-answer, multiple-choice format. The results from the question bank were validated against a small question cohort from the European Board of Neurology. All questions were categorized as lower-order (recall, understanding) or higher-order (apply, analyze, synthesize) questions. The performance of GPT-3.5 and GPT-4 was assessed with respect to overall performance, question type, and topic. In addition, the confidence level of responses and the reproducibility of correctly and incorrectly answered questions were evaluated. Univariable analysis was carried out; chi-squared tests with Bonferroni correction were used to determine performance differences based on question characteristics. To differentiate characteristics of correctly and incorrectly answered questions, a t-SNE analysis of the high-dimensional question representations was performed.
Results
In May 2023, GPT-3.5 correctly answered 66.8% of 1,956 questions, whereas GPT-4 demonstrated a higher performance level, correctly answering 85% of questions, in line with near-passing and passing thresholds of the neurology board exam. GPT-4's performance surpassed both GPT-3.5 and question bank users (mean human user score: 73.8%).
An analysis of 26 question categories showed that GPT-4 outperformed human users on behavioral, cognitive, and psychiatry-related questions and demonstrated superior performance to GPT-3.5 in six categories. Both models performed better on lower-order than on higher-order questions according to Bloom's taxonomy for learning and assessment (GPT-4: 790 of 893 (88.5%) vs. 872 of 1,063 (82.0%); GPT-3.5: 639 of 893 (71.6%) vs. 667 of 1,063 (62.7%)), with GPT-4 also excelling on both lower-order and higher-order questions. Both models consistently used confident language, even when incorrect (GPT-4: 99.3%, 292 of 294 incorrect answers; GPT-3.5: 100%, 650 of 650 incorrect answers). Reproducible answers from GPT-3.5 and GPT-4 (defined as the same output in more than 75% of 50 independent queries) were associated with a higher percentage of correct answers (GPT-3.5: 66 of 88 (75.0%); GPT-4: 78 of 96 (81.3%)) than inconsistent answers (GPT-3.5: 5 of 13 (38.5%); GPT-4: 1 of 4 (25.0%)). Lastly, the high-dimensional embedding analysis of correctly and incorrectly answered questions revealed no clear separation into distinct clusters.
Discussion
Despite the absence of neurology-specific training, GPT-4 demonstrated commendable performance, whereas GPT-3.5 performed slightly below the average human question bank user. Higher-order cognitive tasks proved more challenging for both GPT-4 and GPT-3.5. Notwithstanding, GPT-4's performance was equivalent to a passing grade on specialized neurology board exams. These findings suggest that, with further refinement, LLMs like GPT-4 could play a pivotal role in applications for clinical neurology and healthcare in general.
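As an illustration only (not the authors' actual analysis code), the question-type comparison described in the Methods section can be sketched in Python: a chi-squared test on a 2x2 correct/incorrect contingency table per model, with a Bonferroni correction across the two tests. The counts are taken directly from the abstract; SciPy is assumed to be available.

```python
from scipy.stats import chi2_contingency

# Correct/incorrect counts per Bloom level, as reported in the abstract.
# Rows: lower-order, higher-order; columns: correct, incorrect.
counts = {
    "GPT-4":   [[790, 893 - 790], [872, 1063 - 872]],
    "GPT-3.5": [[639, 893 - 639], [667, 1063 - 667]],
}

n_tests = len(counts)  # Bonferroni: scale each p-value by the number of tests
for model, table in counts.items():
    chi2, p, _, _ = chi2_contingency(table)
    p_adj = min(1.0, p * n_tests)
    print(f"{model}: chi2={chi2:.1f}, Bonferroni-adjusted p={p_adj:.1e}")
```

For both models the adjusted p-value falls well below 0.05, consistent with the abstract's finding that lower-order questions were answered correctly more often than higher-order ones.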
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,260 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,116 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,493 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,438 citations