This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Benchmarking Large Language Models on the Taiwan Neurology Board Examinations (2018–2024): A Comparative Evaluation of GPT-4o, GPT-o1, DeepSeek-V3, and DeepSeek-R1

2026 · 0 citations · Bioengineering · Open Access

Abstract

Background and Purpose: Neurology requires integration of clinical reasoning, imaging interpretation, and current knowledge, making it an ideal field for evaluating large language models (LLMs).

Methods: Using 1715 questions from the Taiwan Neurology Board Examination (2018–2024), we assessed four LLMs (GPT-4o, GPT-o1, DeepSeek-V3, and DeepSeek-R1) across four formats: single-choice, multiple-choice, true–false, and image-based items.

Results: GPT-o1 achieved the highest overall accuracy (83.86%) and demonstrated strong performance on cognitively demanding tasks (82.50% on true–false; 77.26% on image-based). DeepSeek-V3 scored lowest (65.62%) and showed the greatest variability. Statistical analyses confirmed significant inter-model differences (p < 0.01). Accuracy declined across all models in 2024, coinciding with shifts in question design. DeepSeek-R1 was further penalized by alignment-based refusals, resulting in up to 3.81% score loss.

Conclusions: These results position the Taiwan Neurology Board Exam as a rigorous benchmark for LLM evaluation and underscore GPT-o1’s potential utility in neurology education and decision support.
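The abstract reports accuracy per question format and a score loss attributed to refusals, but does not describe the scoring procedure. As a minimal sketch only, the following shows one way such figures could be computed, assuming that a refusal is graded as an incorrect answer. The record layout, the function names `accuracy_by_format` and `refusal_loss`, and the sample records are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical graded records: (question_format, model_answer, correct_answer).
# A model_answer of None stands for a refusal. This is illustrative data,
# not data from the paper.
RECORDS = [
    ("single-choice", "A", "A"),
    ("single-choice", "C", "B"),
    ("true-false", "T", "T"),
    ("true-false", None, "F"),
    ("image-based", "D", "D"),
]


def accuracy_by_format(records):
    """Return {format: percent correct}, counting refusals as incorrect (assumption)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for fmt, given, expected in records:
        total[fmt] += 1
        if given is not None and given == expected:
            correct[fmt] += 1
    return {fmt: 100.0 * correct[fmt] / total[fmt] for fmt in total}


def refusal_loss(records):
    """Return the percentage of all questions that were refused.

    Under the assumption above, this is an upper bound on the score lost
    to refusals, since a refused question might have been answered wrongly anyway.
    """
    refused = sum(1 for _, given, _ in records if given is None)
    return 100.0 * refused / len(records)


if __name__ == "__main__":
    print(accuracy_by_format(RECORDS))
    print(refusal_loss(RECORDS))
```

Keeping refusals as a separate `None` value, rather than folding them into wrong answers at input time, lets the same records yield both the accuracy and the refusal share.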

Topics

Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Radiology practices and education
