This is an overview page with metadata for this scientific work. The full article is available from the publisher.
When AI models take the exam: large language models vs medical students on multiple-choice course exams
Citations: 1 · Authors: 10 · Year: 2025
Abstract
Large language models (LLMs) are increasingly used in healthcare and medical education, but their performance on institution-authored multiple-choice questions (MCQs), particularly under negative marking, remains unclear. Our objective was to compare the examination performance of five contemporary LLMs with that of enrolled medical students on final multiple-choice (MCQ-style) course exams across four clinical courses. We conducted a comparative cross-sectional study at Miguel Hernández University (Spain) in 2025. Final exams in Infectious Diseases, Neurology, Respiratory Medicine, and Cardiovascular Medicine were administered under routine conditions in Spanish. Five LLMs (OpenAI o1, GPT-4o, DeepSeek R1, Microsoft Copilot, and Google Gemini 1.5 Flash) completed all MCQs in two independent runs. Scores were averaged, and test-retest reliability was estimated with Gwet's AC1. Student scores (<i>n</i> = 442) were summarized as mean ± SD or median (IQR). Pairwise differences between models were explored with McNemar's test; student-LLM contrasts were descriptive. Across courses, LLMs consistently exceeded the student median and, in several instances, the highest student score. Mean LLM course scores ranged from 7.46 to 9.88, versus student means of 4.28 to 7.32. OpenAI o1 achieved the highest mean in three courses; Copilot led in Cardiovascular Medicine (text-only subset due to image limitations). All LLMs answered every MCQ, and short-term test-retest agreement was high (AC1 0.79-1.00). Aggregated across courses, LLMs averaged 8.75 compared with 5.76 for students. On department-set Spanish MCQ exams with negative marking, LLMs outperformed enrolled medical students, answered every item, and showed high short-term reproducibility. These findings support cautious, faculty-supervised use of LLMs as adjuncts to MCQ assessment (e.g., automated pretesting, feedback). Confirmation across institutions, languages, and image-rich formats, and evaluation of educational impact beyond accuracy, are needed.
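The test-retest reliability measure named in the abstract, Gwet's AC1, corrects observed agreement between two runs for chance agreement. As an illustration only (the paper's actual computation is not shown here), a minimal sketch for two runs of categorical answers might look like this; the function name and inputs are hypothetical:

```python
from collections import Counter

def gwet_ac1(run1, run2):
    """Gwet's AC1 chance-corrected agreement between two runs of
    categorical answers (e.g. two LLM passes over the same MCQs)."""
    assert len(run1) == len(run2) and run1, "runs must be non-empty and equal length"
    n = len(run1)
    # Observed agreement: fraction of items answered identically in both runs.
    pa = sum(a == b for a, b in zip(run1, run2)) / n
    # Average prevalence of each answer category across both runs.
    counts = Counter(run1) + Counter(run2)
    k = len(counts)
    if k < 2:
        return 1.0  # only one category ever used: perfect agreement by definition
    pi = {c: counts[c] / (2 * n) for c in counts}
    # Chance agreement per Gwet: mean of pi*(1-pi) scaled by 1/(k-1).
    pe = sum(p * (1 - p) for p in pi.values()) / (k - 1)
    return (pa - pe) / (1 - pe)
```

Two identical runs yield AC1 = 1.0, while disagreements reduce the statistic toward (and below) zero once chance agreement is accounted for.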
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,200 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,051 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,416 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,410 citations