OpenAlex · Updated hourly · Last updated: 18.03.2026, 19:25

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Can Generative Artificial Intelligence Reliably Score Open-Ended Question Assessments in Undergraduate Medical Education?

2026 · 0 citations · Medical Science Educator · Open Access

Citations: 0 · Authors: 7 · Year: 2026

Abstract

There are numerous benefits to including open-ended questions (OEQs) in the assessment of medical knowledge, but one of the biggest challenges is the time it takes to grade student responses. With the wide-scale introduction of generative artificial intelligence (AI), it is plausible that OEQ exams can be automatically scored. The purpose of this study was to establish the accuracy of generative AI when scoring medical student OEQ exams. Students’ responses from OEQs administered at two US allopathic medical schools were analyzed. Case vignettes, questions, rubrics, and student responses were fed into the GPT-4 model via the Med2Lab platform. The Med2Lab system was specifically engineered to manage rubric integration and automate prompt workflows. Scores and feedback on students’ responses were generated and compared to faculty scores using Cohen’s weighted kappa (kw) to evaluate inter-rater reliability (IRR). An error pattern analysis was performed to assess why there were scoring discrepancies between faculty and GPT-4, and this information was then used to perform rubric engineering. We ran 3 iterations of GPT-4 scoring after each rubric adjustment. By the third iteration, IRR between faculty and GPT-4 was substantial using the analytic rubric (question 1A: kw=0.94; question 2A: kw=0.88) and the holistic rubric (question 2H: kw=0.89). IRR for question 1H reached moderate reliability (kw=0.54). We identified errors in GPT-4 and faculty scoring, although score discrepancies were typically only 1 point. Our data suggest that generative AI can be used to reliably score OEQ exams using an iterative process of rubric engineering to achieve maximum reliability.
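The agreement statistic named in the abstract, Cohen's weighted kappa (kw), can be illustrated with a minimal sketch. The abstract does not say whether linear or quadratic weights were used, so the example below supports both as an assumption; the function name `weighted_kappa` and the score lists are hypothetical illustrations, not data or code from the study.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories, weights="linear"):
    """Cohen's weighted kappa for two raters over ordered categories.

    `weights` is "linear" or "quadratic"; which scheme the study used
    is not stated in the abstract, so both are offered as an assumption.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty score lists")
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two ordered categories")
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    def disagreement(i, j):
        # Normalised distance between category ranks, 0 on the diagonal.
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    # Observed joint counts and each rater's marginal counts.
    joint = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    observed = sum(disagreement(i, j) * c for (i, j), c in joint.items()) / n
    expected = sum(
        disagreement(i, j) * marg_a[i] * marg_b[j]
        for i in range(k)
        for j in range(k)
    ) / (n * n)
    if expected == 0:
        # Both raters used one identical category throughout: undefined.
        return float("nan")
    return 1.0 - observed / expected


# Hypothetical rubric scores (0 to 3 points) for eight student responses.
faculty_scores = [0, 1, 2, 3, 2, 1, 0, 3]
model_scores = [0, 1, 2, 2, 2, 1, 1, 3]

print(round(weighted_kappa(faculty_scores, model_scores, range(4)), 3))
print(round(weighted_kappa(faculty_scores, model_scores, range(4), "quadratic"), 3))
```

In this made-up example the two raters differ by 1 point on two of eight responses, which yields kw of about 0.778 with linear weights and 0.875 with quadratic weights. Weighted kappa penalises large disagreements more than small ones, which is why 1-point discrepancies can still coexist with high reliability values.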

Similar works