This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluation of large Language model performance on Persian rheumatology board exams: accuracy and clinical reasoning of GPT-4o vs. GPT-5.1
Citations: 0
Authors: 9
Year: 2026
Abstract
Large language models are increasingly integrated into medical education, yet evidence on their performance in non-English clinical examinations, particularly those in Persian, remains limited. This study evaluated how GPT-4o and GPT-5.1 perform on Iranian Rheumatology Board examination questions. A total of 204 multiple-choice items were administered in Persian using a similar prompt. Accuracy was determined using the official answer key, and six board-certified rheumatologists independently scored each model’s clinical reasoning on a 1–5 scale. GPT-5.1 demonstrated markedly superior performance, achieving 76% accuracy compared with 64.5% for GPT-4o, alongside significantly higher reasoning scores. Unlike GPT-4o, which showed considerable variability across question types, GPT-5.1 performed consistently across basic science, clinical scenarios, diagnosis, and treatment domains. Although inter-rater agreement among rheumatologists was modest, it remained statistically significant. These findings suggest that newer-generation LLMs provide more reliable reasoning and accuracy in Persian medical assessments. Nevertheless, despite their promising role as educational aids, current models are not yet suitable for high-stakes clinical decision-making and require continued evaluation across diverse languages and specialties.
Similar works
The Strengths and Difficulties Questionnaire: A Research Note
1997 · 14,616 citations
Making sense of Cronbach's alpha
2011 · 13,874 citations
QUADAS-2: A Revised Tool for the Quality Assessment of Diagnostic Accuracy Studies
2011 · 13,674 citations
A method for estimating the probability of adverse drug reactions
1981 · 11,493 citations
Evidence-Based Medicine
1992 · 4,155 citations