OpenAlex · Aktualisierung stündlich · Letzte Aktualisierung: 12.04.2026, 10:04

Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.

Evaluation of large Language model performance on Persian rheumatology board exams: accuracy and clinical reasoning of GPT-4o vs. GPT-5.1

2026·0 Zitationen·Scientific ReportsOpen Access
Volltext beim Verlag öffnen

0

Zitationen

9

Autoren

2026

Jahr

Abstract

Large language models are increasingly integrated into medical education, yet their performance in non-English clinical examinations, particularly Persian, remains limited. This study evaluated how GPT-4o and GPT-5.1 perform on Iranian Rheumatology Board examination questions. A total of 204 multiple-choice items were administered in Persian using a similar prompt. Accuracy was determined using the official answer key, and six board-certified rheumatologists independently scored each model’s clinical reasoning on a 1–5 scale. GPT-5.1 demonstrated markedly superior performance, achieving 76% accuracy compared with 64.5% for GPT-4o, alongside significantly higher reasoning scores. Unlike GPT-4o, which showed considerable variability across question types, GPT-5.1 performed consistently across basic science, clinical scenarios, diagnosis, and treatment domains. Although inter-rater agreement among rheumatologists was modest, it remained statistically significant. These findings suggest that newer-generation LLMs provide more reliable reasoning and accuracy in Persian medical assessments. Nevertheless, despite their promising role as educational aids, current models are not yet suitable for high-stakes clinical decision-making and require continued evaluation across diverse languages and specialties.

Ähnliche Arbeiten