This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Regionalized Medical Competence of Large Language Models: A Performance Benchmark on the Brazilian ENAMED Examination (Preprint)
0
Citations
9
Authors
2025
Year
Abstract
<sec> <title>BACKGROUND</title> Large language models (LLMs) show potential for clinical decision support, but current evaluations rely heavily on Anglophone benchmarks, limiting their applicability in specific healthcare contexts like Brazil. </sec>
<sec> <title>OBJECTIVE</title> To compare the performance of frontier generalist LLMs and a specialized model (Charcot) on the 2026 Brazilian National Examination for Medical Education (ENAMED), assessing accuracy, response times, and collective error patterns. </sec>
<sec> <title>METHODS</title> This observational study evaluated ten LLMs, including GPT-5, Gemini 2.5 Pro, and the Brazilian-specialized model Charcot. The models completed the ENAMED 2026 examination (99 valid items) across five independent runs with randomized question and alternative ordering. The primary outcome was mean accuracy compared to the official answer key. Secondary outcomes included Normalized Mean Response Time (NMRT) and Convergence Error (CE), defined as a collective bias in which at least three generalist models consistently selected the same incorrect alternative. Qualitative analysis of rationales was performed for questions exhibiting high convergence or clinical relevance. </sec>
<sec> <title>RESULTS</title> Nine models exceeded 85% accuracy. The specialized model, Charcot, achieved the highest mean accuracy (96.96%), significantly outperforming the top generalist models, GPT-5 (94.34%) and Gemini 2.5 Pro (93.94%) (P < .001). Charcot demonstrated superior performance on items requiring knowledge of specific Brazilian guidelines. The CE analysis revealed that generalist models often converged on incorrect answers in domains such as tuberculosis and prenatal care, whereas the specialized model aligned with local protocols. Conversely, model consensus correctly identified an inconsistency in the official answer key regarding indigenous health. No significant correlation was found between response time and global accuracy. </sec>
<sec> <title>CONCLUSIONS</title> Domain specialization in the Portuguese language and Brazilian medical context confers a measurable advantage in complex medical tasks, reducing errors derived from training biases present in generalist models. While frontier models demonstrate near-human or superhuman performance on multiple-choice questions, the persistence of collective errors highlights the need for continuous expert supervision. Furthermore, the consensus among models suggests their potential utility as auditing tools for validating high-stakes medical examinations. </sec>
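The Convergence Error (CE) metric described in the abstract can be made concrete with a short sketch. The function below is an illustrative assumption, not the authors' actual analysis code: it flags items where at least three models agree on the same incorrect alternative, given per-model answer dictionaries and an official answer key.

```python
from collections import Counter

def convergence_errors(responses, answer_key, threshold=3):
    """Flag items where >= `threshold` models chose the same wrong alternative.

    responses:  dict mapping model name -> {item_id: chosen alternative}
    answer_key: dict mapping item_id -> correct alternative
    Returns a dict mapping item_id -> (wrong alternative, number of models).
    """
    flagged = {}
    for item, correct in answer_key.items():
        # Count only the incorrect choices for this item across all models.
        wrong_choices = Counter(
            answers[item]
            for answers in responses.values()
            if item in answers and answers[item] != correct
        )
        if wrong_choices:
            alternative, count = wrong_choices.most_common(1)[0]
            if count >= threshold:
                flagged[item] = (alternative, count)
    return flagged
```

For example, if three of four models pick alternative "B" on an item whose key is "A", the item is flagged with `("B", 3)`; the threshold of three mirrors the definition given in the Methods section.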
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,349 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,219 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,631 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,480 citations