This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Regionalized Medical Competence of Large Language Models: A Performance Benchmark on the Brazilian ENAMED Examination (Preprint)
0
Citations
9
Authors
2025
Year
Abstract
<sec> <title>BACKGROUND</title> Large language models (LLMs) show potential for clinical decision support, but current evaluations rely heavily on Anglophone benchmarks, limiting their applicability in specific healthcare contexts like Brazil. </sec>
<sec> <title>OBJECTIVE</title> To compare the performance of frontier generalist LLMs and a specialized model (Charcot) on the 2026 Brazilian National Examination for Medical Education (ENAMED), assessing accuracy, response times, and collective error patterns. </sec>
<sec> <title>METHODS</title> This observational study evaluated ten LLMs, including GPT-5, Gemini 2.5 Pro, and the Brazilian-specialized model Charcot. The models completed the ENAMED 2026 examination (99 valid items) across five independent runs with randomized question and alternative ordering. The primary outcome was mean accuracy compared to the official answer key. Secondary outcomes included Normalized Mean Response Time (NMRT) and Convergence Error (CE), defined as a collective bias in which at least three generalist models consistently selected the same incorrect alternative. Qualitative analysis of rationales was performed for questions exhibiting high convergence or clinical relevance. </sec>
<sec> <title>RESULTS</title> Nine models exceeded 85% accuracy. The specialized model, Charcot, achieved the highest mean accuracy (96.96%), significantly outperforming the top generalist models, GPT-5 (94.34%) and Gemini 2.5 Pro (93.94%) (P < .001). Charcot demonstrated superior performance on items requiring knowledge of specific Brazilian guidelines. The CE analysis revealed that generalist models often converged on incorrect answers in domains such as tuberculosis and prenatal care, whereas the specialized model aligned with local protocols. Conversely, model consensus correctly identified an inconsistency in the official answer key regarding indigenous health. No significant correlation was found between response time and global accuracy. </sec>
<sec> <title>CONCLUSIONS</title> Domain specialization in the Portuguese language and Brazilian medical context confers a measurable advantage in complex medical tasks, reducing errors derived from training biases present in generalist models. While frontier models demonstrate near-human or superhuman performance on multiple-choice questions, the persistence of collective errors highlights the need for continuous expert supervision. Furthermore, the consensus among models suggests their potential utility as auditing tools for validating high-stakes medical examinations. </sec>
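The Convergence Error (CE) metric described in the abstract can be made concrete with a short sketch. The function below is an illustrative assumption, not the authors' actual analysis code: it flags items where at least three models agree on the same incorrect alternative, given per-model answer dictionaries and an official answer key.

```python
from collections import Counter

def convergence_errors(responses, answer_key, threshold=3):
    """Flag items where >= `threshold` models chose the same wrong alternative.

    responses:  dict mapping model name -> {item_id: chosen alternative}
    answer_key: dict mapping item_id -> correct alternative
    Returns a dict mapping item_id -> (wrong alternative, number of models).
    """
    flagged = {}
    for item, correct in answer_key.items():
        # Count only the incorrect choices for this item across all models.
        wrong_choices = Counter(
            answers[item]
            for answers in responses.values()
            if item in answers and answers[item] != correct
        )
        if wrong_choices:
            alternative, count = wrong_choices.most_common(1)[0]
            if count >= threshold:
                flagged[item] = (alternative, count)
    return flagged
```

For example, if three of four models pick alternative "B" on an item whose key is "A", the item is flagged with `("B", 3)`; the threshold of three mirrors the definition given in the Methods section.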
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,349 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,219 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,631 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,480 citations