This is an overview page with metadata for this scientific publication. The full article is available from the publisher.
Performance Comparison of Large Language Models on Brazil’s Medical Revalidation Exam for Foreign-Trained Graduates
Citations: 3
Authors: 9
Year: 2025
Abstract
This study compared the performance of various Large Language Models (LLMs) in answering multiple-choice questions from the last six editions (2017 to 2024) of the Revalida exam. The evaluation focused on models capable of processing content in Brazilian Portuguese (PT-BR), including open-source models, namely LLaMA 3.1 (8B parameters), Qwen 2.5 (7B parameters), and their reasoning-oriented distilled variants based on the DeepSeek-R1 architecture, as well as openly accessible commercial models such as GPT-3.5, GPT-4o, and Gemini. After the models' accuracy was evaluated against the official answer keys, GPT-4o emerged as the top-performing model, achieving an average accuracy of 63.85%. Next, GPT-4o was prompted to justify its answers to the 2024 exam, and its explanations were independently reviewed by three licensed physicians. The evaluators reported full agreement with the clinical reasoning presented, indicating the model's ability to produce coherent and medically relevant justifications. Lastly, justifications generated by GPT-4o for correctly answered questions from previous exams (2017–2023) were compiled into a knowledge base, which was then used to enhance GPT-4o through retrieval-augmented generation and to fine-tune LLaMA 3.1, leading to measurable performance improvements on the 2024 exam. Despite this promising performance, these models still exhibit response variability, hallucinations, and limited reliability in high-stakes contexts. Their outputs should therefore always be reviewed by qualified professionals, and human expertise remains essential in clinical decision-making and medical education, particularly in the PT-BR language setting. However, the observed gains from integrating prior exam content indicate that domain-specific adaptation strategies may help mitigate some of these limitations and improve model alignment.
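The retrieval-augmented generation step described in the abstract can be sketched as follows: justifications for correctly answered past-exam questions form a knowledge base, and the entries most similar to a new question are prepended to the prompt as context. This is a minimal illustration only; the paper does not specify its retrieval method, so simple token-overlap (Jaccard) similarity stands in here, and all function names and sample data are hypothetical.

```python
# Minimal RAG sketch (illustrative; retrieval method and names assumed,
# not taken from the paper). Requires Python 3.9+ for builtin generics.

def tokenize(text: str) -> set[str]:
    """Naive whitespace tokenizer used for overlap scoring."""
    return set(text.lower().split())

def retrieve(query: str, knowledge_base: list[dict], k: int = 2) -> list[dict]:
    """Return the k entries whose question text best overlaps the query."""
    q_tokens = tokenize(query)
    def score(entry: dict) -> float:
        e_tokens = tokenize(entry["question"])
        union = q_tokens | e_tokens
        return len(q_tokens & e_tokens) / len(union) if union else 0.0
    return sorted(knowledge_base, key=score, reverse=True)[:k]

def build_prompt(query: str, knowledge_base: list[dict]) -> str:
    """Assemble an augmented prompt: retrieved justifications, then the question."""
    context = "\n".join(
        f"Q: {e['question']}\nJustification: {e['justification']}"
        for e in retrieve(query, knowledge_base)
    )
    return f"Context from past exams:\n{context}\n\nNew question:\n{query}"

# Toy knowledge base of (question, justification) pairs.
kb = [
    {"question": "First-line treatment for uncomplicated hypertension?",
     "justification": "Thiazide diuretics are recommended as initial therapy."},
    {"question": "Vaccine schedule for newborns in Brazil?",
     "justification": "BCG and hepatitis B vaccines are given at birth."},
]
prompt = build_prompt("What is the initial treatment for hypertension?", kb)
```

In the study's setting, the assembled prompt would then be sent to GPT-4o; a production system would replace the token-overlap scorer with embedding-based retrieval.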
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,402 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,270 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,702 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,507 citations