This is an overview page with metadata for this scientific publication. The full article is available from the publisher.
Performance Comparison of Large Language Models on Brazil’s Medical Revalidation Exam for Foreign-Trained Graduates
Citations: 3
Authors: 9
Year: 2025
Abstract
This study compared the performance of various Large Language Models (LLMs) in answering multiple-choice questions from the last six editions (2017 to 2024) of the Revalida exam. The evaluation focused on models capable of processing content in Brazilian Portuguese (PT-BR), including open-source models, namely LLaMA 3.1 (8B parameters), Qwen 2.5 (7B parameters), and their reasoning-oriented distilled variants based on the DeepSeek-R1 architecture, as well as openly accessible commercial models such as GPT-3.5, GPT-4o, and Gemini. After the models' accuracy was evaluated against the official answer keys, GPT-4o emerged as the top-performing model, achieving an average accuracy of 63.85%. Next, GPT-4o was prompted to justify its answers to the 2024 exam, and its explanations were independently reviewed by three licensed physicians. The evaluators reported full agreement with the clinical reasoning presented, indicating the model's ability to produce coherent and medically relevant justifications. Lastly, justifications generated by GPT-4o for correctly answered questions from previous exams (2017–2023) were compiled into a knowledge base, which was then used to enhance GPT-4o through retrieval-augmented generation and to fine-tune LLaMA 3.1, leading to measurable performance improvements on the 2024 exam. Despite this promising performance, these models still exhibit response variability, hallucinations, and limited reliability in high-stakes contexts. Their outputs should therefore always be reviewed by qualified professionals, and human expertise remains essential in clinical decision-making and medical education, particularly in the PT-BR language setting. However, the observed gains from integrating prior exam content indicate that domain-specific adaptation strategies may help mitigate some of these limitations and improve model alignment.
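The retrieval-augmented generation step described in the abstract can be sketched as follows: justifications for correctly answered past-exam questions form a knowledge base, and the entries most similar to a new question are prepended to the prompt as context. This is a minimal illustration only; the paper does not specify its retrieval method, so simple token-overlap (Jaccard) similarity stands in here, and all function names and sample data are hypothetical.

```python
# Minimal RAG sketch (illustrative; retrieval method and names assumed,
# not taken from the paper). Requires Python 3.9+ for builtin generics.

def tokenize(text: str) -> set[str]:
    """Naive whitespace tokenizer used for overlap scoring."""
    return set(text.lower().split())

def retrieve(query: str, knowledge_base: list[dict], k: int = 2) -> list[dict]:
    """Return the k entries whose question text best overlaps the query."""
    q_tokens = tokenize(query)
    def score(entry: dict) -> float:
        e_tokens = tokenize(entry["question"])
        union = q_tokens | e_tokens
        return len(q_tokens & e_tokens) / len(union) if union else 0.0
    return sorted(knowledge_base, key=score, reverse=True)[:k]

def build_prompt(query: str, knowledge_base: list[dict]) -> str:
    """Assemble an augmented prompt: retrieved justifications, then the question."""
    context = "\n".join(
        f"Q: {e['question']}\nJustification: {e['justification']}"
        for e in retrieve(query, knowledge_base)
    )
    return f"Context from past exams:\n{context}\n\nNew question:\n{query}"

# Toy knowledge base of (question, justification) pairs.
kb = [
    {"question": "First-line treatment for uncomplicated hypertension?",
     "justification": "Thiazide diuretics are recommended as initial therapy."},
    {"question": "Vaccine schedule for newborns in Brazil?",
     "justification": "BCG and hepatitis B vaccines are given at birth."},
]
prompt = build_prompt("What is the initial treatment for hypertension?", kb)
```

In the study's setting, the assembled prompt would then be sent to GPT-4o; a production system would replace the token-overlap scorer with embedding-based retrieval.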
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,402 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,270 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,702 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,507 citations