This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluating the Performance of Large Language Models on the MCAT
Citations: 0
Authors: 9
Year: 2025
Abstract
The emergence of large language models (LLMs) such as OpenAI’s ChatGPT and Google’s Gemini has opened new possibilities for their use in standardized test preparation. This study evaluates the performance of ChatGPT 4.0 and Gemini (both December 2024 versions) on an official AAMC full-length Medical College Admission Test (MCAT) practice exam. Using a standardized input protocol, we compared the models’ answers and calculated accuracy, MCAT section scores, and percentile rankings. ChatGPT outperformed Gemini, achieving a score of 522 (99th percentile) with an accuracy of 90.87%, compared to Gemini’s 518 (95th percentile) and 84.78% accuracy, with a statistically significant difference (p = 0.005). While these results highlight the educational potential of LLMs, they also raise important questions about the relevance of standardized testing in an era of increasingly accessible AI tools. Furthermore, although LLMs perform well on exams, their non-negligible error rates caution against their use in clinical decision-making without human oversight. This study contributes to the growing discussion on how AI may transform medical education, assessment, and the future role of physicians.
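This overview page does not state which statistical test produced p = 0.005, nor does it include the authors' analysis code. As a minimal illustration only, the sketch below assumes per-question correctness records for both models on the same 230-item AAMC exam and applies an exact McNemar test (a common choice for paired right/wrong outcomes) via scipy.stats.binomtest. The correctness vectors are hypothetical placeholders constructed only to match the reported accuracies, so the printed p-value is not expected to reproduce the paper's result.

# Minimal sketch, not the authors' code: accuracy and a paired
# McNemar test for two models answering the same MCAT items.
from scipy.stats import binomtest

N = 230  # an AAMC full-length MCAT has 230 questions (59 + 53 + 59 + 59)

# Hypothetical placeholder data, chosen only to match the reported
# accuracies; True = item answered correctly.
chatgpt = [True] * 209 + [False] * 21   # 209/230 = 90.87%
gemini = [True] * 195 + [False] * 35    # 195/230 = 84.78%

print(f"ChatGPT accuracy: {sum(chatgpt) / N:.2%}")
print(f"Gemini accuracy:  {sum(gemini) / N:.2%}")

# McNemar's test uses only the discordant pairs (items that exactly
# one model got right); under the null hypothesis they split 50/50.
only_chatgpt = sum(c and not g for c, g in zip(chatgpt, gemini))
only_gemini = sum(g and not c for c, g in zip(chatgpt, gemini))
result = binomtest(only_chatgpt, only_chatgpt + only_gemini, p=0.5)
print(f"Discordant split {only_chatgpt}:{only_gemini}, p = {result.pvalue:.4f}")

With the study's actual per-question responses the discordant split, and hence the p-value, would differ; the sketch only shows the mechanics of the paired comparison.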
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,245 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,102 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,468 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,429 citations