This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Benchmarking Large Language Models on USMLE: Evaluating ChatGPT, DeepSeek, Grok, and Qwen in Clinical Reasoning and Medical Licensing Scenarios
1
Citations
7
Authors
2025
Year
Abstract
Artificial intelligence (AI) is transforming healthcare by assisting with complex clinical reasoning and diagnosis. Recent research demonstrates that large language models (LLMs), such as ChatGPT and DeepSeek, have considerable potential in medical comprehension. This study evaluates the clinical reasoning capabilities of four advanced LLMs (ChatGPT, DeepSeek, Grok, and Qwen), using the United States Medical Licensing Examination (USMLE) as a standard benchmark. We assess 376 publicly accessible USMLE sample exam questions (Step 1, Step 2 CK, Step 3) from the most recent booklet, released in July 2023. We analyze model performance across four question categories (text-only, text with image, text with mathematical reasoning, and integrated text-image-mathematical reasoning) and measure model accuracy on each of the three USMLE steps. Our findings indicate that on Step 2 CK, DeepSeek consistently outperforms the other models, achieving a peak accuracy of 93%. Despite ChatGPT’s low latency, the limited overlap in error patterns suggests that ensemble approaches might improve effectiveness. Grok and Qwen show lower and less consistent accuracy across all steps. These findings highlight the importance of LLMs for clinical reasoning in medical licensing scenarios. However, we also emphasize that these approaches require improvement to ensure their safe and effective integration into practical healthcare processes.
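The evaluation described in the abstract (accuracy per USMLE step and per question category, plus the suggestion that ensembles might help when models err on different questions) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record layout, the model names `m1` to `m3`, the toy answers, and the majority-vote rule are hypothetical and are not taken from the study.

```python
from collections import Counter, defaultdict

# Hypothetical toy records, NOT data from the study:
# (step, category, correct answer, {model name: predicted answer})
records = [
    ("Step 1", "text-only", "A", {"m1": "A", "m2": "A", "m3": "B"}),
    ("Step 1", "text+image", "C", {"m1": "B", "m2": "C", "m3": "C"}),
    ("Step 2 CK", "text-only", "D", {"m1": "D", "m2": "B", "m3": "D"}),
    ("Step 2 CK", "text+math", "B", {"m1": "B", "m2": "B", "m3": "A"}),
]


def accuracy_by_group(records, model, key_index):
    """Accuracy of one model, grouped by step (key_index=0) or category (key_index=1)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for rec in records:
        group, truth, preds = rec[key_index], rec[2], rec[3]
        totals[group] += 1
        hits[group] += preds[model] == truth  # True counts as 1
    return {group: hits[group] / totals[group] for group in totals}


def majority_vote(preds):
    """Most common answer across models; ties go to the answer seen first."""
    return Counter(preds.values()).most_common(1)[0][0]


def ensemble_accuracy(records):
    """Accuracy of a simple majority-vote ensemble (an assumed ensembling rule)."""
    correct = sum(majority_vote(rec[3]) == rec[2] for rec in records)
    return correct / len(records)
```

In this toy data each model misses at least one question, but no two models miss the same one, so the majority vote answers every question correctly. That is the situation in which low overlap in error patterns makes an ensemble attractive.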
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,260 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,116 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,493 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,438 citations