This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluating a Large Reasoning Model’s Performance on Open-Ended Medical Scenarios
Citations: 2
Authors: 4
Year: 2025
Abstract
Large language models (LLMs) have emerged as a dominant form of generative artificial intelligence (GenAI) across multiple domains. In early 2025, DeepSeek R1 was released, a new large reasoning model (LRM) that incorporates chain-of-thought (CoT) reasoning, a Mixture of Experts (MoE) architecture, and reinforcement learning. As these technologies continue to improve, evaluating the accuracy and reliability of LLMs and LRMs in medicine remains a crucial challenge. This paper reports on a follow-up study using DeepSeek R1 to evaluate medical scenarios contained in the MMLU-Pro benchmark, an enhanced benchmark designed to evaluate language understanding models across broader and more challenging tasks. In the previously reported study, the accuracy rate was 96% when multiple-choice MMLU-Pro answers were provided. In the current study, we evaluated DeepSeek R1 on 162 medical scenarios, but without the multiple-choice answers provided. The overall accuracy was 92%. This approach mirrors a more realistic clinical scenario, in which the clinician must decide on the most likely diagnosis and differential diagnoses without any clues. Given the high accuracy rates both with and without answers provided, further research is necessary to determine how to deploy LRMs in clinical medicine.
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,200 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,051 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,416 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,410 citations