OpenAlex · Updated hourly · Last updated: 12.03.2026, 16:42

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Evaluating a Large Reasoning Model’s Performance on Open-Ended Medical Scenarios

2025 · 2 Citations · Open Access
Open full text at the publisher

2 Citations · 4 Authors · Year: 2025

Abstract

Large language models (LLMs) have emerged as a dominant form of generative artificial intelligence (GenAI) in multiple domains. In early 2025, DeepSeek R1 was released, a new large reasoning model (LRM) that includes chain-of-thought (CoT) reasoning, Mixture of Experts (MoE), and reinforcement learning. As these technologies continue to improve, evaluating the accuracy and reliability of LLMs and LRMs in medicine remains a crucial challenge. This paper reports on a follow-up study using DeepSeek R1 to evaluate medical scenarios contained in the MMLU-Pro benchmark, an enhanced benchmark designed to evaluate language understanding models across broader and more challenging tasks. In the previously reported study, the accuracy rate was 96% when multiple-choice MMLU-Pro answers were provided. In the current study, we evaluated DeepSeek R1 on 162 medical scenarios, but without multiple-choice answers provided. The overall accuracy was 92%. This approach mirrors a more realistic clinical scenario in which the clinician must decide on the most likely diagnosis and differential diagnoses without any cues. Further research is necessary to determine how to deploy LRMs in clinical medicine, given their high accuracy rate both with and without answers provided.
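The reported open-ended accuracy can be checked with simple arithmetic. The sketch below is an illustration, not the authors' evaluation code; the count of 149 correct answers is inferred here, since 149 is the only integer out of 162 that rounds to the reported 92%.

```python
# Hypothetical sketch of the accuracy calculation described in the abstract.
# The figure of 149 correct answers is inferred from the reported 92% of 162
# scenarios (149 / 162 ≈ 91.98%, which rounds to 92%; 150 / 162 rounds to 93%).

def accuracy_pct(num_correct: int, num_total: int) -> float:
    """Percentage of scenarios answered correctly."""
    return 100.0 * num_correct / num_total

open_ended = accuracy_pct(149, 162)   # open-ended, no answer choices given
print(f"{open_ended:.0f}%")           # → 92%
```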

Topics

Artificial Intelligence in Healthcare and Education · Artificial Intelligence in Healthcare · Machine Learning in Healthcare