
This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Limitations of large language models in clinical problem-solving arising from inflexible reasoning

2025 · 13 citations · Discovery Research Portal (University of Dundee) · Open Access

Citations: 13
Authors: 6
Year: 2025

Abstract

Large Language Models (LLMs) have attained human-level accuracy on medical question-answering (QA) benchmarks. However, their limitations in navigating clinical scenarios that require flexible reasoning have recently been shown, raising concerns about the robustness and generalizability of LLM reasoning across diverse, real-world medical tasks. To probe potential LLM failure modes in clinical problem-solving, we present the medical abstraction and reasoning corpus (mARC-QA). mARC-QA assesses clinical reasoning through scenarios designed to exploit the Einstellung effect (the fixation of thought arising from prior experience), targeting LLM inductive biases toward inflexible pattern matching on their training data rather than flexible reasoning. We find that LLMs, including current state-of-the-art o1, Gemini, Claude, and DeepSeek models, perform poorly compared to physicians on mARC-QA, often demonstrating a lack of commonsense medical reasoning and a propensity to hallucinate. In addition, uncertainty estimation analyses indicate that LLMs are overconfident in their answers despite their limited accuracy. The failure modes in LLM medical reasoning revealed by mARC-QA underscore the need for caution when deploying these models in clinical settings.
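To make the overconfidence finding concrete, the sketch below shows one standard way such an uncertainty analysis can be run: computing expected calibration error (ECE) from pairs of model-reported confidence and answer correctness. This is a minimal illustration of the general technique, not the paper's actual protocol; the function and variable names (`records`, `expected_calibration_error`) are hypothetical.

```python
# Minimal sketch of a calibration check (assumed setup, not the paper's code).
# `records` pairs a model's self-reported confidence with whether its
# answer was graded correct.

def expected_calibration_error(records, n_bins=10):
    """records: list of (confidence in [0, 1], correct as bool)."""
    bins = [[] for _ in range(n_bins)]
    for confidence, correct in records:
        # Clamp so confidence == 1.0 falls into the last bin.
        idx = min(int(confidence * n_bins), n_bins - 1)
        bins[idx].append((confidence, correct))

    total = len(records)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        # Overconfidence appears as avg_conf exceeding accuracy in a bin.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Toy example: high reported confidence with mediocre accuracy
# yields a large ECE, i.e. the model is poorly calibrated.
records = [(0.9, True), (0.9, False), (0.85, False), (0.95, True)]
print(f"ECE: {expected_calibration_error(records):.3f}")
```

A well-calibrated model would have ECE near zero; the abstract's finding corresponds to bins where average confidence substantially exceeds accuracy.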

Related works