This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Multimodal large language models challenge NEJM image challenge

2026 · 0 citations · 7 authors · Scientific Reports · Open Access

Abstract

Current evaluations of Large Language Models (LLMs) in medicine primarily focus on text-based benchmarks, leaving their multimodal diagnostic capabilities in complex, real-world clinical scenarios largely undefined. Furthermore, comparisons against large-scale human benchmarks remain scarce. To address this gap, we conducted a comprehensive evaluation of state-of-the-art multimodal LLMs (GPT-4o, Claude 3.7, and Doubao) using 272 complex cases from the New England Journal of Medicine Image Challenge (2009–2025). Uniquely, we benchmarked AI performance against a massive global dataset of 16,401,888 physician responses, representing the largest comparative study of human-AI diagnostic reasoning to date. Strikingly, all multimodal LLMs significantly outperformed the global physician collective (P < 0.001). Claude 3.7 achieved a diagnostic accuracy of 89.0%, surpassing the physician majority vote (46.7%) by an absolute margin of over 40 percentage points. Even in challenging cases where human accuracy fell below 40%, Claude 3.7 maintained an accuracy of 86.5%. A novel finding of this study is the remarkably low concordance between high-performing models and physicians (Cohen’s κ: 0.08–0.24). The ratio of model-advantage to physician-advantage cases reached 15.4:1, suggesting that MLLMs succeed in distinct areas where human cognition often falters. Our findings demonstrate that MLLMs have reached a superhuman tier in multimodal diagnostic accuracy. The substantial performance gap, coupled with low human-AI concordance, implies that MLLMs do not merely replicate human knowledge but utilize fundamentally distinct and complementary diagnostic reasoning pathways. These results position multimodal LLMs as critical, independent second readers capable of augmenting clinical decision-making in diagnostically difficult scenarios.
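The concordance figure reported above is Cohen's κ, which discounts raw agreement between two raters by the agreement expected from their marginal answer frequencies alone: κ = (p_o − p_e) / (1 − p_e). Below is a minimal Python sketch of this computation; the per-case answer lists are hypothetical placeholders, not data from the study.

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    # Cohen's kappa for two raters assigning categorical labels to the same cases.
    n = len(rater_a)
    # p_o: observed agreement -- fraction of cases where both raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical answers (options A-E) for eight Image Challenge cases;
# illustrative placeholders only, not responses from the study.
model_answers = ["A", "C", "B", "E", "D", "A", "C", "B"]
physician_majority = ["A", "B", "B", "C", "D", "E", "A", "B"]
print(f"kappa = {cohens_kappa(model_answers, physician_majority):.2f}")

Under this statistic, the κ of 0.08–0.24 reported above means that, once chance agreement is discounted, the models' case-by-case answer pattern only weakly tracks the physician collective's, which is the basis for the abstract's claim of distinct, complementary reasoning pathways.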
