This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Multimodal large language models challenge NEJM image challenge

2026 · 0 citations · 7 authors · Scientific Reports · Open Access

Abstract

Current evaluations of Large Language Models (LLMs) in medicine primarily focus on text-based benchmarks, leaving their multimodal diagnostic capabilities in complex, real-world clinical scenarios largely undefined. Furthermore, comparisons against large-scale human benchmarks remain scarce. To address this gap, we conducted a comprehensive evaluation of state-of-the-art multimodal LLMs (GPT-4o, Claude 3.7, and Doubao) using 272 complex cases from the New England Journal of Medicine Image Challenge (2009–2025). Uniquely, we benchmarked AI performance against a massive global dataset of 16,401,888 physician responses, representing the largest comparative study of human-AI diagnostic reasoning to date. Strikingly, all multimodal LLMs significantly outperformed the global physician collective (P < 0.001). Claude 3.7 achieved a diagnostic accuracy of 89.0%, surpassing the physician majority vote (46.7%) by an absolute margin of over 40 percentage points. Even in challenging cases where human accuracy fell below 40%, Claude 3.7 maintained an accuracy of 86.5%. A novel finding of this study is the remarkably low concordance between high-performing models and physicians (Cohen’s κ: 0.08–0.24). The ratio of model-advantage to physician-advantage cases reached 15.4:1, suggesting that MLLMs succeed in distinct areas where human cognition often falters. Our findings demonstrate that MLLMs have reached a superhuman tier in multimodal diagnostic accuracy. The substantial performance gap, coupled with low human-AI concordance, implies that MLLMs do not merely replicate human knowledge but utilize fundamentally distinct and complementary diagnostic reasoning pathways. These results position multimodal LLMs as critical, independent second readers capable of augmenting clinical decision-making in diagnostically difficult scenarios.
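The concordance figure reported above is Cohen's κ, which discounts raw agreement between two raters by the agreement expected from their marginal answer frequencies alone: κ = (p_o − p_e) / (1 − p_e). Below is a minimal Python sketch of this computation; the per-case answer lists are hypothetical placeholders, not data from the study.

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    # Cohen's kappa for two raters assigning categorical labels to the same cases.
    n = len(rater_a)
    # p_o: observed agreement -- fraction of cases where both raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical answers (options A-E) for eight Image Challenge cases;
# illustrative placeholders only, not responses from the study.
model_answers = ["A", "C", "B", "E", "D", "A", "C", "B"]
physician_majority = ["A", "B", "B", "C", "D", "E", "A", "B"]
print(f"kappa = {cohens_kappa(model_answers, physician_majority):.2f}")

Under this statistic, the κ of 0.08–0.24 reported above means that, once chance agreement is discounted, the models' case-by-case answer pattern only weakly tracks the physician collective's, which is the basis for the abstract's claim of distinct, complementary reasoning pathways.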
