This is an overview page with metadata for this scientific publication. The full article is available from the publisher.
Multimodal large language models challenge NEJM image challenge
Citations: 0
Authors: 7
Year: 2026
Abstract
Current evaluations of Large Language Models (LLMs) in medicine primarily focus on text-based benchmarks, leaving their multimodal diagnostic capabilities in complex, real-world clinical scenarios largely undefined. Furthermore, comparisons against large-scale human benchmarks remain scarce. To address this gap, we conducted a comprehensive evaluation of state-of-the-art multimodal LLMs (GPT-4o, Claude 3.7, and Doubao) using 272 complex cases from the New England Journal of Medicine Image Challenge (2009–2025). Uniquely, we benchmarked AI performance against a massive global dataset of 16,401,888 physician responses, representing the largest comparative study of human-AI diagnostic reasoning to date. Strikingly, all multimodal LLMs significantly outperformed the global physician collective (P < 0.001). Claude 3.7 achieved a diagnostic accuracy of 89.0%, surpassing the physician majority vote (46.7%) by an absolute margin of over 40 percentage points. Even in challenging cases where human accuracy fell below 40%, Claude 3.7 maintained an accuracy of 86.5%. A novel finding of this study is the remarkably low concordance between high-performing models and physicians (Cohen’s κ: 0.08–0.24). The ratio of model-advantage to physician-advantage cases reached 15.4:1, suggesting that MLLMs succeed in distinct areas where human cognition often falters. Our findings demonstrate that MLLMs have reached a superhuman tier in multimodal diagnostic accuracy. The substantial performance gap, coupled with low human-AI concordance, implies that MLLMs do not merely replicate human knowledge but utilize fundamentally distinct and complementary diagnostic reasoning pathways. These results position multimodal LLMs as critical, independent second readers capable of augmenting clinical decision-making in diagnostically difficult scenarios.
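The concordance statistic cited in the abstract (Cohen's κ: 0.08–0.24) measures how often model and physician correctness agree on the same cases beyond what chance alone would produce. The following is a minimal illustrative sketch of how such a per-case concordance value could be computed; the correctness labels below are hypothetical placeholders, not data from the study.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two raters over the same items (nominal labels)."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(ca) | set(cb)
    p_e = sum(ca[l] / n * cb[l] / n for l in labels)   # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)

# Hypothetical per-case correctness labels (1 = correct, 0 = incorrect);
# the study's actual case-level results are not reproduced here.
model     = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
physician = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

print(f"kappa = {cohen_kappa(model, physician):.2f}")  # low kappa = low concordance
```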
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,214 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,071 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,429 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,418 citations
Authors
Institutions
- Anhui Medical University (CN)
- Second Affiliated Hospital of Anhui Medical University (CN)
- East China Normal University (CN)
- Soochow University (CN)
- Children's Hospital of Suzhou University (CN)
- First Affiliated Hospital of Soochow University (CN)
- First Affiliated Hospital of Anhui Medical University (CN)
- Hefei Institutes of Physical Science (CN)
- Anhui Provincial Hospital (CN)