OpenAlex · Updated hourly · Last updated: 18 Mar 2026, 14:50

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Comparison of Multimodal Large Language Models and Physicians for Medical Diagnosis Using NEJM Image Challenge Cases: Cross-sectional Study

2025 · 0 citations · Open Access
Open full text at publisher

Citations: 0 · Authors: 6 · Year: 2025

Abstract

Background: Multimodal large language models (LLMs), capable of processing both images and text, may enhance diagnostic accuracy in clinical practice, particularly for rare diseases where diagnostic expertise is limited.

Methods: We evaluated three multimodal LLMs (GPT-4o, Claude 3.7 Sonnet, and Doubao) on 272 cases across 11 comprehensive analyses from the New England Journal of Medicine (NEJM) Image Challenge (June 2009–March 2025). Each model was tested with image-only inputs and with combined image-text inputs, and its answers were compared with 16,401,888 physician responses worldwide (mean: 60,301 responses per case). Training data contamination was assessed by comparing performance on cases published before and after each model's training cutoff. The primary outcome was diagnostic accuracy in multimodal testing.

Results: Temporal analysis revealed no evidence of training data contamination: the models maintained or improved their performance on post-cutoff cases. All LLMs significantly outperformed physicians in multimodal testing (exact p < 0.000001 after multiple-comparison correction). Diagnostic accuracies were 89.0% (95% confidence interval [CI], Wilson method, 84.9–92.3) for Claude 3.7 Sonnet, 88.6% (95% CI, 84.5–92.0) for GPT-4o, and 71.0% (95% CI, 65.3–76.2) for Doubao, compared with 46.7% (95% CI, 40.7–52.7) for the physician majority vote, with absolute differences exceeding 40 percentage points. In diagnostically challenging cases with < 40% physician consensus, Claude 3.7 Sonnet maintained 86.5% accuracy versus 33.4% for physicians. Model-physician concordance was low (Cohen's κ, 0.08–0.24), with a 15.4:1 ratio of model-advantage to physician-advantage cases for Claude 3.7 Sonnet. Adding clinical text improved accuracy by 28–42 percentage points across all models. At least one model was correct in 96.3% of cases.
Conclusions: Multimodal LLMs demonstrated superior diagnostic performance compared with physicians across diverse clinical scenarios, with evidence suggesting genuine reasoning rather than training-data memorization. These findings support the potential use of multimodal AI as a diagnostic tool in clinical practice.
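The abstract reports accuracy confidence intervals computed with the Wilson score method, which behaves better than the normal approximation for proportions near 0 or 1. As a minimal sketch of that calculation, assuming 272 cases and a hypothetical count of 242 correct answers (the per-model denominators in the paper may differ):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% CI by default)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

# Hypothetical example: 242 correct out of 272 cases (~89.0% accuracy)
lo, hi = wilson_interval(242, 272)
print(f"95% CI: {lo:.1%} – {hi:.1%}")
```

Unlike the Wald interval, the Wilson interval is never empty and never extends outside [0, 1], which is why it is a common choice for reporting diagnostic-accuracy CIs.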

Topics

Clinical Reasoning and Diagnostic Skills · Radiology practices and education · Artificial Intelligence in Healthcare and Education