This is an overview page with metadata for this scientific article. The full article is available from the publisher.
A Comparative Assessment of ChatGPT, Gemini, and DeepSeek Accuracy: Examining Visual Medical Assessment in Internal Medicine Cases with and Without Clinical Context
0
Citations
6
Authors
2026
Year
Abstract
<b>Background and Aim:</b> Large language models (LLMs) demonstrate significant potential in assisting with medical image interpretation. However, the diagnostic accuracy of general-purpose LLMs on image-based internal medicine cases and the added value of a brief clinical history remain unclear. This study evaluated three general-purpose LLMs (ChatGPT, Gemini, and DeepSeek) on expert-curated cases to quantify diagnostic accuracy with image-only input versus image plus brief clinical context. <b>Methods:</b> We conducted a comparative evaluation using 138 expert-curated cases from Harrison's Visual Case Challenge. Each case was presented to the models in two distinct phases: Phase 1 (image only) and Phase 2 (image plus a brief clinical history). The primary endpoint was top-1 diagnostic accuracy for the textbook diagnosis, comparing performance with versus without a brief clinical history. Secondary and exploratory analyses compared models and assessed agreement between model-generated differential lists and the textbook differential. Statistical analysis included Wilson 95% confidence intervals, McNemar's tests, Cochran's Q with Benjamini-Hochberg correction, and Wilcoxon signed-rank tests. <b>Results:</b> The inclusion of clinical history substantially improved diagnostic accuracy for all models. ChatGPT's accuracy increased from 50.7% in Phase 1 to 80.4% in Phase 2. Gemini's accuracy improved from 39.9% to 72.5%, and DeepSeek's accuracy rose from 30.4% to 75.4%. In Phase 2, diagnostic accuracy reached at least 65% across most disease-nature and organ-system categories. However, agreement with the reference differential diagnoses remained modest, with average overlap rates of 6.99% for ChatGPT, 36.39% for Gemini, and 32.74% for DeepSeek. <b>Conclusions:</b> The provision of a brief clinical history significantly enhances the diagnostic accuracy of large language models on visual internal medicine cases.
In this benchmark, performance differences between models were smaller in Phase 2 than in Phase 1. While diagnostic precision improves markedly, the models' ability to generate comprehensive differential diagnoses that align with expert consensus is still limited. These findings underscore the utility of context-aware, multimodal LLMs for educational support and structured diagnostic practice in supervised settings while also highlighting the need for more sophisticated, semantics-sensitive benchmarks for evaluating diagnostic reasoning.
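The Wilson 95% confidence intervals mentioned in the methods can be reproduced from the reported proportions. The sketch below computes the interval for ChatGPT's Phase 2 accuracy; the correct-case count of 111 is inferred from the rounded 80.4% of 138 cases and is an assumption, not a figure taken directly from the paper.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion.

    Unlike the naive Wald interval, the Wilson interval stays within
    [0, 1] and behaves well for proportions near 0 or 1.
    """
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    margin = (z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))) / denom
    return centre - margin, centre + margin

# ChatGPT, Phase 2: 80.4% of 138 cases ~ 111 correct (count inferred
# from the rounded percentage reported in the abstract).
lo, hi = wilson_ci(111, 138)
print(f"95% CI: {lo:.1%} - {hi:.1%}")  # roughly 73.0% - 86.2%
```

McNemar's test, also named in the methods, would additionally require the per-case discordant counts (cases correct in one phase but not the other), which the abstract does not report.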
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,239 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,095 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,463 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,428 citations