This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
ChatGPT vs DeepSeek: A Comparative Study of Diagnostic Accuracy and Clinical Reasoning in Rare and Complex Diseases
Citations: 0 · Authors: 6 · Year: 2025
Abstract
Diagnostic errors in rare and complex diseases contribute significantly to morbidity and mortality. The ability of large language models (LLMs) to enhance diagnostic performance in such cases remains uncertain. This study compares the diagnostic accuracy, clinical reasoning quality, and inference efficiency of three ChatGPT variants (o3-mini, o3-mini-high, o1) and DeepSeek-R1 using 30 English-language case reports of rare and complex diseases from 26 specialties across 15 countries, sourced from the PubMed and Web of Science Core Collection databases. Cases were selected to avoid overlap with model training data. Each case was processed once by each model, with outputs anonymized and evaluated in a double-blind manner by two board-certified physicians (each with >15 years' clinical experience) and ChatGPT-4o. Diagnostic accuracy, the primary outcome, ranged between 30.0% and 40.0% with no significant differences observed among models (Cochran's Q test, P = 0.16). ChatGPT-o1 achieved the highest accuracy (12/30, 40.0%; 95% CI, 24.6%-57.7%), followed by ChatGPT-o3-mini and o3-mini-high (each 11/30, 36.7%), and DeepSeek-R1 (9/30, 30.0% for both English and Chinese language inputs). Mean reasoning scores differed significantly (P < 0.05): ChatGPT-o1, 4.08 ± 0.82; DeepSeek-R1 (English), 3.86 ± 0.86; ChatGPT-o3-mini, 3.71 ± 0.90; ChatGPT-o3-mini-high, 3.69 ± 0.80; DeepSeek-R1 (Chinese), 3.67 ± 0.84. Inter-evaluator agreement was high (ICC = 0.84; 95% CI, 0.80-0.88). Inference times varied significantly (P < 0.001), with ChatGPT-o3-mini being fastest (7.0 ± 3.8 s) and DeepSeek-R1 (English) slowest (46.5 ± 32.5 s). Advanced LLMs demonstrate potential to support the diagnosis of rare and complex diseases, with transparent reasoning processes that may aid clinical decision-making and medical education. Further domain-specific refinement and prospective clinical validation are essential for safe and effective integration into clinical practice.
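The primary accuracy comparison cited above (P = 0.16) uses Cochran's Q test, which compares success rates across several related (paired) binary outcomes — here, correct vs. incorrect diagnosis per case for each model. A minimal sketch of that test in pure Python, using hypothetical illustrative data rather than the study's actual per-case results:

```python
def cochrans_q(table):
    """Cochran's Q statistic for a subjects-by-treatments table of 0/1 outcomes.

    Rows are subjects (cases), columns are treatments (models). Q is
    asymptotically chi-squared distributed with k - 1 degrees of freedom,
    where k is the number of treatments.
    """
    k = len(table[0])                                        # number of models
    col = [sum(row[j] for row in table) for j in range(k)]   # successes per model
    row_tot = [sum(row) for row in table]                    # successes per case
    total = sum(col)
    num = (k - 1) * (k * sum(g * g for g in col) - total * total)
    den = k * total - sum(r * r for r in row_tot)
    return num / den

# Hypothetical per-case correctness for 4 models on 5 cases
# (1 = correct diagnosis, 0 = incorrect); NOT the study's real data.
cases = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
q = cochrans_q(cases)
```

The resulting statistic is then compared against a chi-squared distribution with k - 1 degrees of freedom to obtain a P value (e.g. via `scipy.stats.chi2.sf(q, k - 1)`); cases on which all models agree (all 0s or all 1s) contribute nothing to Q, which is why the test is well suited to paired designs like this one.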
Highlights
While LLMs showed similar diagnostic accuracy (30-40%) in rare and complex diseases, ChatGPT-o1 significantly excelled in the quality of its clinical reasoning.
Inference speeds varied dramatically (7-47 s), highlighting a critical trade-off between model performance and real-world utility.
The transparent reasoning of LLMs shows clear promise as a tool to support clinical decision-making and medical education.
Safe clinical implementation is dependent on future domain-specific refinement and prospective validation.
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,260 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,116 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,493 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,438 citations