This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
ChatGPT vs DeepSeek: A Comparative Study of Diagnostic Accuracy and Clinical Reasoning in Rare and Complex Diseases
Citations: 0 · Authors: 6 · Year: 2025
Abstract
Diagnostic errors in rare and complex diseases contribute significantly to morbidity and mortality. The ability of large language models (LLMs) to enhance diagnostic performance in such cases remains uncertain. This study compares the diagnostic accuracy, clinical reasoning quality, and inference efficiency of three ChatGPT variants (o3-mini, o3-mini-high, o1) and DeepSeek-R1 using 30 English-language case reports of rare and complex diseases from 26 specialties across 15 countries, sourced from the PubMed and Web of Science Core Collection databases. Cases were selected to avoid overlap with model training data. Each case was processed once by each model, with outputs anonymized and evaluated in a double-blind manner by two board-certified physicians (each with >15 years' clinical experience) and ChatGPT-4o. Diagnostic accuracy, the primary outcome, ranged between 30.0% and 40.0% with no significant differences observed among models (Cochran's Q test, P = 0.16). ChatGPT-o1 achieved the highest accuracy (12/30, 40.0%; 95% CI, 24.6%-57.7%), followed by ChatGPT-o3-mini and o3-mini-high (each 11/30, 36.7%), and DeepSeek-R1 (9/30, 30.0% for both English and Chinese language inputs). Mean reasoning scores differed significantly (P < 0.05): ChatGPT-o1, 4.08 ± 0.82; DeepSeek-R1 (English), 3.86 ± 0.86; ChatGPT-o3-mini, 3.71 ± 0.90; ChatGPT-o3-mini-high, 3.69 ± 0.80; DeepSeek-R1 (Chinese), 3.67 ± 0.84. Inter-evaluator agreement was high (ICC = 0.84; 95% CI, 0.80-0.88). Inference times varied significantly (P < 0.001), with ChatGPT-o3-mini being fastest (7.0 ± 3.8 s) and DeepSeek-R1 (English) slowest (46.5 ± 32.5 s). Advanced LLMs demonstrate potential to support the diagnosis of rare and complex diseases, with transparent reasoning processes that may aid clinical decision-making and medical education. Further domain-specific refinement and prospective clinical validation are essential for safe and effective integration into clinical practice.
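The primary accuracy comparison cited above (P = 0.16) uses Cochran's Q test, which compares success rates across several related (paired) binary outcomes — here, correct vs. incorrect diagnosis per case for each model. A minimal sketch of that test in pure Python, using hypothetical illustrative data rather than the study's actual per-case results:

```python
def cochrans_q(table):
    """Cochran's Q statistic for a subjects-by-treatments table of 0/1 outcomes.

    Rows are subjects (cases), columns are treatments (models). Q is
    asymptotically chi-squared distributed with k - 1 degrees of freedom,
    where k is the number of treatments.
    """
    k = len(table[0])                                        # number of models
    col = [sum(row[j] for row in table) for j in range(k)]   # successes per model
    row_tot = [sum(row) for row in table]                    # successes per case
    total = sum(col)
    num = (k - 1) * (k * sum(g * g for g in col) - total * total)
    den = k * total - sum(r * r for r in row_tot)
    return num / den

# Hypothetical per-case correctness for 4 models on 5 cases
# (1 = correct diagnosis, 0 = incorrect); NOT the study's real data.
cases = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
q = cochrans_q(cases)
```

The resulting statistic is then compared against a chi-squared distribution with k - 1 degrees of freedom to obtain a P value (e.g. via `scipy.stats.chi2.sf(q, k - 1)`); cases on which all models agree (all 0s or all 1s) contribute nothing to Q, which is why the test is well suited to paired designs like this one.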
Highlights
While LLMs showed similar diagnostic accuracy (30-40%) in rare and complex diseases, ChatGPT-o1 significantly excelled in the quality of its clinical reasoning.
Inference speeds varied dramatically (7-47 s), highlighting a critical trade-off between model performance and real-world utility.
The transparent reasoning of LLMs shows clear promise as a tool to support clinical decision-making and medical education.
Safe clinical implementation is dependent on future domain-specific refinement and prospective validation.
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,260 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,116 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,493 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,438 citations