This is an overview page with metadata for this scientific publication. The full article is available from the publisher.
Performance of DeepSeek-R1, ChatGPT (GPT-o3-mini), and Gemini 2.0 Flash on German Medical Multiple-Choice Questions: Comparative Evaluation (Preprint)
Citations: 0
Authors: 5
Year: 2025
Abstract
Background: Despite the transformative potential of artificial intelligence (AI)–based chatbots in medicine, their implementation is hindered by data privacy and security concerns. DeepSeek offers a conceivable solution through its capability for local offline operation. However, as of 2025, it remains unclear whether DeepSeek can achieve an accuracy comparable to that of conventional, cloud-based AI chatbots.

Objective: This study aims to evaluate whether DeepSeek, an AI-based chatbot capable of offline operation, achieves answer accuracy on German medical multiple-choice questions (MCQs) comparable to that of the leading chatbots ChatGPT and Gemini, thereby assessing its potential as a privacy-preserving alternative for clinical use.

Methods: A total of 200 interdisciplinary MCQs from the German Progress Test Medicine were administered to ChatGPT (GPT-o3-mini), DeepSeek (DeepSeek-R1), and Gemini (Gemini 2.0 Flash). Accuracy was defined as the proportion of correctly solved questions. Overall differences among the 3 models were tested with the Cochran Q test, while pairwise comparisons were conducted using the McNemar test. Subgroup analyses were performed by medical domain (Fisher exact test) and question length (Wilcoxon rank-sum test). An a priori power analysis indicated a minimum sample size of 195 questions.

Results: All 3 chatbots surpassed the conventional passing threshold of 60%, with accuracies of 96% (192/200) for DeepSeek, 94% (188/200) for Gemini, and 92.5% (185/200) for ChatGPT. The overall difference among models was not statistically significant (P=.10), nor were the pairwise comparisons. However, incorrect responses were significantly associated with longer question length for DeepSeek (P=.049) and ChatGPT (P=.04) but not for Gemini. No significant differences in performance were observed across clinical versus preclinical domains or medical specialties (all P>.05).

Conclusions: Overall, DeepSeek demonstrates outstanding performance on German medical MCQs, comparable to the widely used chatbots ChatGPT and Gemini. Similar to ChatGPT, DeepSeek's performance declined with increasing question length, highlighting verbosity as a persistent challenge for large language models. While DeepSeek's offline capability and lower operational costs are advantageous, its safe and reliable application in clinical contexts requires further investigation.
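For readers unfamiliar with the paired-binary tests named in the Methods, the sketch below illustrates how such an analysis could be run in Python with statsmodels. The 0/1 result vectors are synthetic placeholders, not the study's data; only the per-model totals (192, 188, and 185 correct of 200) are taken from the abstract, so the printed P values will not match the published ones.

```python
# A minimal sketch of the paired-binary analysis described in the Methods:
# Cochran Q across all three models, McNemar for pairwise contrasts.
# The 0/1 vectors below are synthetic placeholders, NOT the study's data.
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

rng = np.random.default_rng(0)
n = 200  # number of MCQs

def toy_results(n_correct: int) -> np.ndarray:
    """Return a shuffled 0/1 vector with the given number of correct answers."""
    x = np.array([1] * n_correct + [0] * (n - n_correct))
    rng.shuffle(x)
    return x

deepseek = toy_results(192)  # 96.0% accuracy
gemini = toy_results(188)    # 94.0% accuracy
chatgpt = toy_results(185)   # 92.5% accuracy

# Cochran Q: overall difference among the three models,
# treating each question as a block (rows) and models as conditions (columns).
q_res = cochrans_q(np.column_stack([deepseek, gemini, chatgpt]))
print(f"Cochran Q = {q_res.statistic:.2f}, P = {q_res.pvalue:.3f}")

# McNemar: pairwise comparisons on the 2x2 table of per-question (dis)agreements.
for name, a, b in [("DeepSeek vs Gemini", deepseek, gemini),
                   ("DeepSeek vs ChatGPT", deepseek, chatgpt),
                   ("Gemini vs ChatGPT", gemini, chatgpt)]:
    table = np.array([[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
                      [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]])
    m_res = mcnemar(table, exact=True)
    print(f"{name}: P = {m_res.pvalue:.3f}")
```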
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,485 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,371 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,827 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,549 citations