This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Evaluation of DeepSeek-R1 and ChatGPT on the Chinese National Medical Licensing Examination: A Multi-Year Comparative Study
0
Citations
7
Authors
2025
Year
Abstract
Background: Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and reasoning. However, their real-world applicability in high-stakes medical assessments remains underexplored, particularly in non-English contexts. This study evaluates the performance of DeepSeek-R1 and ChatGPT on the Chinese National Medical Licensing Examination (NMLE), a comprehensive benchmark of medical knowledge and clinical reasoning.
Methods: We evaluated the performance of ChatGPT and DeepSeek-R1 on the NMLE (2019–2021) using question-level binary accuracy (correct = 1, incorrect = 0) as the outcome. A generalized linear mixed model (GLMM) with a binomial distribution and logit link was used to examine fixed effects of model type, year, and subject unit, including their interactions, while accounting for random intercepts across questions. Post hoc pairwise comparisons were conducted to assess differences across model–year interactions.
Results: DeepSeek-R1 significantly outperformed ChatGPT overall (β = –1.829, p < 0.001). Temporal analysis revealed a significant decline in ChatGPT's accuracy from 2019 to 2021 (p < 0.05), whereas DeepSeek-R1 maintained stable performance. Across subjects, Unit 3 showed the highest accuracy relative to Unit 1 (β = 0.344, p = 0.001). A significant interaction in 2020 (β = –0.567, p = 0.009) indicated an amplified performance gap between the two models. These results highlight the importance of model selection, domain adaptation, and temporal robustness when deploying large language models for professional medical assessments.
Conclusions: This longitudinal evaluation highlights the potential and limitations of LLMs in medical licensing contexts. While current models demonstrate promising results, further fine-tuning is necessary for clinical applicability. The NMLE offers a robust benchmark for the future development of trustworthy AI-assisted medical decision support tools in non-English settings.
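The GLMM described in the Methods can be sketched in code. The following is a minimal illustration, not the authors' analysis: the abstract does not name the software, so statsmodels' variational-Bayes mixed GLM is used here as a stand-in (an lme4-style frequentist fit would be the closest match), and the column names correct, llm, year, unit, and question_id, as well as the input file, are assumed for item-level data.

```python
# Sketch of the Methods model: binary accuracy with a binomial family and
# logit link, fixed effects for model type x year plus subject unit, and a
# random intercept per exam question. Column names are illustrative.
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

df = pd.read_csv("nmle_item_responses.csv")  # hypothetical item-level data

glmm = BinomialBayesMixedGLM.from_formula(
    "correct ~ C(llm) * C(year) + C(unit)",          # fixed effects + interaction
    vc_formulas={"question": "0 + C(question_id)"},  # random intercept per question
    data=df,
)
result = glmm.fit_vb()  # variational Bayes approximation to the mixed GLM
print(result.summary())
```

In this parameterization each coefficient is a contrast against a reference level, which is how effects such as Unit 3 versus Unit 1 or the model–year interaction terms reported in the Results arise.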
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,239 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,095 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,463 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,428 citations