OpenAlex · Updated hourly · Last updated: 15 March 2026, 07:03

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Evaluation of DeepSeek-R1 and ChatGPT on the Chinese National Medical Licensing Examination: A Multi-Year Comparative Study

2025 · 0 citations · Open Access

Citations: 0

Authors: 7

Year: 2025

Abstract

<bold>Background</bold> Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and reasoning. However, their real-world applicability in high-stakes medical assessments remains underexplored, particularly in non-English contexts. This study evaluates the performance of DeepSeek-R1 and ChatGPT on the Chinese National Medical Licensing Examination (NMLE), a comprehensive benchmark of medical knowledge and clinical reasoning.

<bold>Methods</bold> We evaluated ChatGPT and DeepSeek-R1 on the Chinese National Medical Licensing Examination (2019–2021), using question-level binary accuracy (correct = 1, incorrect = 0) as the outcome. A generalized linear mixed model (GLMM) with a binomial distribution and logit link examined the fixed effects of model type, year, and subject unit, including their interactions, with random intercepts across questions. Post hoc pairwise comparisons assessed differences across model–year interactions.

<bold>Results</bold> DeepSeek-R1 significantly outperformed ChatGPT overall (β = –1.829, <italic>p</italic> < 0.001). Temporal analysis revealed a significant decline in ChatGPT's accuracy from 2019 to 2021 (<italic>p</italic> < 0.05), whereas DeepSeek-R1 maintained stable performance. Across subjects, Unit 3 showed the highest accuracy relative to Unit 1 (β = 0.344, <italic>p</italic> = 0.001). A significant interaction in 2020 (β = –0.567, <italic>p</italic> = 0.009) indicated an amplified performance gap between the two models. These results highlight the importance of model selection, domain adaptation, and temporal robustness when deploying large language models for professional medical assessments.

<bold>Conclusions</bold> This longitudinal evaluation highlights both the potential and the limitations of LLMs in medical licensing contexts. While current models show promising results, further fine-tuning is necessary for clinical applicability. The NMLE offers a robust benchmark for the future development of trustworthy AI-assisted medical decision support tools in non-English settings.
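The methods model each question attempt as a binary outcome under a binomial distribution with a logit link. A minimal sketch of that outcome model, using simulated data and a plain logistic regression fit by iteratively reweighted least squares (the paper's actual GLMM additionally includes random question intercepts, and all accuracy values below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_questions = 300

# Simulate question-level accuracy for two models.
# model_flag: 0 = reference model, 1 = comparison model (assumed log-odds).
rows = []
for model_flag, logit in [(0, 1.2), (1, 0.4)]:
    p = 1.0 / (1.0 + np.exp(-logit))          # probability of a correct answer
    for c in rng.binomial(1, p, size=n_questions):
        rows.append((model_flag, c))

X = np.array([[1.0, m] for m, _ in rows])     # intercept + model indicator
y = np.array([c for _, c in rows], dtype=float)

# Iteratively reweighted least squares for a binomial GLM with logit link.
beta = np.zeros(2)
for _ in range(25):
    eta = X @ beta
    mu = 1.0 / (1.0 + np.exp(-eta))           # fitted probabilities
    W = mu * (1.0 - mu)                        # IRLS working weights
    z = eta + (y - mu) / W                     # working response
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))

print(beta)  # beta[1] < 0: the comparison model has lower log-odds of success
```

A negative coefficient on the model indicator corresponds to the abstract's finding that one model's odds of answering correctly are significantly lower overall; the paper's fuller specification adds year and subject-unit terms and their interactions.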
