This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
A Dataset for Evaluating Large Language Models on Chinese National Medical Licensing Examinations
0
Citations
7
Authors
2026
Year
Abstract
Large language models (LLMs) are increasingly applied in medical education, question answering, and clinical reasoning, yet standardized datasets in non-English contexts remain limited. To address this gap, we present CNMLEQA, a benchmark dataset for evaluating LLMs on the Chinese National Medical Licensing Examination. The dataset integrates question-answer pairs from three sources: PubMed, GitHub, and MedExamLLM. CNMLEQA comprises two subsets, CNMLEQA-10k (9,890 questions) and CNMLEQA-3k (2,949 questions), each consisting of multiple-choice questions with five options and one correct answer. Questions are annotated along three dimensions: (1) question type (knowledge-based or case-based); (2) auxiliary metadata such as examination year; and (3) clinical scenario information across five categories: disease or diagnosis, surgery, medication, laboratory examination, and symptom or sign. Annotation was conducted by clinical experts. To validate the dataset, we evaluated state-of-the-art LLMs including Gemini, DeepSeek, GPT, Qwen, and LLaMA, and conducted fine-tuning experiments on Qwen models. Qwen2.5-32B achieved an accuracy of 90.88% on CNMLEQA-10k, while DeepSeek-R1 achieved an accuracy of 91.59% on CNMLEQA-3k. The fine-tuning experiments further demonstrated significant performance improvements. CNMLEQA provides a multidimensional, clinically grounded benchmark for advancing LLM evaluation in Chinese medical applications.
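To make the item structure described in the abstract concrete, the following is a minimal Python sketch of what one annotated question and an accuracy computation might look like. The abstract does not specify the dataset's file format or schema, so the class name `ExamQuestion`, every field name, and the option labels A to E are assumptions chosen for illustration only; the accuracy function is plain exact-match scoring and may differ from the authors' evaluation protocol.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumption: options are labeled A-E. The abstract only states "five options".
OPTION_LABELS = ("A", "B", "C", "D", "E")


@dataclass
class ExamQuestion:
    """Hypothetical record layout; field names are illustrative, not the dataset's schema."""
    question: str
    options: Dict[str, str]            # option label -> option text (five entries)
    answer: str                        # label of the single correct option
    question_type: str                 # "knowledge-based" or "case-based"
    exam_year: Optional[int] = None    # auxiliary metadata
    # Clinical scenario annotations, e.g. {"medication": [...], "symptom_or_sign": [...]}
    clinical_scenario: Dict[str, List[str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if tuple(sorted(self.options)) != OPTION_LABELS:
            raise ValueError("expected exactly the five option labels A-E")
        if self.answer not in self.options:
            raise ValueError("answer must be one of the option labels")


def accuracy(questions: List[ExamQuestion], predictions: List[str]) -> float:
    """Fraction of questions whose predicted label equals the annotated answer."""
    if len(questions) != len(predictions):
        raise ValueError("need exactly one prediction per question")
    if not questions:
        return 0.0
    correct = sum(q.answer == p for q, p in zip(questions, predictions))
    return correct / len(questions)
```

Under these assumptions, a model's predicted option labels would be collected in question order and passed to `accuracy` together with the list of `ExamQuestion` records; multiplying the result by 100 gives a percentage comparable in form to the figures quoted in the abstract.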
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,687 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,591 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,114 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,867 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations