OpenAlex · Updated hourly · Last updated: May 16, 2026, 21:08

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

A Dataset for Evaluating Large Language Models on Chinese National Medical Licensing Examinations

2026 · 0 citations · 7 authors · Scientific Data · Open Access

Abstract

Large language models (LLMs) are increasingly applied in medical education, question answering, and clinical reasoning, yet standardized datasets in non-English contexts remain limited. To address this gap, we present CNMLEQA, a benchmark dataset for evaluating LLMs on the Chinese National Medical Licensing Examination. The dataset integrates question-answer pairs from three sources: PubMed, GitHub, and MedExamLLM. CNMLEQA comprises two subsets, CNMLEQA-10k (9,890 questions) and CNMLEQA-3k (2,949 questions), each consisting of multiple-choice questions with five options and one correct answer. Questions are annotated along key dimensions: (1) question type (knowledge-based or case-based), (2) auxiliary metadata such as examination year, and (3) clinical scenario information across five dimensions: disease or diagnosis, surgery, medication, laboratory examination, and symptom or sign. Annotation was conducted by clinical experts. To validate the dataset, we evaluated state-of-the-art LLMs including Gemini, DeepSeek, GPT, Qwen, and LLaMA, and conducted fine-tuning experiments specifically on Qwen models. Results show that Qwen2.5-32B achieved an accuracy of 90.88% on CNMLEQA-10k, while DeepSeek-R1 achieved an accuracy of 91.59% on CNMLEQA-3k. The fine-tuning experiments further demonstrated significant performance improvements. CNMLEQA provides a multidimensional, clinically grounded benchmark for advancing LLM evaluation in Chinese medical applications.
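As a reading aid, the following is a minimal sketch of how one annotated CNMLEQA question could be represented, assuming a simple record layout; the class and field names (CNMLEQARecord, question_type, exam_year, clinical_scenario, and so on) are hypothetical illustrations of the annotation dimensions listed in the abstract, not the dataset's published schema.

```python
# Hypothetical record layout for one CNMLEQA question; all field names are
# illustrative assumptions, not the dataset's actual schema.
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CNMLEQARecord:
    question: str                    # question stem (knowledge-based or case-based)
    options: Dict[str, str]          # five answer options, keyed "A" to "E"
    answer: str                      # key of the single correct option
    question_type: str               # "knowledge-based" or "case-based"
    exam_year: Optional[int] = None  # auxiliary metadata such as examination year
    clinical_scenario: Dict[str, str] = field(default_factory=dict)
    # clinical_scenario can hold the five annotated dimensions:
    # disease or diagnosis, surgery, medication, laboratory examination,
    # and symptom or sign.


# Illustrative placeholder instance showing the shape of a case-based question.
example = CNMLEQARecord(
    question="[case vignette] ... Which of the following is the most likely diagnosis?",
    options={"A": "...", "B": "...", "C": "...", "D": "...", "E": "..."},
    answer="C",
    question_type="case-based",
    exam_year=2020,
    clinical_scenario={"disease_or_diagnosis": "...", "symptom_or_sign": "..."},
)
```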
