This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Evaluation of Large Language Models on the Chinese Dental Licensing Examination
2025 · 2 authors · 0 citations
Abstract
<bold>Objective:</bold> This study aimed to evaluate the performance of large language models (LLMs) on the Chinese Dental Licensing Examination (CDLE). It also examined whether including an ‘unknown’ option in prompts—or combining this option with a penalty for incorrect answers—could improve model accuracy and reduce hallucinations. <bold>Methods:</bold> The official preparation book, titled <italic>Historical Chinese Dental Licensing Examinations</italic>, authored by the Chinese National Licensed Physician Qualification Examination Proposition Research Group, was used as the data source. Three cloud-based models (Qwen3-Max, Qwen-Plus, and DeepSeek-V3.1) and two locally deployed models (Qwen3-32B and GPT-OSS-120B) were evaluated on the CDLE. A custom-designed program was developed to administer the CDLE automatically, using the OpenAI API to communicate with both locally deployed and cloud-based LLMs. Model performance was evaluated at both the exam and question levels. Exam-level performance was assessed by mean accuracy (± standard deviation, SD) and pass/fail outcomes, while question-level performance was evaluated primarily by accuracy with 95% and 99% confidence intervals (CIs). <bold>Results:</bold> A dataset comprising four CDLEs (2,400 questions in total) was constructed. Each question was a five-option, single-answer multiple-choice question. Qwen3-Max, Qwen-Plus, DeepSeek-V3.1, Qwen3-32B, and GPT-OSS-120B achieved exam-level mean accuracies ± SD of 0.866±0.089, 0.851±0.0767, 0.737±0.0738, 0.748±0.0868, and 0.652±0.0799, respectively. At the question level, the accuracies with 95% CIs were 0.865 (0.852–0.878), 0.851 (0.837–0.865), 0.727 (0.709–0.745), 0.741 (0.724–0.756), and 0.651 (0.634–0.671), respectively. Prompts that included an ‘unknown’ option—or combined it with a penalty for incorrect answers—did not improve model accuracy.
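The abstract describes an automated exam runner that sends each multiple-choice question, optionally with an ‘unknown’ escape option, to an OpenAI-compatible endpoint. The sketch below shows one plausible way to format such a prompt as chat messages; the question text, system prompt, and wording of the ‘unknown’ instruction are illustrative assumptions, not the authors’ actual prompts.

```python
# Hypothetical prompt builder for a five-option, single-answer MCQ.
# The resulting messages list could be passed to any OpenAI-compatible
# chat-completions client, whether the model is cloud-hosted or local.

def build_messages(stem, options, allow_unknown=False):
    """Format an MCQ as chat messages for an LLM exam run."""
    lines = [stem] + [f"{label}. {text}" for label, text in options]
    instruction = "Answer with the single letter of the correct option."
    if allow_unknown:
        # Variant tested in the study: offer an explicit escape option.
        instruction += " If you are not sure, answer 'unknown' instead."
    lines.append(instruction)
    return [
        {"role": "system", "content": "You are taking a dental licensing exam."},
        {"role": "user", "content": "\n".join(lines)},
    ]

# Example with a made-up question (content is illustrative only):
msgs = build_messages(
    "Which tooth is most often congenitally missing?",
    [("A", "Central incisor"), ("B", "Lateral incisor"),
     ("C", "Canine"), ("D", "First molar"), ("E", "Third molar")],
    allow_unknown=True,
)
print(msgs[1]["content"])
```

The returned letter (or ‘unknown’) would then be compared against the answer key to score each question.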
<bold>Conclusion:</bold> All models successfully passed the CDLEs, with some achieving remarkably high scores. Among them, Qwen3-Max demonstrated the best overall performance across all evaluated metrics. Other uncertainty estimation methods should be considered instead of simply adding an ‘unknown’ option to the input prompt. In the future, LLMs are expected to play an important role in dental education, particularly in supporting medical students’ self-directed learning.
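The question-level accuracies above are reported with confidence intervals over the 2,400-question pool. A minimal sketch of one standard approach, the normal-approximation (Wald) interval for a binomial proportion, is shown below; the correct-answer count of 2,076 is an assumption chosen to reproduce Qwen3-Max's accuracy of 0.865, and the paper may have used a different interval method, so the bounds need not match the reported ones exactly.

```python
import math

def wald_ci(correct, total, z=1.959964):
    """Two-sided normal-approximation (Wald) CI for a proportion.

    z defaults to the 97.5th percentile of the standard normal,
    giving a 95% interval.
    """
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)
    return p, p - z * se, p + z * se

# Assumed count: 2,076 correct out of 2,400 gives p = 0.865.
p, lo, hi = wald_ci(2076, 2400)
print(f"{p:.3f} ({lo:.3f}-{hi:.3f})")
```

With roughly 2,400 questions per model, the 95% intervals are narrow (about ±0.014 at p = 0.865), which is why the reported bounds sit so close to the point estimates.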
Related works
Persistence of coronaviruses on inanimate surfaces and their inactivation with biocidal agents
2020 · 3,881 citations
State of the art of zirconia for dental applications
2007 · 2,059 citations
A Critical Review of the Durability of Adhesion to Tooth Tissue: Methods and Results
2005 · 1,932 citations
Comparison of surface roughness of oral hard materials to the threshold surface roughness for bacterial plaque retention: A review of the literature
1997 · 1,769 citations
Coronavirus Disease 2019 (COVID-19): Emerging and Future Challenges for Dental and Oral Medicine
2020 · 1,715 citations