OpenAlex · Updated hourly · Last updated: 18.03.2026, 06:35

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Evaluating the Applicability of Advanced Large Language Models in Laboratory Medicine Test Questions: A Comparative Performance Study

2026 · 0 Citations · SHILAP Revista de lepidopterología · Open Access

Citations: 0
Authors: 13
Year: 2026

Abstract

Wenzheng Han,1,* Wenkai Zhu,1,* Gang Feng,1,* Yankang Wang,1 Guang Chen,2 Huan Zhou,3 Bin Quan,4 Qiwen Wu,1 Jianghua Yang,4 Kai Jin,5 Shaoneng Tao,6 Xiaoning Li,1 Qing Chen6

1Department of Clinical Laboratory, The First Affiliated Hospital, Wannan Medical College, Wuhu, Anhui, People’s Republic of China; 2Department of Pediatrics, The First Affiliated Hospital, Wannan Medical College, Wuhu, Anhui, People’s Republic of China; 3School of Laboratory Medicine, Wannan Medical College, Wuhu, Anhui, People’s Republic of China; 4Department of Infectious Diseases, The First Affiliated Hospital, Wannan Medical College, Wuhu, Anhui, People’s Republic of China; 5Department of Eye Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, Zhejiang, People’s Republic of China; 6Department of Nuclear Medicine, The First Affiliated Hospital, Wannan Medical College, Wuhu, Anhui, People’s Republic of China

*These authors contributed equally to this work

Correspondence: Qing Chen, Department of Nuclear Medicine, The First Affiliated Hospital, Wannan Medical College, Wuhu, Anhui, People’s Republic of China, Email chenq1104@wnmc.edu.cn

Background: While large language models (LLMs) show promise in medical education, their comprehensive performance in specialized domains such as medical laboratory science remains inadequately assessed.

Purpose: This study aimed to evaluate advanced LLMs on medical laboratory questions, assessing accuracy, natural language generation (NLG) quality, reasoning performance, and efficiency.

Methods: We conducted a multi-faceted evaluation of three advanced LLMs (DeepSeek-R1, Gemini-2.5 Pro, GPT-5), benchmarking them against medical laboratory scientists and earlier ChatGPT versions. The evaluation used 493 questions sourced from the internal Medical Laboratory Test Bank of Wannan Medical College, comprising both knowledge-based and reasoning-based single- and multiple-choice questions (SCQs and MCQs). Performance was measured by accuracy, Macro-F1, response time, NLG scores (ROUGE-L, METEOR), and a structured logical-reasoning assessment. Appropriate statistical tests (including χ², Wilcoxon, ANOVA, and non-parametric alternatives) with post-hoc corrections were applied to determine significance.

Results: DeepSeek-R1’s accuracy across all questions was 78.3%, approaching the 79.3% of the higher-performing senior expert. Notably, it excelled at complex reasoning-based MCQs, outperforming the senior experts with an accuracy of 64.4%, compared with 58.7% (SMLS-1) and 56.7% (SMLS-2). While ChatGPT-5 was the fastest model, DeepSeek-R1 exhibited intermediate efficiency, matching human experts on SCQs but requiring more time for MCQs. For NLG, DeepSeek-R1 consistently achieved the highest scores, with ROUGE-L scores of 0.36 ± 0.14 (total), 0.33 ± 0.15 (SCQ), and 0.38 ± 0.13 (MCQ), and METEOR scores of 0.53 ± 0.19 (total), 0.40 ± 0.17 (SCQ), and 0.63 ± 0.14 (MCQ). Furthermore, it significantly outperformed all other LLMs in the comprehensiveness of its logical reasoning. A critical strength was its consistent integration of key negative findings, which are vital for diagnosis.

Conclusion: DeepSeek-R1 approaches or even surpasses senior-expert performance on certain tasks, showing strong potential as an effective tool for education and assessment despite slower processing times.

Keywords: large language models, medical laboratory science, natural language generation, DeepSeek-R1, complex reasoning
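The Macro-F1 metric named in the Methods is the unweighted mean of the per-option F1 scores, so every answer class counts equally regardless of how often it appears. A minimal pure-Python sketch of that computation (the answer labels below are hypothetical illustrations, not data from the study):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (Macro-F1)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        # Tally true positives, false positives, false negatives for class c.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical single-choice answers: "A".."D" are the answer options.
truth = ["A", "B", "C", "A", "D", "B"]
model = ["A", "B", "C", "D", "D", "A"]
print(round(macro_f1(truth, model), 3))  # → 0.708
```

Because the mean is unweighted, a model that does well only on the most frequent answer option is penalized, which is why Macro-F1 is reported alongside raw accuracy.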


Topics

Artificial Intelligence in Healthcare and Education · Text Readability and Simplification · Topic Modeling