OpenAlex · Updated hourly · Last updated: 10.05.2026, 18:26

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Fine-grained evaluation of large language models in medicine using non-parametric cognitive diagnostic modeling

2026 · 0 citations · Scientific Reports · Open Access
Open full text at publisher

0 citations · 4 authors · year 2026

Abstract

With the rapid advancement of large language models (LLMs), efficiently and accurately evaluating their capabilities is essential for both developers and users. Unfortunately, most benchmarks evaluate LLMs using average scores. This approach oversimplifies evaluation by overlooking nuanced performance differences across specific knowledge domains, failing to provide a comprehensive analysis of the models' strengths and weaknesses. Safe clinical deployment of LLMs requires moving beyond simple accuracy scores to identify specific knowledge gaps. This study introduces an innovative interdisciplinary approach by integrating measurement theory and psychometric modeling into LLM research, bridging artificial intelligence with educational psychology. Based on 2,809 items from the test bank administered by the National Center for Health Professions Education Development, it employs a non-parametric cognitive diagnostic approach, grounded in cognitive diagnostic assessment, to evaluate the performance of 41 LLMs across 22 medical subdomains. The number of attributes mastered by the evaluated LLMs ranges from 17 to 20. Models with similar total scores can differ notably in their mastery of specific areas, showing strengths in some fields and gaps in others. Furthermore, model size does not always predict comprehensive medical knowledge. The LLMs demonstrate exceptional performance in several areas, achieving 100% mastery in 15 fields such as Cardiology, Dermatology, and Endocrinology, underscoring their strong medical knowledge. However, notable variations exist across certain domains. For instance, while Pharmacology and Neuroscience achieve high mastery proportions of 97.56%, Anesthetics & ITU and Emergency Medicine achieve lower proportions of 95.12%. Similarly, Radiology has a mastery proportion of 87.80%, while ECG & hypertension & lipids and Liver Disorders show 0%, revealing substantial gaps in these specialized fields.
This psychometrically-grounded approach provides multidimensional evaluation of LLMs, identifying specific competency gaps critical for clinical deployment. This methodology serves as an essential quality assurance tool for hospitals, developers, and regulators, enabling domain-specific validation to mitigate risks and ensure patient safety before clinical implementation.
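The abstract does not spell out the exact non-parametric procedure, but a standard non-parametric cognitive-diagnostic classifier (the NPC method of Chiu & Douglas) illustrates the idea: each respondent (here, an LLM) is assigned the attribute-mastery pattern whose ideal response vector lies closest, in Hamming distance, to its observed item responses. A minimal sketch, assuming a conjunctive (DINA-style) ideal-response rule and a small hypothetical Q-matrix; the function names, Q-matrix, and responses are illustrative, not taken from the paper:

```python
import itertools
import numpy as np

def ideal_response(alpha, Q):
    # Conjunctive (DINA-style) rule: an item is answered correctly only if
    # every attribute the Q-matrix requires for that item is mastered.
    return np.all(Q <= alpha, axis=1).astype(int)

def npc_classify(responses, Q):
    # Non-parametric classification: enumerate all 2^K attribute patterns,
    # build each pattern's ideal response vector, and assign every respondent
    # the pattern with minimal Hamming distance to its observed responses.
    K = Q.shape[1]
    patterns = np.array(list(itertools.product([0, 1], repeat=K)))
    ideals = np.array([ideal_response(a, Q) for a in patterns])
    classified = []
    for r in responses:
        dists = np.sum(ideals != r, axis=1)  # Hamming distances
        classified.append(patterns[np.argmin(dists)])
    return np.array(classified)

# Hypothetical Q-matrix: 4 items, 2 attributes.
Q = np.array([[1, 0],
              [0, 1],
              [1, 1],
              [1, 0]])
# A "model" correct on all attribute-1 items, wrong wherever attribute 2 is needed.
responses = np.array([[1, 0, 0, 1]])
print(npc_classify(responses, Q))  # → [[1 0]]
```

With real data, the response matrix would be 41 models × 2,809 items and the Q-matrix would map items to the 22 medical subdomains; the per-model output row is exactly the attribute-mastery profile the study reports (17 to 20 attributes mastered).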


Topics

Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Simulation-Based Education in Healthcare