This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Reliability and Performance Stability of Large Language Models in Medical Knowledge Assessment: Evidence from the European Board of Nuclear Medicine Examination
Citations: 0
Authors: 5
Year: 2026
Abstract
Background: Large language models (LLMs) have demonstrated strong performance on general medical examinations. Whether this performance translates to highly specialized, subspecialty-level board examinations remains unclear. This study evaluates the accuracy and inter-run stability of contemporary LLMs using authentic European Board of Nuclear Medicine (EBNM) Fellowship Examination material.

Methods: Ten LLMs (five proprietary, five open-source) completed 50 EBNM multiple-choice questions across five independent zero-shot runs, resulting in 2,500 total inferences. Accuracy was calculated per model across runs. Inter-run reliability was assessed using pairwise Cohen’s kappa coefficients. Pairwise model differences were analyzed using McNemar’s test with Bonferroni correction (α = 0.0011).

Results: Mean accuracy ranged from 53.6% to 100.0%, with all models exceeding an illustrative 50% pass threshold. Inter-run reliability varied substantially (κ = 0.370–1.000; mean κ = 0.716). High accuracy did not consistently correspond to high reproducibility. Gemini 2.5 Pro achieved high accuracy (93.6%) but showed the lowest reliability (κ = 0.370), whereas DeepSeek V3.2 demonstrated perfect accuracy and agreement across all runs. No significant correlation between accuracy and reliability was observed (Spearman ρ = 0.394, p = 0.26).

Conclusions: LLMs demonstrate strong but heterogeneous performance on high-stakes medical knowledge assessments. Differences in reproducibility highlight the need for multi-run evaluation when considering LLMs for educational or clinical knowledge-support applications and for continued validation using non-disclosed examination material.
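The inter-run reliability metric described above can be sketched as follows. The paper's analysis code is not reproduced here, so the function names and the toy run data below are illustrative assumptions; the sketch computes Cohen's kappa for each pair of runs (treating each question's outcome as correct/incorrect) and averages over all pairs, which matches the pairwise-kappa design the abstract describes.

```python
from itertools import combinations

def cohens_kappa(a, b):
    """Cohen's kappa for two binary outcome sequences (1 = correct, 0 = incorrect)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    pa1 = sum(a) / n                             # marginal P(correct) in run a
    pb1 = sum(b) / n                             # marginal P(correct) in run b
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)      # chance agreement from marginals
    if p_e == 1.0:
        return 1.0  # both runs constant and identical (e.g., all answers correct)
    return (p_o - p_e) / (1 - p_e)

def mean_pairwise_kappa(runs):
    """Mean Cohen's kappa over all pairs of independent runs for one model."""
    pairs = list(combinations(runs, 2))
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)

# Toy data: 3 runs over 6 questions (the study used 5 runs over 50 questions)
runs = [
    [1, 1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 1],
]
print(round(mean_pairwise_kappa(runs), 3))  # → 0.524
```

Note the `p_e == 1.0` guard: a model that answers every question correctly in every run (as DeepSeek V3.2 reportedly did) has degenerate marginals, and kappa is conventionally reported as 1.0 in that case rather than left undefined.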