This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Reliability and Performance Stability of Large Language Models in Medical Knowledge Assessment: Evidence from the European Board of Nuclear Medicine Examination
Citations: 0
Authors: 5
Year: 2026
Abstract
Background: Large language models (LLMs) have demonstrated strong performance on general medical examinations. Whether this performance translates to highly specialized, subspecialty-level board examinations remains unclear. This study evaluates the accuracy and inter-run stability of contemporary LLMs using authentic European Board of Nuclear Medicine (EBNM) Fellowship Examination material.

Methods: Ten LLMs (five proprietary, five open-source) completed 50 EBNM multiple-choice questions across five independent zero-shot runs, resulting in 2,500 total inferences. Accuracy was calculated per model across runs. Inter-run reliability was assessed using pairwise Cohen’s kappa coefficients. Pairwise model differences were analyzed using McNemar’s test with Bonferroni correction (α = 0.0011).

Results: Mean accuracy ranged from 53.6% to 100.0%, with all models exceeding an illustrative 50% pass threshold. Inter-run reliability varied substantially (κ = 0.370–1.000; mean κ = 0.716). High accuracy did not consistently correspond to high reproducibility. Gemini 2.5 Pro achieved high accuracy (93.6%) but showed the lowest reliability (κ = 0.370), whereas DeepSeek V3.2 demonstrated perfect accuracy and agreement across all runs. No significant correlation between accuracy and reliability was observed (Spearman ρ = 0.394, p = 0.26).

Conclusions: LLMs demonstrate strong but heterogeneous performance on high-stakes medical knowledge assessments. Differences in reproducibility highlight the need for multi-run evaluation when considering LLMs for educational or clinical knowledge-support applications and for continued validation using non-disclosed examination material.
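The inter-run reliability metric described above can be sketched as follows. The paper's analysis code is not reproduced here, so the function names and the toy run data below are illustrative assumptions; the sketch computes Cohen's kappa for each pair of runs (treating each question's outcome as correct/incorrect) and averages over all pairs, which matches the pairwise-kappa design the abstract describes.

```python
from itertools import combinations

def cohens_kappa(a, b):
    """Cohen's kappa for two binary outcome sequences (1 = correct, 0 = incorrect)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    pa1 = sum(a) / n                             # marginal P(correct) in run a
    pb1 = sum(b) / n                             # marginal P(correct) in run b
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)      # chance agreement from marginals
    if p_e == 1.0:
        return 1.0  # both runs constant and identical (e.g., all answers correct)
    return (p_o - p_e) / (1 - p_e)

def mean_pairwise_kappa(runs):
    """Mean Cohen's kappa over all pairs of independent runs for one model."""
    pairs = list(combinations(runs, 2))
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)

# Toy data: 3 runs over 6 questions (the study used 5 runs over 50 questions)
runs = [
    [1, 1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 1],
]
print(round(mean_pairwise_kappa(runs), 3))  # → 0.524
```

Note the `p_e == 1.0` guard: a model that answers every question correctly in every run (as DeepSeek V3.2 reportedly did) has degenerate marginals, and kappa is conventionally reported as 1.0 in that case rather than left undefined.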