This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Context matching is not reasoning when performing generalized clinical evaluation of generative language models
0 citations · 15 authors · 2025
Abstract
Current discussion surrounding the clinical capabilities of generative language models (GLMs) predominantly centers on multiple-choice question-answer (MCQA) benchmarks derived from clinical licensing examinations. While accepted for human examinees, characteristics unique to GLMs call into question the validity of such benchmarks. Here, we validate five benchmarks using eight GLMs, ablating for parameter size and reasoning capabilities. Via prompt permutation, we test three key assumptions that underpin the generalizability of MCQA-based assessments: that knowledge is applied rather than memorized, that semantically consistent prompts lead to consistent answers, and that situations with no correct answer can be recognized. While large models are more resilient to our perturbations than small models, we globally invalidate these assumptions, with implications for reasoning models. Additionally, despite retaining the knowledge, small models are prone to memorization. All models exhibit significant failure in null-answer scenarios. We then suggest several adaptations for more robust benchmark designs, more reflective of real-world conditions.
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,534 cit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,423 cit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,917 cit.
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 cit.
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,582 cit.