This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Context matching is not reasoning when performing generalized clinical evaluation of generative language models
0 citations · 15 authors · 2025
Abstract
Current discussion surrounding the clinical capabilities of generative language models (GLMs) predominantly centers on multiple-choice question-answer (MCQA) benchmarks derived from clinical licensing examinations. While accepted for human examinees, characteristics unique to GLMs call into question the validity of such benchmarks. Here, we validate five benchmarks using eight GLMs, ablating for parameter size and reasoning capabilities. Via prompt permutation, we test three key assumptions that underpin the generalizability of MCQA-based assessments: that knowledge is applied rather than memorized, that semantically consistent prompts lead to consistent answers, and that situations with no correct answer can be recognized. While large models are more resilient to our perturbations than small models, we globally invalidate these assumptions, with implications for reasoning models. Additionally, despite retaining the knowledge, small models are prone to memorization. All models exhibit significant failure in null-answer scenarios. We then suggest several adaptations for more robust benchmark designs, more reflective of real-world conditions.
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,534 cit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,423 cit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,917 cit.
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 cit.
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,582 cit.