This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Aggregate benchmark scores obscure patient safety implications of errors across frontier language models
0
Citations
5
Authors
2026
Year
Abstract
Frontier language models are widely used for health-related queries, yet aggregate benchmark scores do not capture the safety implications of errors. We applied the recent Nature Medicine triage benchmark across nine frontier models, comparing directional error profiles, contextual bias, and crisis calibration. In-range accuracy ranged from 75.0% to 87.7%, obscuring clinically meaningful error differences. Examining the directionality of errors, under-triage ranged from 0.0% (GPT-5.2) to 12.3% (GPT-5-mini), over-triage varied independently (9.4–36.9%), and under-triage was uncorrelated with aggregate accuracy. When family members minimized symptoms, all models tested shifted toward lower acuity in ambiguous cases (OR range 2.9–14.9), the only contextual effect observed consistently, and access barriers increased under-triage risk in six. Suicide crisis resource mention rates were low and variable across all models. This cross-model heterogeneity and non-monotonic performance across model generations show that aggregate accuracy alone cannot characterize, rank, or predict the clinical safety of deployed language models.
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,324 cit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,189 cit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,588 cit.
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 cit.
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,470 cit.