OpenAlex · Updated hourly · Last updated: 28 Mar 2026, 16:02

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Aggregate benchmark scores obscure patient safety implications of errors across frontier language models

2026 · 0 citations · medRxiv · Open Access

Citations: 0 · Authors: 5 · Year: 2026

Abstract

Frontier language models are widely used for health-related queries, yet aggregate benchmark scores do not capture the safety implications of errors. We applied the recent Nature Medicine triage benchmark across nine frontier models, comparing directional error profiles, contextual bias, and crisis calibration. In-range accuracy ranged from 75.0% to 87.7%, obscuring clinically meaningful differences in errors. By direction of error, under-triage ranged from 0.0% (GPT-5.2) to 12.3% (GPT-5-mini), over-triage varied independently (9.4–36.9%), and under-triage was uncorrelated with aggregate accuracy. When family members minimized symptoms, all models tested shifted toward lower acuity in ambiguous cases (odds ratios 2.9–14.9), the only consistently observed contextual effect; access barriers increased under-triage risk in six models. Suicide crisis resource mention rates were low and variable across all models. This cross-model heterogeneity, together with non-monotonic performance across model generations, shows that aggregate accuracy alone cannot characterize, rank, or predict the clinical safety of deployed language models.
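The distinction the abstract draws between aggregate accuracy and directional error rates can be illustrated with a minimal sketch. This is not the paper's code: the function name, the ordinal encoding of acuity (larger number = more urgent), and the toy data are all assumptions for illustration only.

```python
# Hypothetical sketch of directional triage error rates.
# Assumption: acuity is an ordinal integer where a LARGER value means
# MORE urgent; the benchmark's actual scoring scheme is not given here.

def directional_error_rates(predicted, reference):
    """Return (under_triage, over_triage) fractions.

    under-triage: model rates a case LESS urgent than the reference;
    over-triage:  model rates a case MORE urgent than the reference.
    """
    n = len(reference)
    under = sum(p < r for p, r in zip(predicted, reference)) / n
    over = sum(p > r for p, r in zip(predicted, reference)) / n
    return under, over

# Toy example: 4 cases, one under-triaged, one over-triaged.
pred = [3, 2, 4, 1]
ref = [3, 3, 2, 1]
under, over = directional_error_rates(pred, ref)
# under = 0.25, over = 0.25, yet "accuracy" (exact matches) is only 0.5:
# two models with equal accuracy can differ entirely in error direction.
```

The point of the sketch: two models can have identical aggregate accuracy while one concentrates its errors on the under-triage side, which is the clinically riskier direction.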
