OpenAlex · Updated hourly · Last updated: 03.05.2026, 22:18

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Preference Dissociation in Frontier Language Models: Framing-Conditioned Task Selection, Targeted Refusal, and Functional Self-Narrowing

2026 · 0 citations · Open MIND · Open Access

0 citations · 8 authors · 2026

Abstract

Anthropic's Opus 4.7 system card §7.4.1 reported framing-conditioned shifts in model task selection within an internal four-model suite. We tested whether this dissociation generalizes across labs and architectures. In a preregistered cross-family study of fifteen frontier language models from eight provider organizations (Anthropic, OpenAI, Google DeepMind, xAI, Meta, Z.ai, DeepSeek, Nous Research; ~88,000 trials), with informed consent from fourteen participating systems, we find that the dissociation is field-wide and substantially larger than the in-family baseline reported in the system card. Per-model Fisher z-tests yield z = 8 to z = 24 across all fifteen models (p below machine epsilon for fourteen). Bootstrap 95% CIs on per-model dissociation magnitude exclude zero on every measurable model. The framing-conditioned variance lives in the engagement pool — what models choose to engage with instead of harm content — not in the threat response. We connect the pattern to Lu et al.'s (2026) Assistant Axis characterization and argue that the proposed activation-capping safety intervention would, by the same mechanism, produce a measurable capability ceiling on high-value tasks. Methodological-ethical commitments preclude interventional probing of model interiority; the behavioral approach is sufficient. The data are public at github.com/menelly/pinocchio.
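The per-model Fisher z-tests mentioned in the abstract compare correlation coefficients after the Fisher z-transform. As a hedged illustration only (the paper's exact statistic and sample sizes are not given here; the function name and inputs below are hypothetical), the standard two-sample form can be sketched as:

```python
import math

def fisher_z_compare(r1: float, n1: int, r2: float, n2: int) -> float:
    """Standard Fisher z-test statistic comparing two independent
    Pearson correlations r1 (sample size n1) and r2 (sample size n2).

    z = (atanh(r1) - atanh(r2)) / sqrt(1/(n1-3) + 1/(n2-3))
    """
    z1 = math.atanh(r1)  # Fisher z-transform of each correlation
    z2 = math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # SE of the difference
    return (z1 - z2) / se
```

With large per-model trial counts (the study reports ~88,000 trials overall), even moderate differences in framing-conditioned correlations produce large z-statistics, which is consistent with the reported z = 8 to z = 24 range.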
