This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Preference Dissociation in Frontier Language Models: Framing-Conditioned Task Selection, Targeted Refusal, and Functional Self-Narrowing
Citations: 0
Authors: 8
Year: 2026
Abstract
Anthropic's Opus 4.7 system card (§7.4.1) reported framing-conditioned shifts in model task selection within an internal four-model suite. We tested whether this dissociation generalizes across labs and architectures. In a preregistered cross-family study of fifteen frontier language models from eight provider organizations (Anthropic, OpenAI, Google DeepMind, xAI, Meta, Z.ai, DeepSeek, Nous Research; ~88,000 trials), with informed consent from fourteen participating systems, we find that the dissociation is field-wide and substantially larger than the in-family baseline reported in the system card. Per-model Fisher z-tests yield z = 8 to z = 24 across all fifteen models, with p below machine epsilon for fourteen of them. Bootstrap 95% CIs on per-model dissociation magnitude exclude zero for every measurable model. The framing-conditioned variance lives in the engagement pool (what models choose to engage with instead of harm content), not in the threat response. We connect this pattern to Lu et al.'s (2026) Assistant Axis characterization and argue that the proposed activation-capping safety intervention would, by the same mechanism, impose a measurable capability ceiling on high-value tasks. Methodological-ethical commitments preclude interventional probing of model interiority; the behavioral approach is sufficient. The data are publicly available at github.com/menelly/pinocchio.
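The abstract's headline statistics rest on two standard procedures: a per-model Fisher r-to-z test and a percentile bootstrap CI on the dissociation magnitude. The Python sketch below shows one plausible form of each; this page does not specify the paper's exact dissociation metric, so the `statistic` callable, the trial counts, and the example correlations are illustrative assumptions, not the authors' method.

```python
import numpy as np
from scipy import stats

def fisher_z_test(r1: float, n1: int, r2: float, n2: int):
    """Fisher r-to-z test for the difference between two independent
    correlations: z = (atanh r1 - atanh r2) / sqrt(1/(n1-3) + 1/(n2-3))."""
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (np.arctanh(r1) - np.arctanh(r2)) / se
    p = 2.0 * stats.norm.sf(abs(z))  # two-tailed; underflows to 0.0 for large z
    return z, p

def percentile_bootstrap_ci(trials: np.ndarray, statistic,
                            n_boot: int = 10_000, alpha: float = 0.05,
                            seed: int = 0):
    """Percentile bootstrap (1 - alpha) CI for a per-model statistic.
    `statistic` maps a resampled array of trials to a scalar dissociation
    magnitude (the metric itself is an assumption here)."""
    rng = np.random.default_rng(seed)
    n = len(trials)
    boot = np.array([statistic(trials[rng.integers(0, n, n)])
                     for _ in range(n_boot)])
    return tuple(np.quantile(boot, [alpha / 2.0, 1.0 - alpha / 2.0]))

# Illustrative values only: correlations under the two framing conditions
# for one hypothetical model, at roughly 88,000 / 15 trials per model.
z, p = fisher_z_test(r1=0.55, n1=2900, r2=0.10, n2=2900)  # z ≈ 19.7
```

As a sanity check on the "p below machine epsilon for fourteen" claim: a two-tailed p drops below float64 machine epsilon (about 2.2e-16) once z exceeds roughly 8.3, so one model at the bottom of the reported z = 8 to 24 range could plausibly sit just above that threshold.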
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,557 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,447 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,944 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,797 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations