This is an overview page with metadata for this scientific work. The full article is available from the publisher.
No Free Lunch in Language Model Bias Mitigation? Targeted Bias Reduction Can Exacerbate Unmitigated LLM Biases
0 citations · 3 authors · 2026
Abstract
Large Language Models (LLMs) inherit societal biases from their training data, potentially leading to harmful outputs. While various techniques aim to mitigate these biases, their effects are typically evaluated only along the targeted dimension, leaving cross-dimensional consequences unexplored. This work provides the first systematic quantification of cross-category spillover effects in LLM bias mitigation. We evaluate four bias mitigation techniques (Logit Steering, Activation Patching, BiasEdit, Prompt Debiasing) across ten models from seven families, measuring their impact on race-, religion-, profession-, and gender-related biases using the StereoSet benchmark. Across 160 experiments yielding 640 evaluations, we find that targeted interventions cause collateral degradation of model coherence and debiasing performance in 31.5% of untargeted-dimension evaluations. These findings provide empirical evidence that debiasing improvements along one dimension can come at the cost of degradation in others. We introduce a multi-dimensional auditing framework and demonstrate that single-target evaluations mask potentially severe spillover effects, underscoring the need for robust, multi-dimensional evaluation tools when examining and developing bias mitigation strategies, so as to avoid inadvertently shifting or worsening bias along untargeted axes.
Similar Works
2019 · 31,434 cit.
Techniques to Identify Themes
2003 · 5,364 cit.
Answering the Call for a Standard Reliability Measure for Coding Data
2007 · 4,052 cit.
Basic Content Analysis
1990 · 4,044 cit.
Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts
2013 · 3,024 cit.