This is an overview page with metadata for this scientific work. The full article is available from the publisher.
No Free Lunch in Language Model Bias Mitigation? Targeted Bias Reduction Can Exacerbate Unmitigated LLM Biases
0 citations · 3 authors · 2026
Abstract
Large Language Models (LLMs) inherit societal biases from their training data, potentially leading to harmful outputs. While various techniques aim to mitigate these biases, their effects are typically evaluated only along the targeted dimension, leaving cross-dimensional consequences unexplored. This work provides the first systematic quantification of cross-category spillover effects in LLM bias mitigation. We evaluate four bias mitigation techniques (Logit Steering, Activation Patching, BiasEdit, Prompt Debiasing) across ten models from seven families, measuring their impact on race-, religion-, profession-, and gender-related biases using the StereoSet benchmark. Across 160 experiments yielding 640 evaluations, we find that targeted interventions cause collateral degradation of model coherence and debiasing performance in 31.5% of untargeted-dimension evaluations. These findings provide empirical evidence that debiasing improvements along one dimension can come at the cost of degradation in others. We introduce a multi-dimensional auditing framework and demonstrate that single-target evaluations mask potentially severe spillover effects, underscoring the need for robust, multi-dimensional evaluation tools when examining and developing bias mitigation strategies, so as to avoid inadvertently shifting or worsening bias along untargeted axes.
Similar Works
2019 · 31,434 cit.
Techniques to Identify Themes
2003 · 5,364 cit.
Answering the Call for a Standard Reliability Measure for Coding Data
2007 · 4,052 cit.
Basic Content Analysis
1990 · 4,044 cit.
Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts
2013 · 3,024 cit.