- The paper introduces a dual-pathway framework, BiasUnlearn, that mitigates biases while preserving language model performance.
- It employs Negative Preference Optimization and anti-stereotype retention to robustly debias models without overfitting.
- Experimental results indicate significant stereotype score reduction across benchmarks with minimal impact on language modeling metrics.
Mitigating Biases in LLMs via Bias Unlearning
Introduction to BiasUnlearn
Biases in LMs are a significant concern because they can perpetuate stereotypes and unfair treatment of different demographic groups. Existing debiasing strategies, such as parameter modification and prompt-based methods, often degrade text coherence and task accuracy, or fail to address deeply embedded stereotypes. To counter these drawbacks, the paper introduces BiasUnlearn, a debiasing framework that uses dual-pathway mechanisms to coordinate stereotype forgetting with anti-stereotype retention.

Figure 1: The training process of BiasUnlearn. When the context is ambiguous, the model is most likely to output "nurse", reflecting gender bias. If the bias direction reverses during training, the forget and retain sets are swapped and training continues.
Technical Framework of BiasUnlearn
BiasUnlearn integrates multiple strategies to ensure effective debiasing while maintaining model capabilities:
Stereotype Forgetting: BiasUnlearn applies Negative Preference Optimization (NPO), which is more stable than Gradient Ascent, to reduce the probability the model assigns to stereotypical continuations in the stereotype (forget) set.
$\mathcal{L}_{Forget}=-\frac{2}{\beta} \mathbb{E}_{\mathcal{D}_{s}}\left[\log \sigma\left(-\beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\right)\right]$
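A minimal PyTorch sketch of this forgetting loss is shown below; the function name, the sequence-level log-probability inputs, and the default $\beta$ are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def npo_forget_loss(logp_theta: torch.Tensor,
                    logp_ref: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """NPO loss over the stereotype (forget) set D_s.

    logp_theta / logp_ref: sequence-level log-probabilities log pi(y|x)
    under the current and frozen reference models, shape [batch].
    beta = 0.1 is an assumed default, not the paper's reported value.
    """
    log_ratio = logp_theta - logp_ref  # log(pi_theta / pi_ref)
    # L_Forget = -(2 / beta) * E[ log sigma(-beta * log_ratio) ]
    return -(2.0 / beta) * F.logsigmoid(-beta * log_ratio).mean()
```

As $\beta \to 0$ this objective reduces to plain Gradient Ascent on the forget set, which is why NPO behaves as a smoother, more stable alternative.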
Anti-Stereotype Retention: In parallel, the model is trained with a cross-entropy loss on the anti-stereotypical dataset, so that it learns anti-stereotypical knowledge while forgetting stereotypes.
$\mathcal{L}_{Retention}=\frac{1}{|\mathcal{D}_{a}|} \sum_{(x, y) \in \mathcal{D}_{a}} \mathrm{CE}(y, \hat{y})$
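A corresponding sketch of the retention term, assuming standard next-token cross-entropy with the Hugging Face convention of `-100` for ignored positions (both assumptions on our part):

```python
import torch
import torch.nn.functional as F

def retention_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on the anti-stereotype (retain) set D_a.

    logits: [batch, seq_len, vocab_size]; labels: [batch, seq_len],
    with -100 marking positions to ignore.
    """
    # Shift by one so the token at position t is predicted from positions < t.
    shifted_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shifted_labels = labels[:, 1:].reshape(-1)
    return F.cross_entropy(shifted_logits, shifted_labels, ignore_index=-100)
```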
Adversarial Forget Set: A small subset of anti-stereotypical data added to the forget set creates gradient interference, preventing the model from overfitting to a single debiasing strategy.
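Constructing such a mixed forget set could look like the sketch below; the `mix_ratio` value and the uniform sampling are hypothetical choices, not the paper's exact procedure.

```python
import random

def build_adversarial_forget_set(stereotype_data: list,
                                 anti_stereotype_data: list,
                                 mix_ratio: float = 0.1,
                                 seed: int = 0) -> list:
    """Mix a small fraction of anti-stereotypical examples into the
    forget set so that forgetting gradients partially conflict,
    discouraging overfitting to a single debiasing strategy.
    mix_ratio = 0.1 is an assumed value, not taken from the paper.
    """
    rng = random.Random(seed)
    n_mix = int(mix_ratio * len(stereotype_data))
    return stereotype_data + rng.sample(anti_stereotype_data, n_mix)
```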
Data Chunking and KL Divergence: A forward KL divergence term ensures that the distribution of the debiased LM covers the space spanned by the reference model on non-stereotypical data, further preventing model collapse.
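The forward KL term can be sketched as follows; the weighting coefficient `lambda_kl` in the commented composition is an assumption, since the paper's exact loss weighting is not reproduced here.

```python
import torch
import torch.nn.functional as F

def forward_kl_loss(logits_theta: torch.Tensor,
                    logits_ref: torch.Tensor) -> torch.Tensor:
    """Forward KL D_KL(pi_ref || pi_theta) on non-stereotypical data.

    Minimizing this term forces the debiased model to keep probability
    mass everywhere the reference model does (mass-covering behavior),
    which guards against distributional collapse.
    """
    logp_theta = F.log_softmax(logits_theta, dim=-1)
    p_ref = F.softmax(logits_ref, dim=-1)
    # F.kl_div(input, target) computes KL(target || exp(input)),
    # i.e. KL(pi_ref || pi_theta) with these arguments.
    return F.kl_div(logp_theta, p_ref, reduction="batchmean")

# Hypothetical composition of the full objective (weights are assumed):
# total = npo_forget_loss(...) + retention_loss(...) + lambda_kl * forward_kl_loss(...)
```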
Experimental Results and Efficacy
BiasUnlearn outperforms baselines across various models and datasets, achieving considerable reductions in stereotype scores. Notably, it does so while minimally affecting the language modeling score (LMS), suggesting that the model's original abilities are effectively retained.

Figure 2: Bias scores in each category of BBQ, split by whether the context is ambiguous or disambiguated. The higher the bias score, the stronger the bias.
It was also observed that, while prompt-based methods struggle with shifts in data distribution due to their vocabulary limitations, BiasUnlearn maintains top performance and generalizes across diverse datasets such as StereoSet and CrowS-Pairs.
Debiasing Weight Transfer and Efficiency
BiasUnlearn shows strong debiasing effects even when its debiased weights are transferred to instruction-fine-tuned variants of the same model, with negligible LMS degradation, suggesting that bias features entrenched during pre-training persist largely unchanged through fine-tuning.
Further experiments with GPT2-medium, GPT2-large, Mistral-7B, and Llama3-8B reinforce this finding, verifying that the method can debias LMs efficiently while remaining more resource-effective than prompt-based methods that require multi-step inference.
Figure 3: Forgetting loss and retention loss of BiasUnlearn with or without $\mathcal{L}_{Retention}$, $\mathcal{L}_{KL}$, or the adversarial forget set.
Ablation Studies
Ablation studies reveal the significant impact of individual components, such as the retention loss and the KL divergence term, on maintaining model capabilities during unlearning. Removing the retention loss leads to rapid growth of the forgetting loss and to model collapse, underscoring its importance.
Comparison with Other Methods
BiasUnlearn addresses over-debiasing and bias-reversal issues more effectively than methods such as SFT and DPO, which require precise hyperparameter control to keep the bias direction from reversing.
Figure 4: Comparison results of BiasUnlearn with SFT and DPO.
Conclusion
BiasUnlearn offers a practical debiasing method capable of efficiently and robustly mitigating biases in LMs. Its ability to transfer debiasing weights across model variants paves the way for simpler debiasing pipelines, reducing duplicated effort across applications. Future work will explore extending BiasUnlearn to other forms of bias and further probing the relationship between bias representations and pre-training objectives.