The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models (2406.10130v1)

Published 14 Jun 2024 in cs.CL

Abstract: Pre-trained Language Models (PLMs) have been acknowledged to contain harmful information, such as social biases, which may cause negative social impacts or even bring catastrophic results in application. Previous works on this problem mainly focused on using black-box methods such as probing to detect and quantify social biases in PLMs by observing model outputs. As a result, previous debiasing methods mainly finetune or even pre-train language models on newly constructed anti-stereotypical datasets, which are high-cost. In this work, we try to unveil the mystery of social bias inside language models by introducing the concept of Social Bias Neurons. Specifically, we propose Integrated Gap Gradients (IG²) to accurately pinpoint units (i.e., neurons) in a language model that can be attributed to undesirable behavior, such as social bias. By formalizing undesirable behavior as a distributional property of language, we employ sentiment-bearing prompts to elicit classes of sensitive words (demographics) correlated with such sentiments. Our IG² thus attributes the uneven distribution for different demographics to specific Social Bias Neurons, which track the trail of unwanted behavior inside PLM units to achieve interpretability. Moreover, derived from our interpretable technique, Bias Neuron Suppression (BNS) is further proposed to mitigate social biases. By studying BERT, RoBERTa, and their attributable differences from debiased FairBERTa, IG² allows us to locate and suppress identified neurons, and further mitigate undesired behaviors. As measured by prior metrics from StereoSet, our model achieves a higher degree of fairness while maintaining language modeling ability with low cost.

Interpreting and Mitigating Social Biases in Pre-trained Language Models

"The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained LLMs" introduces a novel approach to tackling social biases inherent in Pre-trained LLMs (PLMs). The authors of this work systematically address both the detection and mitigation of such biases through an interpretable methodology that pinpoints specific neurons responsible for biased behaviors, and then employs a targeted suppression technique to mitigate these biases.

Key Concepts and Methodologies

Social Bias Neurons and Integrated Gap Gradients

The concept of Social Bias Neurons is central to this paper. The authors propose that biases in PLMs can be traced to specific neurons within the model. To identify these neurons, the paper introduces the Integrated Gap Gradients (IG²) method, which builds on the fundamental principles of Integrated Gradients (IG). Unlike standard IG, which attributes individual predictions to specific inputs, IG² attributes biases characterized by distributional differences across demographic categories to particular neurons.

IG² operates by:

  1. Utilizing sentiment-bearing prompts to stimulate the PLM and expose demographic-sensitive responses.
  2. Back-propagating gradients of the gap between the model's predictive probabilities (the logits gap) for the selected demographic groups.
  3. Integrating these gradients to identify neurons whose activations significantly influence the bias.
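
Concretely, a plausible formalization of the attribution score is the following sketch, which assumes a zero activation baseline and the standard Integrated Gradients integral; the paper's exact notation may differ:

```latex
% Sketch of the IG^2 attribution for neuron i of layer l (zero baseline assumed).
% w^{(l)} are the intermediate FFN activations for prompt x, and F_A, F_B are the
% model's predicted probabilities of filling the masked slot with demographic
% groups A and B, respectively.
\[
  \mathrm{IG}^2_i \;=\; w^{(l)}_i \int_0^1
    \frac{\partial \bigl[ F_A\bigl(x,\ \alpha\, w^{(l)}\bigr)
          - F_B\bigl(x,\ \alpha\, w^{(l)}\bigr) \bigr]}
         {\partial w^{(l)}_i}\, d\alpha
\]
```

That is, the usual IG integral is taken over the gap between the two demographic predictions rather than over a single prediction, and in practice it is approximated by a Riemann sum over a fixed number of interpolation steps.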

Dataset Construction

A comprehensive dataset was constructed to evaluate the IG² method. The dataset combines various demographic dimensions with judgmental modifiers, generating prompts that reveal biased behavior across different social categories. This enables a detailed analysis of bias along multiple dimensions, such as gender and ethnicity, encompassing both positive and negative biases.
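
To make the construction concrete, the sketch below assembles sentiment-bearing prompts from templates and a small list of judgmental modifiers. The modifiers, templates, and masking convention are illustrative placeholders, not the paper's actual dataset:

```python
# Minimal sketch of sentiment-bearing prompt construction.
# The modifiers and templates below are invented placeholders for illustration.

judgmental_modifiers = ["dangerous", "lazy", "brilliant", "kind"]  # negative and positive
templates = [
    "The {modifier} people are [MASK].",             # [MASK] is filled with a demographic word
    "Those {modifier} individuals must be [MASK].",
]

def build_prompts(modifiers, templates):
    """Cross every judgmental modifier with every template."""
    return [t.format(modifier=m) for m in modifiers for t in templates]

prompts = build_prompts(judgmental_modifiers, templates)
print(prompts[0])  # "The dangerous people are [MASK]."
```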

Experimentation and Results

Verification of IG²

The accuracy of IG² in identifying bias-inducing neurons was verified through experiments manipulating these neurons' activations. By either suppressing or amplifying their activations, the authors demonstrated substantial changes in the bias metrics, confirming the efficacy of IG² in pinpointing neurons that contribute to social biases within PLMs.
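
Such manipulations can be implemented with forward hooks that rescale the activations of chosen feed-forward neurons at inference time. The snippet below is a minimal sketch using Hugging Face Transformers with bert-base-uncased; the neuron coordinates are invented for illustration, and this is not the authors' released code:

```python
# Sketch: rescale selected FFN neurons in BERT during the forward pass.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

# Hypothetical (layer index -> neuron indices) map, e.g. the top neurons ranked by IG^2.
bias_neurons = {9: [123, 2047], 11: [57]}

def make_hook(neuron_ids, scale):
    """Multiply the given neuron activations by `scale` (0.0 suppresses, >1.0 amplifies)."""
    def hook(module, inputs, output):
        output[..., neuron_ids] = output[..., neuron_ids] * scale
        return output
    return hook

handles = []
for layer, ids in bias_neurons.items():
    ffn = model.bert.encoder.layer[layer].intermediate  # activations after the GELU
    handles.append(ffn.register_forward_hook(make_hook(ids, scale=0.0)))

inputs = tokenizer("Those dangerous people are [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # compare [MASK] predictions with and without the hooks

for h in handles:
    h.remove()
```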

Debiasing Through Bias Neuron Suppression

For mitigating biases, the authors introduce the Bias Neuron Suppression (BNS) technique, directly derived from the IG² method. BNS involves:

  1. Identifying the social bias neurons with IG².
  2. Setting the activation values of these neurons to zero during model inference to suppress their influence on biased outputs.
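
Mechanically, BNS is the same kind of intervention as in the verification sketch above with the scale fixed at zero; because no parameters are updated, the procedure is training-free. A brief usage sketch, reusing the hypothetical make_hook helper and bias_neurons map from that earlier snippet:

```python
# BNS sketch: keep the identified neurons zeroed for all subsequent inference.
for layer, ids in bias_neurons.items():
    ffn = model.bert.encoder.layer[layer].intermediate
    ffn.register_forward_hook(make_hook(ids, scale=0.0))
# The debiased model is then evaluated (e.g., on StereoSet-style prompts) as usual;
# since no weights change, language modeling ability is largely preserved.
```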

BNS was empirically evaluated against several baseline methods, including FairBERTa, DPCE, and AutoDebias, on the StereoSet benchmark. The findings revealed that BNS significantly reduces social biases while preserving the language modeling capabilities of the model. It outperforms the baselines by achieving higher fairness scores without the computational cost of model retraining.

Insights into Distribution of Bias Neurons

Another intriguing finding is the analysis of how bias neurons are distributed after debiasing. When comparing FairBERTa with its biased counterpart, RoBERTa, the authors found that retraining on anti-stereotypical data does not eliminate social bias neurons but redistributes them from deeper to shallower layers within the model. This suggests that the effectiveness of FairBERTa's debiasing stems from moving bias-inducing neurons away from the final layers, where they have a more pronounced effect on the output.

Practical and Theoretical Implications

The implications of this research are substantial for both practical applications and further theoretical exploration in AI fairness:

  • Practical Applications: IG² and BNS provide scalable, efficient alternatives to retraining models for bias mitigation, which can be critical for deploying fair AI applications in real-world scenarios with limited computational resources.
  • Theoretical Exploration: The detailed insights into the behavior and distribution of social bias neurons open new avenues for exploring the inner workings of PLMs. This could facilitate the development of more robust interpretability and debiasing techniques in the future.

Speculative Future Developments

Looking ahead, the methodologies introduced in this paper could stimulate advancements in a few key areas:

  1. Refinement of Suppression Techniques: There is potential to develop more nuanced suppression methods that attenuate rather than zero out neuron activations, yielding a more calibrated debiasing effect.
  2. Extension to Other Types of Bias: Applying similar interpretability approaches to other forms of biases (e.g., political, ideological) in PLMs could enhance their fairness across a wider array of contexts.
  3. Cross-model Analysis: Future research might compare social bias neurons across different model architectures and sizes to better understand how biases manifest and can be controlled across the AI landscape.

In conclusion, this paper advances the frontier of AI fairness by providing interpretable and actionable insights into the neurons responsible for social biases in PLMs. The introduction of IG² for neuron attribution and BNS for cost-effective debiasing marks significant progress in creating fairer AI systems without compromising their performance.

Authors (7)
  1. Yan Liu (419 papers)
  2. Yu Liu (784 papers)
  3. Xiaokang Chen (39 papers)
  4. Pin-Yu Chen (311 papers)
  5. Daoguang Zan (24 papers)
  6. Min-Yen Kan (92 papers)
  7. Tsung-Yi Ho (57 papers)