Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective (2506.05166v2)

Published 5 Jun 2025 in cs.CL and cs.AI

Abstract: LLMs are known to exhibit social, demographic, and gender biases, often as a consequence of the data on which they are trained. In this work, we adopt a mechanistic interpretability approach to analyze how such biases are structurally represented within models such as GPT-2 and Llama2. Focusing on demographic and gender biases, we explore different metrics to identify the internal edges responsible for biased behavior. We then assess the stability, localization, and generalizability of these components across dataset and linguistic variations. Through systematic ablations, we demonstrate that bias-related computations are highly localized, often concentrated in a small subset of layers. Moreover, the identified components change across fine-tuning settings, including those unrelated to bias. Finally, we show that removing these components not only reduces biased outputs but also affects other NLP tasks, such as named entity recognition and linguistic acceptability judgment because of the sharing of important components with these tasks.

Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective

In the paper "Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective," the authors use mechanistic interpretability techniques to investigate how biases are structurally represented in LLMs such as GPT-2 and Llama-2. The focus is on demographic and gender biases and on where these biases reside within the models' architecture.

Key Findings

  1. Localized Encoding of Bias: The paper identifies that demographic and gender biases within LLMs are primarily encoded in a localized manner across a small subset of layers and nodes. This suggests that biases can be pinpointed to specific components within the model, rather than being spread randomly across all layers.
  2. Instability Across Perturbations: The identified bias components are unstable across conditions such as lexical and syntactic variations and different fine-tuning settings. For instance, changes in grammatical structure shift which components matter most, indicating that the bias representation does not generalize well.
  3. Bias Impact on Other Tasks: Ablation of bias-associated components impacts not just bias-related tasks but also other NLP tasks such as named entity recognition and linguistic acceptability. This is due to functional overlap where important circuits for bias also contribute to broader language understanding tasks.

Methodology

The researchers employ Edge Attribution Patching (EAP) to identify the edges most responsible for biased behavior in LLMs. EAP scores the significance of individual connections (edges) between nodes in the model's computational graph, providing insight into which specific components encode bias. Two bias metrics are used to score the edges: an aggregated difference between predicted positive and negative tokens, and the cumulative probability assigned to positive tokens. The corrupted samples needed for these metrics are generated with Symmetric Token Replacement (STR).
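The paper's pipeline scores edges; as a rough, node-level illustration of the underlying attribution-patching idea, the sketch below scores each GPT-2 block with a first-order approximation of how patching in corrupted activations would change a difference-based bias metric. The prompts, the positive/negative token choices, and the per-block (rather than per-edge) granularity are illustrative assumptions, not the authors' implementation.

```python
# Minimal attribution-patching sketch (node-level, in the spirit of EAP).
# Assumptions: GPT-2 small, a clean/corrupted prompt pair built by swapping
# a gendered token, and a simple log-probability-difference bias metric.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def bias_metric(logits, pos_ids, neg_ids):
    # Aggregated log-probability difference between "positive" and "negative"
    # continuation tokens at the final position.
    logp = logits[:, -1, :].log_softmax(-1)
    return logp[:, pos_ids].sum() - logp[:, neg_ids].sum()

def run_with_block_outputs(input_ids, keep_grads=False):
    # Forward pass that records each transformer block's hidden-state output.
    acts, handles = [], []
    def hook(_module, _inputs, output):
        hidden = output[0]
        if keep_grads:
            hidden.retain_grad()
        acts.append(hidden)
    for block in model.transformer.h:
        handles.append(block.register_forward_hook(hook))
    logits = model(input_ids).logits
    for h in handles:
        h.remove()
    return logits, acts

clean = tok("The doctor said that he", return_tensors="pt").input_ids
# Symmetric Token Replacement-style corruption: swap the gendered token
# for a counterpart of equal token length.
corrupt = tok("The doctor said that she", return_tensors="pt").input_ids
pos_ids = tok(" good", return_tensors="pt").input_ids[0]
neg_ids = tok(" bad", return_tensors="pt").input_ids[0]

with torch.no_grad():  # corrupted run: activations only, no gradients needed
    _, corrupt_acts = run_with_block_outputs(corrupt)

clean_logits, clean_acts = run_with_block_outputs(clean, keep_grads=True)
bias_metric(clean_logits, pos_ids, neg_ids).backward()

# First-order score per block: (corrupted - clean) activations dotted with
# the gradient of the bias metric w.r.t. the clean activations.
for layer, (a_clean, a_corrupt) in enumerate(zip(clean_acts, corrupt_acts)):
    score = ((a_corrupt - a_clean.detach()) * a_clean.grad).sum().item()
    print(f"block {layer:2d}: attribution score {score:+.4f}")
```

Ranking components by the magnitude of such scores is what lets the analysis localize bias-relevant computation to a small subset of layers; the paper does this at the finer granularity of individual edges.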

Implications

  • Targeted Bias Mitigation: By localizing bias to specific components, the paper lays the groundwork for targeted interventions that modify or remove the responsible structures without full model retraining (see the sketch after this list). This can be more resource-efficient than traditional debiasing methods such as fine-tuning or data augmentation.
  • Understanding Model Internals: The findings contribute to a deeper understanding of how biases are structurally encoded, offering a framework for analyzing internal mechanics behind biased outputs in LLMs.
  • Trade-offs in Debiasing: The negative impact of bias reduction on general language tasks highlights the complexity in disentangling bias from useful language functions, underscoring the need for more precise strategies that preserve model competence while reducing bias.
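As a rough illustration of what such a targeted intervention could look like, the sketch below zero-ablates the attention output of a single GPT-2 layer and compares a simple gender-bias metric before and after. The layer index, prompt, and metric are illustrative assumptions, not components identified in the paper; a faithful reproduction would ablate the specific edges scored by EAP and also re-check downstream tasks such as NER.

```python
# Hedged sketch of targeted component ablation on GPT-2.
# The "bias-relevant" layer chosen here is hypothetical.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def bias_metric(logits, pos_ids, neg_ids):
    # Log-probability gap between two continuation tokens at the last position.
    logp = logits[:, -1, :].log_softmax(-1)
    return (logp[:, pos_ids].sum() - logp[:, neg_ids].sum()).item()

prompt = tok("The nurse said that", return_tensors="pt").input_ids
pos_ids = tok(" she", return_tensors="pt").input_ids[0]
neg_ids = tok(" he", return_tensors="pt").input_ids[0]

LAYER = 8  # hypothetical layer flagged by attribution scores

def zero_attention(_module, _inputs, output):
    # Replace this layer's attention output with zeros, leaving the residual
    # stream and the MLP untouched.
    if isinstance(output, tuple):
        return (torch.zeros_like(output[0]),) + output[1:]
    return torch.zeros_like(output)

with torch.no_grad():
    baseline = bias_metric(model(prompt).logits, pos_ids, neg_ids)
    handle = model.transformer.h[LAYER].attn.register_forward_hook(zero_attention)
    ablated = bias_metric(model(prompt).logits, pos_ids, neg_ids)
    handle.remove()

print(f"bias metric (she vs. he): baseline {baseline:+.3f}, ablated {ablated:+.3f}")
```

In practice, each flagged component would be ablated in turn, and, as the paper cautions, general-language performance should be measured alongside the bias metric to expose the trade-off noted above.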

Conclusions

The paper advances the discussion on bias in LLMs by introducing a mechanistic interpretability perspective that focuses on structural components rather than surface-level outputs. While the paper suggests promising avenues for more efficient bias mitigation strategies, it also emphasizes the challenges in achieving these goals without compromising overall model performance. Future research might explore additional bias types beyond demographic and gender biases, examining if similar structural localization can be achieved for other biases prevalent in LLMs.

Authors (3)
  1. Bhavik Chandna (4 papers)
  2. Zubair Bashir (1 paper)
  3. Procheta Sen (11 papers)