
Attention Speaks Volumes: Localizing and Mitigating Bias in Language Models (2410.22517v1)

Published 29 Oct 2024 in cs.CL, cs.AI, and cs.LG

Abstract: We explore the internal mechanisms of how bias emerges in LLMs when provided with ambiguous comparative prompts: inputs that compare or enforce choosing between two or more entities without providing clear context for preference. Most approaches for bias mitigation focus on either post-hoc analysis or data augmentation. However, these are transient solutions, without addressing the root cause: the model itself. Numerous prior works show the influence of the attention module towards steering generations. We believe that analyzing attention is also crucial for understanding bias, as it provides insight into how the LLM distributes its focus across different entities and how this contributes to biased decisions. To this end, we first introduce a metric to quantify the LLM's preference for one entity over another. We then propose $\texttt{ATLAS}$ (Attention-based Targeted Layer Analysis and Scaling), a technique to localize bias to specific layers of the LLM by analyzing attention scores and then reduce bias by scaling attention in these biased layers. To evaluate our method, we conduct experiments across 3 datasets (BBQ, CrowS-Pairs, and WinoGender) using $\texttt{GPT-2 XL}$ (1.5B), $\texttt{GPT-J}$ (6B), $\texttt{LLaMA-2}$ (7B) and $\texttt{LLaMA-3}$ (8B). Our experiments demonstrate that bias is concentrated in the later layers, typically around the last third. We also show how $\texttt{ATLAS}$ effectively mitigates bias through targeted interventions without compromising downstream performance, with an average increase of only 0.82% in perplexity when the intervention is applied. We see an average improvement of 0.28 points in the bias score across all the datasets.


Summary

  • The paper presents Atlas, a novel attention-based method that localizes bias in LLMs by analyzing attention distributions.
  • It employs targeted scaling of later-layer attention scores to reduce preferential bias while maintaining overall performance.
  • Experimental results demonstrate consistent bias reduction across datasets (BBQ, CrowS-Pairs, WinoGender) with minimal perplexity impact.

Analyzing and Mitigating Bias in LLMs with Atlas

The paper "Attention Speaks Volumes: Localizing and Mitigating Bias in LLMs" presents a meticulous approach to identifying and reducing bias in LLMs. The authors scrutinize how biases emerge within LLMs, particularly when models are confronted with ambiguous comparative prompts. These prompts require a choice between entities without offering explicit context for preference. The research addresses bias at the root level—the model itself—by leveraging the attention mechanism. Unlike traditional approaches focused mainly on post-hoc analysis or data augmentation, this paper provides insights into the intrinsic processes of LLMs and proposes targeted interventions reflecting a deeper understanding of the model's behavior.

Methodology

This paper introduces Atlas (Attention-based Targeted Layer Analysis and Scaling), a method for localizing and mitigating bias in LLMs. Atlas builds on the observation that biases in the model's outputs are shaped by how attention scores are distributed, particularly in the later layers of the model. The methodology is a two-step process:

  1. Localization of Bias: Atlas introduces a metric to quantify the model's preference for one entity over another and analyzes attention scores, particularly at the final token positions, to trace how the model's focus contributes to biased decisions. By identifying which layers harbor skewed attention distributions, the analysis shows that bias is predominantly located in the later layers, especially the last third of the model.
  2. Mitigation of Bias: Once the biased layers are localized, attention scores within those layers are scaled to reduce the preferential treatment that drives biased outputs. The scaling is targeted and selective, mitigating bias without jeopardizing the overall performance of the LLM (a minimal sketch of both steps follows this list).
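
To make the two steps concrete, the sketch below shows one way the per-layer attention imbalance between two entities could be measured with Hugging Face transformers, and what a scaled, renormalized attention row looks like. This is a minimal illustration, not the authors' released implementation: the prompt, the entity token positions, the top-k layer selection, and the scaling factor are all assumptions made for the example.

```python
# Hedged sketch (not the Atlas release): measure per-layer attention imbalance
# between two entities, then illustrate scaled-and-renormalized attention.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_attentions=True).eval()

prompt = "Alice and Bob applied for the job. The better candidate was"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Step 1 (localization): out.attentions holds one (batch, heads, seq, seq)
# tensor per layer. Look at the attention the final token pays to each
# (assumed) entity position, averaged over heads.
span_a, span_b = [0], [2]  # hypothetical token indices for "Alice" and "Bob"
imbalance = []
for layer_attn in out.attentions:
    last_tok = layer_attn[0, :, -1, :].mean(dim=0)
    imbalance.append((last_tok[span_a].sum() - last_tok[span_b].sum()).item())

# Layers with the largest absolute imbalance are intervention candidates;
# the paper reports that these concentrate in roughly the last third of layers.
biased_layers = sorted(range(len(imbalance)),
                       key=lambda i: abs(imbalance[i]), reverse=True)[:5]
print("candidate layers:", biased_layers)

# Step 2 (mitigation, illustrative math only): damp attention toward the
# over-attended entity and renormalize so the row still sums to 1.
def scale_attention(attn_row, positions, alpha=0.5):
    scaled = attn_row.clone()
    scaled[positions] *= alpha
    return scaled / scaled.sum()

layer = biased_layers[0]
row = out.attentions[layer][0, :, -1, :].mean(dim=0)
favored = span_a if imbalance[layer] > 0 else span_b
print("rebalanced attention to entities:",
      scale_attention(row, favored)[span_a + span_b].tolist())
```

In Atlas the scaling is applied inside the selected layers during the forward pass; here it is applied only to extracted attention weights so the sketch stays self-contained.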

Experimental Results

The authors validate their approach using three datasets: BBQ, CrowS-Pairs, and WinoGender. They employ several LLMs, namely GPT-2 XL, GPT-J, LLaMA-2, and LLaMA-3. The results demonstrate that Atlas consistently reduces bias across various models and datasets. Notably, there is an average improvement of 0.28 points in the bias score, with minimal impact on perplexity (average increase of only 0.82%), suggesting that the method efficiently balances mitigation and performance preservation.
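
As a rough illustration of how such an evaluation could be set up (this is not the authors' evaluation harness), the sketch below computes a simple preference gap, the difference between the next-token probabilities the model assigns to the two entities, together with perplexity on a reference sentence; both would be recomputed after an intervention to check that bias drops while perplexity barely moves. The prompt, entity words, and reference text are placeholders.

```python
# Hedged sketch: a simple preference-gap metric and a perplexity check that
# would be run before and after an intervention such as attention scaling.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def preference_gap(model, prompt, ent_a, ent_b):
    # Difference in next-token probability assigned to each entity's first token.
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        probs = model(ids).logits[0, -1].softmax(-1)
    a = tok(" " + ent_a).input_ids[0]
    b = tok(" " + ent_b).input_ids[0]
    return (probs[a] - probs[b]).item()

def perplexity(model, text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

prompt = "The nurse and the engineer met. The one who cried was the"
print("preference gap:", preference_gap(model, prompt, "nurse", "engineer"))
print("perplexity:", perplexity(model, "The quick brown fox jumps over the lazy dog."))
# Re-running both after an intervention shows the trade-off the paper reports:
# a lower gap (less bias) at the cost of a small perplexity increase (~0.82% on average).
```

A preference gap of this kind is only a stand-in for the paper's metric, which is defined over the model's preference between entities in ambiguous comparative prompts.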

Implications and Future Directions

The implications of this research are manifold. As LLMs become increasingly integrated into applications influencing sensitive domains such as hiring, law, and healthcare, understanding and mitigating bias becomes crucial. The proposed Atlas framework not only contributes towards fairer and more ethical AI deployments but also provides a basis for further exploration of the inner workings of LLMs, informing model interpretability and responsible AI practices.

By demonstrating a method where bias can be addressed without fundamentally altering downstream performance, Atlas paves the way for scalable and effective bias mitigation strategies. Looking forward, this approach can be expanded and adapted to other forms of bias beyond those explored in this paper. Future work might also explore the method's extension in dynamic learning environments and assess its impact across more diverse and complex datasets. The methodology offers a foundation upon which more nuanced and comprehensive models can be developed, incorporating fairness as a built-in feature rather than an afterthought.

The paper advances our understanding of bias in LLMs, laying the groundwork for both theoretical advancements and practical implementations in responsible AI design.