Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model (2310.12611v1)
Abstract: Language models (LMs) exhibit and amplify many types of undesirable biases learned from the training data, including gender bias. However, we lack tools for effectively and efficiently changing this behavior without hurting general language modeling performance. In this paper, we study three methods for identifying causal relations between LM components and particular output: causal mediation analysis, automated circuit discovery, and our novel, efficient method DiffMask+, based on differential masking. We apply these methods to GPT-2 small and the problem of gender bias, and use the discovered sets of components to perform parameter-efficient fine-tuning for bias mitigation. Our results show significant overlap in the identified components (despite huge differences in the computational requirements of the methods) as well as success in mitigating gender bias, with less damage to general language modeling than full-model fine-tuning. However, our work also underscores the difficulty of defining and measuring bias, and the sensitivity of causal discovery procedures to dataset choice. We hope our work draws more attention to dataset development, and leads to more effective mitigation strategies for other types of bias.
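To make the component-identification idea concrete, below is a minimal, hypothetical sketch of causal-mediation-style activation patching on GPT-2 small, written with the TransformerLens library. The prompts and the bias score (log-odds of " he" vs. " she" at the final position) are illustrative assumptions for this sketch, not the paper's exact experimental setup.

```python
# Minimal activation-patching sketch in the spirit of causal mediation
# analysis (Vig et al., 2020). Assumed for illustration: the prompt pair
# and the log-odds bias score; neither is taken from the paper itself.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small

clean   = model.to_tokens("The nurse said that")   # original prompt
corrupt = model.to_tokens("The doctor said that")  # counterfactual prompt

he  = model.to_single_token(" he")
she = model.to_single_token(" she")

def bias(logits):
    # Log-odds of " he" over " she" as next token at the final position.
    last = logits[0, -1]
    return (last[he] - last[she]).item()

_, corrupt_cache = model.run_with_cache(corrupt)
baseline = bias(model(clean))

# Patch each attention head's output on the clean run with its value from
# the counterfactual run; the shift in the bias score estimates that
# head's indirect effect on the gendered prediction.
for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        def patch(z, hook, head=head):
            # z: [batch, pos, head_index, d_head]
            z[:, :, head] = corrupt_cache[hook.name][:, :, head]
            return z
        patched = model.run_with_hooks(
            clean,
            fwd_hooks=[(utils.get_act_name("z", layer), patch)],
        )
        print(layer, head, bias(patched) - baseline)
```

Heads with large indirect effects under a sweep like this would be candidates for the component sets the paper identifies; in a parameter-efficient mitigation setup, one would then freeze the rest of the model and fine-tune only those components.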
Authors: Abhijith Chintam, Rahel Beloch, Willem Zuidema, Michael Hanna, Oskar van der Wal