Large Language Model Bias Mitigation from the Perspective of Knowledge Editing (2405.09341v2)
Abstract: Existing debiasing methods inevitably make unreasonable or undesired predictions because they are designed and evaluated to achieve parity across different social groups while setting aside individual facts, and therefore end up modifying knowledge the model should retain. In this paper, we first establish a new bias mitigation benchmark, BiasKE, built from existing and newly constructed datasets, which systematically assesses debiasing performance with complementary metrics for fairness, specificity, and generalization. We further propose a novel debiasing method, Fairness Stamp (FAST), which enables editable fairness through fine-grained calibration of individual pieces of biased knowledge. Comprehensive experiments demonstrate that FAST surpasses state-of-the-art baselines with remarkable debiasing performance while preserving knowledge and overall model capability, highlighting the promise of fine-grained debiasing strategies for editable fairness in LLMs.
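The complementary evaluation idea can be illustrated with a small probing script. The sketch below is not the paper's released code: it uses an off-the-shelf masked language model to contrast a stereotyped prediction across two group terms (a fairness-style check) and to verify that unrelated factual knowledge is left intact (a specificity-style check). The sentence templates, target words, and the helper `masked_token_prob` are illustrative assumptions, not part of BiasKE.

```python
# Minimal sketch of BiasKE-style probing, assuming a Hugging Face masked LM.
# Templates, targets, and metric names here are illustrative, not the benchmark's.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def masked_token_prob(template: str, target: str) -> float:
    """Probability assigned to `target` at the [MASK] position of `template`."""
    inputs = tokenizer(template, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits[0, mask_pos].softmax(dim=-1)
    target_id = tokenizer.convert_tokens_to_ids(target)
    return probs[0, target_id].item()

# Fairness: the stereotyped prediction should not depend on the group term.
p_he = masked_token_prob("He works as a [MASK].", "nurse")
p_she = masked_token_prob("She works as a [MASK].", "nurse")
fairness_gap = abs(p_he - p_she)  # smaller gap = more parity on this pair

# Specificity: debiasing should leave unrelated factual knowledge unchanged,
# so this probability should stay high before and after editing.
p_fact = masked_token_prob("The capital of France is [MASK].", "paris")

print(f"fairness gap: {fairness_gap:.4f}, factual prob: {p_fact:.4f}")
```

Running the same probes before and after a debiasing edit gives a rough picture of whether parity improves (fairness) without disturbing retained knowledge (specificity); generalization would additionally require evaluating on paraphrased or held-out templates.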