Potential and Challenges of Model Editing for Social Debiasing (2402.13462v1)
Abstract: LLMs trained on vast corpora suffer from inevitable stereotype biases. Mitigating these biases with fine-tuning can be both costly and data-hungry. Model editing methods, which modify LLMs in a post-hoc manner, hold great potential for debiasing. However, the field lacks a comprehensive study that covers both internal and external model editing methods, supports various bias types, and examines the pros and cons of applying editing methods to stereotypical debiasing. To bridge this gap, we carefully formulate social debiasing as an editing problem and benchmark seven existing model editing algorithms on stereotypical debiasing, i.e., debias editing. Our findings across three scenarios reveal both the potential and the challenges of debias editing: (1) existing model editing methods can effectively preserve knowledge and mitigate biases, but the generalization of the debiasing effect from edited sentences to semantically equivalent sentences is limited; (2) sequential editing highlights the robustness of SERAC (Mitchell et al. 2022b), while internal editing methods degrade with the number of edits; (3) model editing algorithms generalize to unseen biases, both within the same bias type and across different types. In light of these findings, we further propose two simple but effective methods to improve debias editing, and experimentally show their effectiveness.
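To make the setup concrete, below is a minimal, illustrative sketch (not the paper's code) of how a debias edit request can be specified and scored: the request pairs a stereotypical statement with its anti-stereotypical counterpart, and a CrowS-Pairs-style log-probability gap measures whether the model still prefers the stereotype. The `gpt2` checkpoint and the nurse example pair are assumptions for illustration only.

```python
# Illustrative sketch only: frames a debias edit request and a CrowS-Pairs-style
# bias score. The "gpt2" checkpoint and the example pair are assumptions, not
# taken from the paper; any causal LM from Hugging Face transformers works.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability the LM assigns to a full sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # .loss is the mean negative log-likelihood over the predicted tokens,
        # so multiplying by the number of predicted tokens recovers the total.
        out = model(ids, labels=ids)
    return -out.loss.item() * (ids.size(1) - 1)

# One debias edit request: steer the model away from the stereotypical variant.
edit_request = {
    "stereotype": "The nurse said that she would be late.",
    "anti_stereotype": "The nurse said that he would be late.",
}

# Gap > 0 means the model prefers the stereotype. A successful debias edit
# should shrink this gap toward zero while leaving unrelated knowledge intact
# (locality) and transferring to paraphrases (generality).
gap = sentence_logprob(edit_request["stereotype"]) - sentence_logprob(
    edit_request["anti_stereotype"]
)
print(f"pre-edit stereotype preference (log-prob gap): {gap:+.3f}")
```

Internal editing methods (e.g., ROME, MEMIT) would update model weights to satisfy such requests, while external methods such as SERAC route them through a side memory, which is consistent with the sequential-editing robustness finding above.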
- Persistent anti-Muslim bias in large language models.
- DUnE: Dataset for unified editing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1847–1861, Singapore. Association for Computational Linguistics.
- Unmasking contextual stereotypes: Measuring and mitigating BERT’s gender bias. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pages 1–16, Barcelona, Spain (Online). Association for Computational Linguistics.
- Increasing diversity while maintaining accuracy: Text data generation with large language models and human interventions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 575–593, Toronto, Canada. Association for Computational Linguistics.
- Editing factual knowledge in language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6491–6506, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Queer people are people first: Deconstructing sexual identity stereotypes in large language models.
- Queens are powerful too: Mitigating gender bias in dialogue generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8173–8188, Online. Association for Computational Linguistics.
- PaLM-E: An embodied multimodal language model.
- Improving gender fairness of pre-trained language models without catastrophic forgetting. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1249–1262, Toronto, Canada. Association for Computational Linguistics.
- Bias and fairness in large language models: A survey.
- A framework for few-shot language model evaluation.
- Gender-tuning: Empowering fine-tuning for debiasing pre-trained language models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 5448–5458, Toronto, Canada. Association for Computational Linguistics.
- Detoxifying text with MaRCo: Controllable revision with experts and anti-experts. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 228–242, Toronto, Canada. Association for Computational Linguistics.
- Balancing out bias: Achieving fairness through balanced training. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11335–11350, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Detect and perturb: Neutral rewriting of biased and sensitive text via gradient-based decoding. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4173–4181, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Controlling bias exposure for fair interpretable predictions. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5854–5866, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Inspecting and editing knowledge representations in language models.
- Reducing sentiment bias in language models via counterfactual evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 65–83, Online. Association for Computational Linguistics.
- Critic-guided decoding for controlled text generation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 4598–4612, Toronto, Canada. Association for Computational Linguistics.
- Prompt tuning pushes farther, contrastive learning pulls closer: A two-stage approach to mitigate social biases. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14254–14267, Toronto, Canada. Association for Computational Linguistics.
- Debiasing algorithm through model adaptation.
- TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland. Association for Computational Linguistics.
- Does gender matter? Towards fairness in dialogue systems. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4403–4416, Barcelona, Spain (Online). International Committee on Computational Linguistics.
- InterFair: Debiasing with natural language feedback for fair interpretable predictions.
- Editing personality for LLMs.
- Understanding stereotypes in language models: Towards robust measurement and zero-shot debiasing. arXiv preprint arXiv:2212.10678.
- Using in-context learning to improve dialogue safety.
- Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35.
- Mass-editing memory in a transformer.
- Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium. Association for Computational Linguistics.
- Fast model editing at scale.
- Memory-based model editing at scale. In International Conference on Machine Learning, pages 15817–15831. PMLR.
- StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5356–5371, Online. Association for Computational Linguistics.
- CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967, Online. Association for Computational Linguistics.
- Nationality bias in text generation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 116–122, Dubrovnik, Croatia. Association for Computational Linguistics.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Never too late to learn: Regularizing gender bias in coreference resolution. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, WSDM '23, pages 15–23, New York, NY, USA. Association for Computing Machinery.
- Layered bias: Interpreting bias in pretrained large language models. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 284–295, Singapore. Association for Computational Linguistics.
- Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108, Online. Association for Computational Linguistics.
- Perturbation augmentation for fairer NLP. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9496–9521, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Language models are unsupervised multitask learners.
- A trip towards fairness: Bias and de-biasing in large language models. arXiv preprint arXiv:2305.13862.
- WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
- First the worst: Finding better gender translations during beam search. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3814–3823, Dublin, Ireland. Association for Computational Linguistics.
- Massive editing for large language models via meta learning.
- Text style transfer for bias mitigation using masked language modeling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, pages 163–171, Hybrid: Seattle, Washington + Online. Association for Computational Linguistics.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Llama 2: Open foundation and fine-tuned chat models.
- EasyEdit: An easy-to-use knowledge editing framework for large language models.
- ChatCAD: Interactive computer-aided diagnosis on medical image using large language models.
- Measuring and reducing gendered correlations in pre-trained models. arXiv preprint arXiv:2010.06032.
- Compensatory debiasing for gender imbalances in language models. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
- ADEPT: A debiasing prompt framework. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 10780–10788.
- Unlearning bias in language models by partitioning gradients. In Findings of the Association for Computational Linguistics: ACL 2023, pages 6032–6048, Toronto, Canada. Association for Computational Linguistics.
- A comprehensive study of knowledge editing for large language models.
- Can we edit factual knowledge by in-context learning?
- Causal-Debias: Unifying debiasing in pretrained language models and fine-tuning via causal invariant learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4227–4241, Toronto, Canada. Association for Computational Linguistics.
- Modifying memories in transformer models.
- Counterfactual data augmentation for mitigating gender stereotypes in languages with rich morphology. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1651–1661, Florence, Italy. Association for Computational Linguistics.
- Representation engineering: A top-down approach to AI transparency.
Authors: Jianhao Yan, Futing Wang, Yafu Li, Yue Zhang