Leveraging Prototypical Representations for Mitigating Social Bias without Demographic Information (2403.09516v3)
Abstract: Mitigating social biases typically requires identifying the social groups associated with each data sample. In this paper, we present DAFair, a novel approach to address social bias in LLMs. Unlike traditional methods that rely on explicit demographic labels, our approach does not require any such information. Instead, we leverage predefined prototypical demographic texts and incorporate a regularization term during the fine-tuning process to mitigate bias in the model's representations. Our empirical results across two tasks and two models demonstrate the effectiveness of our method compared to previous approaches that do not rely on labeled data. Moreover, with limited demographic-annotated data, our approach outperforms common debiasing approaches.
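The abstract only names the ingredients of DAFair (predefined prototypical demographic texts plus a regularization term added during fine-tuning), so the snippet below is a minimal illustrative sketch rather than the paper's exact loss. It assumes, purely for illustration, that the regularizer is a KL term pushing the model's similarity distribution over prototype representations toward uniform, so the task representation is equally close to every demographic prototype. The backbone (`bert-base-uncased`), the prototype sentences, and the helper names (`encode`, `dafair_style_regularizer`, `lambda_reg`) are placeholders introduced here, not taken from the paper.

```python
# Sketch only: assumes a KL-to-uniform regularizer over prototype similarities,
# which is one plausible reading of "a regularization term during fine-tuning".
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # illustrative backbone, not necessarily the paper's
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)

# Hypothetical prototypical demographic texts (placeholders, not the paper's prompts).
prototype_texts = [
    "He is a man who works as a professional.",
    "She is a woman who works as a professional.",
]

def encode(texts):
    """Return [CLS] representations for a batch of texts."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]  # shape: (num_texts, hidden_dim)

def dafair_style_regularizer(sample_reprs, prototype_reprs, temperature=1.0):
    """KL divergence between a uniform target and the softmax-normalized
    similarities of each sample to the prototype representations.
    The term is zero when a sample is equally similar to all prototypes."""
    sims = sample_reprs @ prototype_reprs.T / temperature   # (batch, num_prototypes)
    log_p = F.log_softmax(sims, dim=-1)                     # model's similarity distribution (log space)
    uniform = torch.full_like(log_p, 1.0 / log_p.size(-1))  # uniform target over prototypes
    return F.kl_div(log_p, uniform, reduction="batchmean")

# Hypothetical use inside a fine-tuning loop, where task_loss comes from the
# downstream classification head and lambda_reg weights the bias regularizer:
#   proto_reprs = encode(prototype_texts)
#   reg = dafair_style_regularizer(cls_reprs, proto_reprs)
#   loss = task_loss + lambda_reg * reg
```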