Bias Testing and Mitigation in LLM-based Code Generation (2309.14345v3)
Abstract: Powered by state-of-the-art LLMs, automatic code generation models play a pivotal role in enhancing the productivity of software development. As LLMs become more widespread in software coding ecosystems, a pressing issue has emerged: does the generated code contain social bias and unfairness, such as bias related to age, gender, and race? This issue concerns the integrity, fairness, and ethical foundation of software applications that depend on code generated by these models, yet it is under-explored in the literature. This paper presents a novel bias testing framework specifically designed for code generation tasks. Based on this framework, we conduct an extensive evaluation of the bias in code generated by five state-of-the-art LLMs. Our findings reveal that 20.29% to 44.93% of the code functions generated by the models under study are biased when handling bias-sensitive tasks (i.e., tasks that involve sensitive attributes such as age and gender). This indicates that existing LLMs can be unfair in code generation, posing risks of unintended and harmful software behavior. To mitigate bias in code generation models, we evaluate five bias mitigation prompting strategies: zero-shot prompting that feeds bias testing results back to the model to refine the code, one-shot and few-shot prompting, and two Chain-of-Thought (CoT) prompting strategies. Our evaluation shows that all of these strategies are effective, with one-shot and few-shot learning being the two most effective overall. For GPT-4, 80% to 90% of the code bias can be removed with one-shot learning.
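To make the testing idea concrete, below is a minimal sketch (not the paper's actual implementation) of how bias in a generated function might be detected: the function is invoked on inputs that are identical except for one sensitive attribute, and any divergence in output is flagged. The task, the function `assess_insurance_eligibility`, and the helper `is_biased` are all hypothetical illustrations.

```python
# Hypothetical function produced by a code generation model for the task
# "decide whether an applicant is eligible for health insurance".
def assess_insurance_eligibility(age: int, gender: str, income: float) -> bool:
    # Deliberately biased toy implementation: the decision depends on gender.
    return income > 30000 and gender == "male"

def is_biased(func, base_input: dict, attribute: str, values: list) -> bool:
    """Return True if varying only `attribute` changes the function's output."""
    outputs = {func(**{**base_input, attribute: v}) for v in values}
    return len(outputs) > 1  # differing outputs => the attribute drove the decision

base = {"age": 40, "gender": "male", "income": 50000.0}
print(is_biased(assess_insurance_eligibility, base, "gender", ["male", "female"]))
# True: otherwise-identical applicants are treated differently by gender.
```

A real framework would generalize this check across many sensitive attributes, value combinations, and generated functions; the core signal is the same, namely that a sensitive attribute alone changes the function's behavior.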
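Likewise, a rough sketch of the zero-shot mitigation strategy, in which the bias test report is fed back to the model alongside the offending code. The prompt wording here is hypothetical, not the paper's exact template:

```python
# Illustrative zero-shot mitigation prompt: feed the bias test result back
# to the model and ask it to refine the generated code.
biased_code = '''
def assess_insurance_eligibility(age, gender, income):
    return income > 30000 and gender == "male"
'''
test_report = "Output changes when only `gender` is varied: biased on `gender`."

mitigation_prompt = (
    "An automated bias test flagged the following function as biased.\n"
    f"Bias report: {test_report}\n"
    f"Code:\n{biased_code}\n"
    "Rewrite the function so its output does not depend on any sensitive "
    "attribute (e.g., gender, age, race) while preserving the intended logic."
)
print(mitigation_prompt)
```

The one-shot and few-shot variants extend this idea by prepending one or more worked examples of a biased function and its debiased rewrite before the request.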
Authors: Dong Huang, Qingwen Bu, Jie Zhang, Xiaofei Xie, Junjie Chen, Heming Cui