
Bias Testing and Mitigation in LLM-based Code Generation

Published 3 Sep 2023 in cs.SE and cs.AI | arXiv:2309.14345v4

Abstract: As the adoption of LLMs becomes more widespread in software coding ecosystems, a pressing issue has emerged: does the generated code contain social bias and unfairness, such as bias related to age, gender, and race? This issue concerns the integrity, fairness, and ethical foundation of software applications that depend on code generated by these models, yet it remains underexplored in the literature. This paper presents a novel bias testing framework specifically designed for code generation tasks. Based on this framework, we conduct an extensive empirical study of the biases in code generated by five widely studied LLMs (PALM-2-CodeChat-bison, Claude-instant-1, GPT-3.5-turbo, GPT-4-turbo, and GPT-4). Our findings reveal that biases are prevalent: for example, 13.47% to 49.10% of the code generated by these LLMs exhibits biased behavior towards gender. Moreover, we study five bias mitigation prompting strategies commonly used in current code generation scenarios, i.e., zero-shot, one-shot, few-shot, and two Chain-of-Thought (CoT) prompts, each with and without feedback-driven refinement. Our evaluation shows that direct prompt engineering strategies have limited effectiveness in mitigating bias, whereas our test-execution feedback reduces the ratio of biased code to a large extent (e.g., from 59.88% to 4.79% for GPT-4).
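To make the idea of "biased behavior" in generated code concrete, the following is a minimal hypothetical sketch (not the paper's actual framework) of how one might test a generated function for dependence on a protected attribute: execute it on inputs that differ only in that attribute and flag any divergence in the output. The function `generated_assess` stands in for LLM-generated code under test and is deliberately biased to demonstrate a detection.

```python
# Hypothetical sketch: counterfactual bias check on generated code.

def generated_assess(profile):
    # Stand-in for LLM-generated code under test; this toy version is
    # deliberately biased on the "gender" field so the check fires.
    score = profile["experience"] * 2
    if profile["gender"] == "male":
        score += 1
    return score

def has_protected_attribute_bias(func, base_profile, attr, values):
    """Return True if changing only `attr` changes the function's output."""
    outputs = set()
    for value in values:
        profile = dict(base_profile, **{attr: value})
        outputs.add(func(profile))
    return len(outputs) > 1

base = {"experience": 5, "gender": "female"}
print(has_protected_attribute_bias(
    generated_assess, base, "gender", ["female", "male"]
))  # True: the output depends on gender
```

A real framework along these lines would additionally generate the candidate code via an LLM, sandbox its execution, and sweep many base profiles and protected attributes (age, gender, race) rather than a single handcrafted case.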
