Bugs in Large Language Model Generated Code: An Empirical Study (2403.08937v2)
Abstract: LLMs for code have recently gained significant attention. They can generate code in different programming languages from provided prompts, fulfilling the long-standing dream of automatic code generation in Software Engineering (SE). Like human-written code, LLM-generated code is prone to bugs, and these bugs have not yet been thoroughly examined by the community. Given the increasing adoption of LLM-based code generation tools (e.g., GitHub Copilot) in SE activities, it is critical to understand the characteristics of the bugs contained in code generated by LLMs. This paper examines a sample of 333 bugs collected from code generated by three leading LLMs (CodeGen, PanGu-Coder, and Codex) and identifies the following 10 distinctive bug patterns: Misinterpretations, Syntax Error, Silly Mistake, Prompt-biased Code, Missing Corner Case, Wrong Input Type, Hallucinated Object, Wrong Attribute, Incomplete Generation, and Non-Prompted Consideration. The bug patterns are organized into a taxonomy and validated through an online survey of 34 LLM practitioners and researchers, who generally confirmed the significance and prevalence of the patterns. Researchers and practitioners can leverage these findings to develop effective quality assurance techniques for LLM-generated code. Overall, this study sheds light on the distinctive characteristics of LLM-generated code.
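To make two of these patterns concrete, the sketch below shows what a Missing Corner Case bug and a Hallucinated Object bug can look like in practice. The prompt, the function names, and the buggy outputs are hypothetical illustrations constructed for this summary, not samples from the paper's 333-bug dataset:

```python
# Illustrative sketch only: a hypothetical prompt and hypothetical
# LLM outputs, NOT examples drawn from the paper's bug sample.
import statistics

# Prompt (hypothetical): "Return the average of a list of numbers."

def average_buggy(numbers):
    # Missing Corner Case: an empty list is not handled, so this
    # raises ZeroDivisionError instead of failing gracefully.
    return sum(numbers) / len(numbers)

def average_hallucinated(numbers):
    # Hallucinated Object: the model invents statistics.average,
    # which does not exist (the real API is statistics.mean), so
    # calling this raises AttributeError.
    return statistics.average(numbers)

def average_fixed(numbers):
    # Corrected version: handles the corner case and uses a real API.
    if not numbers:
        raise ValueError("cannot average an empty list")
    return statistics.mean(numbers)

if __name__ == "__main__":
    print(average_fixed([1.0, 2.0, 3.0]))  # 2.0
    try:
        average_buggy([])
    except ZeroDivisionError:
        print("Missing Corner Case: average_buggy crashes on []")
    try:
        average_hallucinated([1, 2, 3])
    except AttributeError:
        print("Hallucinated Object: statistics.average does not exist")
```

Both buggy variants are syntactically valid and pass a casual read, which is what makes such patterns easy to miss in review and motivates dedicated quality assurance for LLM-generated code.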
Authors: Florian Tambon, Arghavan Moradi Dakhel, Amin Nikanjam, Foutse Khomh, Michel C. Desmarais, and Giuliano Antoniol