Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues
Abstract: We systematically study the quality of 4,066 ChatGPT-generated programs for 2,033 programming tasks in two popular programming languages, Java and Python. The goal of this work is threefold. First, we analyze the correctness of ChatGPT on code generation tasks and identify the factors that influence its effectiveness, including task difficulty, programming language, the time at which a task was introduced, and program size. Second, we identify and characterize quality issues in ChatGPT-generated code. Third, we provide insights into how these issues can be mitigated. Our experiments show that of the 4,066 programs generated by ChatGPT, 2,756 are correct, 1,082 produce wrong outputs, and 177 contain compilation or runtime errors. We further analyze other characteristics of the generated code, such as code style and maintainability, using static analysis tools, and find that 1,930 ChatGPT-generated code snippets suffer from maintainability issues. We then investigate ChatGPT's self-repair ability, in combination with feedback from static analysis tools, for fixing the errors uncovered in the previous step. The results suggest that ChatGPT can partially address these challenges, improving code quality by more than 20%, but limitations and opportunities for improvement remain. Overall, our study provides valuable insights into the current limitations of ChatGPT and offers a roadmap for future research and development efforts to enhance the code generation capabilities of AI models like ChatGPT.
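The abstract's correctness categories (correct, wrong output, compilation/runtime error) imply a triage step before any test execution: a generated snippet must at least parse. The following is a minimal, hypothetical sketch (not the paper's actual harness) of that first-stage check for Python snippets, using only the standard library's `ast` module; the function name `classify_snippet` is our own for illustration.

```python
import ast

def classify_snippet(source: str) -> str:
    """Coarse first-stage triage: does a generated snippet parse at all?
    Distinguishing 'correct' from 'wrong output' or runtime errors would
    additionally require executing the task's test suite."""
    try:
        ast.parse(source)
        return "compiles"
    except SyntaxError:
        return "compilation error"

snippets = [
    "def add(a, b):\n    return a + b\n",  # well-formed
    "def broken(:\n    pass\n",            # malformed parameter list
]
print([classify_snippet(s) for s in snippets])
# → ['compiles', 'compilation error']
```

Snippets that pass this stage would then be run against the task's reference tests, and those that fail could be fed back to ChatGPT together with static-analysis diagnostics (e.g., from Pylint or Flake8) for the self-repair step the abstract describes.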
Replication package: https://github.com/yueyueL/ChatGPT-CodeGenAnalysis