Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues (2307.12596v2)

Published 24 Jul 2023 in cs.SE

Abstract: We systematically study the quality of 4,066 ChatGPT-generated programs implemented in two popular programming languages, i.e., Java and Python, for 2,033 programming tasks. The goal of this work is threefold. First, we analyze the correctness of ChatGPT on code generation tasks and uncover the factors that influence its effectiveness, including task difficulty, programming language, the time at which tasks were introduced, and program size. Second, we identify and characterize potential issues with the quality of ChatGPT-generated code. Last, we provide insights into how these issues can be mitigated. Experiments highlight that out of 4,066 programs generated by ChatGPT, 2,756 are deemed correct, 1,082 produce wrong outputs, and 177 contain compilation or runtime errors. Additionally, we analyze other characteristics of the generated code through static analysis tools, such as code style and maintainability, and find that 1,930 ChatGPT-generated code snippets suffer from maintainability issues. Subsequently, we investigate ChatGPT's self-repairing ability and its interaction with static analysis tools to fix the errors uncovered in the previous step. Experiments suggest that ChatGPT can partially address these challenges, improving code quality by more than 20%, but there are still limitations and opportunities for improvement. Overall, our study provides valuable insights into the current limitations of ChatGPT and offers a roadmap for future research and development efforts to enhance the code generation capabilities of AI models like ChatGPT.

The paper "Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues" explores a systematic paper of ChatGPT's performance in code generation tasks, focusing particularly on its ability to produce reliable and high-quality code in popular programming languages such as Java and Python. The paper involves assessing 4,066 ChatGPT-generated code snippets across 2,033 programming tasks sourced from LeetCode, which encompasses a range of difficulties and temporal introductions of tasks.

Objectives and Methodology:

  1. Correctness Analysis: The study first evaluates the correctness of the generated code by running it against the test suites provided by LeetCode, achieving pass rates of 66% for Python and 69% for Java, meaning a significant portion of tasks were solved correctly. The paper also examines factors influencing ChatGPT's reliability, including task difficulty, task introduction time, and code length, and finds diminished effectiveness on recently introduced tasks and on tasks requiring longer programs. A minimal evaluation harness is sketched after this list.
  2. Code Quality Characterization: Even when functionally correct, many snippets exhibit quality issues, identified with static analysis tools such as Pylint, Flake8, PMD, and CheckStyle. These tools flag style violations, maintainability problems, and errors; notably, 47% of the snippets have maintainability concerns, leading the authors to emphasize the need to refine code beyond mere correctness. A linting sketch also follows this list.
  3. Self-Repair Ability and Mitigation Strategies: The investigation into ChatGPT's capacity to rectify identified faults via follow-up prompts shows partial success: including feedback from static analysis tools and runtime errors in the prompt makes repairs more effective, improving code quality by more than 20%. A repair-loop sketch closes out the examples below.
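
To make step 1 concrete, here is a minimal sketch of scoring one generated program against a task's tests, mirroring the paper's three verdicts (correct, wrong output, compilation/runtime error). The `two_sum` function and the test data are illustrative stand-ins for a ChatGPT-generated program and LeetCode's hidden tests, not material from the paper's replication package:

```python
# Run a generated solution against a task's test cases and report a verdict.
from typing import Callable, List, Tuple

def evaluate(solution: Callable, tests: List[Tuple[tuple, object]]) -> str:
    """Return 'correct', 'wrong output', or 'runtime error' for one program."""
    for args, expected in tests:
        try:
            actual = solution(*args)
        except Exception:
            return "runtime error"
        if actual != expected:
            return "wrong output"
    return "correct"

# Hypothetical ChatGPT-generated solution for LeetCode's "Two Sum".
def two_sum(nums: List[int], target: int) -> List[int]:
    seen = {}
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i
    return []

tests = [(([2, 7, 11, 15], 9), [0, 1]), (([3, 3], 6), [0, 1])]
print(evaluate(two_sum, tests))  # -> correct
```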
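
Step 2's tool-based characterization can be approximated with a short driver like the one below, which runs two of the paper's Python linters and counts the diagnostics each reports. It assumes Pylint and Flake8 are installed locally (`pip install pylint flake8`); the file name is hypothetical:

```python
# Count Pylint and Flake8 diagnostics for one generated Python file.
import subprocess
import sys

def lint(path: str) -> dict:
    """Collect raw issue lines from Pylint and Flake8 for one file."""
    commands = {
        "pylint": [sys.executable, "-m", "pylint", "--output-format=text", path],
        "flake8": [sys.executable, "-m", "flake8", path],
    }
    reports = {}
    for tool, args in commands.items():
        result = subprocess.run(args, capture_output=True, text=True)
        # Both tools emit one "<path>:<line>:<col>: <code> ..." diagnostic
        # per stdout line; header and score lines do not contain the path.
        reports[tool] = [ln for ln in result.stdout.splitlines() if path in ln]
    return reports

for tool, issues in lint("generated_solution.py").items():
    print(f"{tool}: {len(issues)} issue(s)")
```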
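
Finally, the self-repair setup in step 3 amounts to a feedback loop between the model and the checking tools. The sketch below captures only its shape: `ask_model` is a placeholder for whichever ChatGPT client is used, and `check` stands in for the test-plus-linter pipeline; neither interface is prescribed by the paper:

```python
# Iteratively repair generated code by feeding tool feedback back to the model.
def ask_model(prompt: str) -> str:
    """Placeholder: send `prompt` to ChatGPT and return the code it emits."""
    raise NotImplementedError

def self_repair(task: str, check, max_rounds: int = 3) -> str:
    """`check(code)` returns a feedback string, or '' when the code is clean."""
    code = ask_model(f"Solve this task:\n{task}")
    for _ in range(max_rounds):
        feedback = check(code)
        if not feedback:
            break
        # Per the paper, repairs work better when the prompt carries concrete
        # static-analysis or runtime feedback rather than a bare "fix it".
        code = ask_model(
            f"The following code has problems:\n{code}\n"
            f"Static analysis / runtime feedback:\n{feedback}\n"
            "Please return a corrected version."
        )
    return code
```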

Findings and Implications:

  • Influence of Task Variables: Performance varies markedly with task difficulty, introduction period, and code size, all of which critically affect ChatGPT's generation efficacy.
  • Prevalence of Style and Maintainability Issues: A substantial fraction of the code, while functionally correct, suffers from poor styling and maintainability, which could hamper long-term maintenance and evolution.
  • Repair Effectiveness: ChatGPT's ability to self-mitigate code quality issues is conditional, relying heavily on the specificity and detail of the feedback provided across iterations.

Conclusion and Future Work:

The paper concludes that, while ChatGPT demonstrates strong potential in automating code generation, considerable advancements in mitigating code quality issues are imperative. Further research is encouraged in areas like enhancing prompt engineering and developing interactive feedback loops to bolster ChatGPT's competence in producing more reliable, efficient, and maintainable code. The authors suggest that future improvements should consider augmenting the model with explicit semantic understanding to address current limitations effectively.

Authors (7)
  1. Yue Liu
  2. Thanh Le-Cong
  3. Ratnadira Widyasari
  4. Chakkrit Tantithamthavorn
  5. Li Li
  6. Xuan-Bach D. Le
  7. David Lo