Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues (2307.12596v2)

Published 24 Jul 2023 in cs.SE

Abstract: We systematically study the quality of 4,066 ChatGPT-generated programs implemented in two popular programming languages, i.e., Java and Python, for 2,033 programming tasks. The goal of this work is threefold. First, we analyze the correctness of ChatGPT on code generation tasks and uncover the factors that influence its effectiveness, including task difficulty, programming language, the time at which tasks were introduced, and program size. Second, we identify and characterize potential issues with the quality of ChatGPT-generated code. Last, we provide insights into how these issues can be mitigated. Experiments highlight that out of 4,066 programs generated by ChatGPT, 2,756 are deemed correct, 1,082 produce wrong outputs, and 177 contain compilation or runtime errors. Additionally, we analyze other characteristics of the generated code through static analysis tools, such as code style and maintainability, and find that 1,930 ChatGPT-generated code snippets suffer from maintainability issues. Subsequently, we investigate ChatGPT's self-repairing ability and its interaction with static analysis tools to fix the errors uncovered in the previous step. Experiments suggest that ChatGPT can partially address these challenges, improving code quality by more than 20%, but there are still limitations and opportunities for improvement. Overall, our study provides valuable insights into the current limitations of ChatGPT and offers a roadmap for future research and development efforts to enhance the code generation capabilities of AI models like ChatGPT.

The paper "Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues" explores a systematic paper of ChatGPT's performance in code generation tasks, focusing particularly on its ability to produce reliable and high-quality code in popular programming languages such as Java and Python. The paper involves assessing 4,066 ChatGPT-generated code snippets across 2,033 programming tasks sourced from LeetCode, which encompasses a range of difficulties and temporal introductions of tasks.

Objectives and Methodology:

  1. Correctness Analysis: The study first evaluates the correctness of the generated code by running it against the test suites provided by LeetCode, achieving pass rates of 66% for Python and 69% for Java, meaning a significant portion of tasks were solved correctly. The paper also examines factors influencing ChatGPT's reliability, including task difficulty, task introduction time, and code length, and finds diminished effectiveness on recently introduced tasks and on tasks requiring longer programs. A minimal evaluation harness is sketched after this list.
  2. Code Quality Characterization: Even when functionally correct, many snippets exhibit quality issues, identified with static analysis tools such as Pylint, Flake8, PMD, and CheckStyle. These tools flag style violations, maintainability problems, and errors; notably, 47% of the snippets have maintainability concerns, leading the authors to emphasize the need to refine code beyond mere correctness. A linting sketch also follows this list.
  3. Self-Repair Ability and Mitigation Strategies: The investigation into ChatGPT's capacity to rectify identified faults via follow-up prompts shows partial success: including feedback from static analysis tools and runtime errors in the prompt makes repairs more effective, improving code quality by more than 20%. A repair-loop sketch closes out the examples below.
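
To make step 1 concrete, here is a minimal sketch of scoring one generated program against a task's tests, mirroring the paper's three verdicts (correct, wrong output, compilation/runtime error). The `two_sum` function and the test data are illustrative stand-ins for a ChatGPT-generated program and LeetCode's hidden tests, not material from the paper's replication package:

```python
# Run a generated solution against a task's test cases and report a verdict.
from typing import Callable, List, Tuple

def evaluate(solution: Callable, tests: List[Tuple[tuple, object]]) -> str:
    """Return 'correct', 'wrong output', or 'runtime error' for one program."""
    for args, expected in tests:
        try:
            actual = solution(*args)
        except Exception:
            return "runtime error"
        if actual != expected:
            return "wrong output"
    return "correct"

# Hypothetical ChatGPT-generated solution for LeetCode's "Two Sum".
def two_sum(nums: List[int], target: int) -> List[int]:
    seen = {}
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i
    return []

tests = [(([2, 7, 11, 15], 9), [0, 1]), (([3, 3], 6), [0, 1])]
print(evaluate(two_sum, tests))  # -> correct
```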
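
Step 2's tool-based characterization can be approximated with a short driver like the one below, which runs two of the paper's Python linters and counts the diagnostics each reports. It assumes Pylint and Flake8 are installed locally (`pip install pylint flake8`); the file name is hypothetical:

```python
# Count Pylint and Flake8 diagnostics for one generated Python file.
import subprocess
import sys

def lint(path: str) -> dict:
    """Collect raw issue lines from Pylint and Flake8 for one file."""
    commands = {
        "pylint": [sys.executable, "-m", "pylint", "--output-format=text", path],
        "flake8": [sys.executable, "-m", "flake8", path],
    }
    reports = {}
    for tool, args in commands.items():
        result = subprocess.run(args, capture_output=True, text=True)
        # Both tools emit one "<path>:<line>:<col>: <code> ..." diagnostic
        # per stdout line; header and score lines do not contain the path.
        reports[tool] = [ln for ln in result.stdout.splitlines() if path in ln]
    return reports

for tool, issues in lint("generated_solution.py").items():
    print(f"{tool}: {len(issues)} issue(s)")
```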
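
Finally, the self-repair setup in step 3 amounts to a feedback loop between the model and the checking tools. The sketch below captures only its shape: `ask_model` is a placeholder for whichever ChatGPT client is used, and `check` stands in for the test-plus-linter pipeline; neither interface is prescribed by the paper:

```python
# Iteratively repair generated code by feeding tool feedback back to the model.
def ask_model(prompt: str) -> str:
    """Placeholder: send `prompt` to ChatGPT and return the code it emits."""
    raise NotImplementedError

def self_repair(task: str, check, max_rounds: int = 3) -> str:
    """`check(code)` returns a feedback string, or '' when the code is clean."""
    code = ask_model(f"Solve this task:\n{task}")
    for _ in range(max_rounds):
        feedback = check(code)
        if not feedback:
            break
        # Per the paper, repairs work better when the prompt carries concrete
        # static-analysis or runtime feedback rather than a bare "fix it".
        code = ask_model(
            f"The following code has problems:\n{code}\n"
            f"Static analysis / runtime feedback:\n{feedback}\n"
            "Please return a corrected version."
        )
    return code
```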

Findings and Implications:

  • Influence of Task Variables: Performance varies markedly with task difficulty, introduction period, and code size, all of which critically affect ChatGPT's generation efficacy.
  • Prevalence of Style and Maintainability Issues: A substantial fraction of the code, while functionally correct, suffers from poor styling and maintainability, which could hamper long-term maintenance and evolution.
  • Repair Effectiveness: ChatGPT's ability to self-mitigate code quality issues is conditional, relying heavily on the specificity and detail of the feedback provided across iterations.

Conclusion and Future Work:

The paper concludes that, while ChatGPT demonstrates strong potential in automating code generation, considerable advancements in mitigating code quality issues are imperative. Further research is encouraged in areas like enhancing prompt engineering and developing interactive feedback loops to bolster ChatGPT's competence in producing more reliable, efficient, and maintainable code. The authors suggest that future improvements should consider augmenting the model with explicit semantic understanding to address current limitations effectively.

Authors (7)
  1. Yue Liu
  2. Thanh Le-Cong
  3. Ratnadira Widyasari
  4. Chakkrit Tantithamthavorn
  5. Li Li
  6. Xuan-Bach D. Le
  7. David Lo