- The paper demonstrates that about 15% of Copilot-generated Python files contain code smells, with 'Multiply-Nested Container' being the most prevalent.
- The study employs a keyword-based extraction method and Pysmell for detecting and classifying code smells within dynamically typed Python code.
- It reveals that detailed prompt structures make Copilot Chat significantly more effective at removing code smells and improving code quality, although the fixes may sometimes introduce new issues.
Introduction
AI-assisted development tools like GitHub Copilot have become an integral part of many coding workflows. These tools bring the power of large language models (LLMs) to developers, generating code from contextual snippets and natural-language prompts, but the quality of that output, and in particular the prevalence of code smells, remains a notable concern. Beiqi Zhang, Peng Liang, Qiong Feng, Yujia Fu, and Zengyang Li have undertaken a structured evaluation of code smells in Python code generated by GitHub Copilot, also analyzing how effectively Copilot Chat addresses these smells.
Methodology
The researchers assembled a dataset of 102 code smell instances drawn from Python code generated by Copilot. Given Python's dynamic nature, they focused on smells that can impact readability and maintainability. The evaluation centered on two research questions: how often code smells occur in the generated code, and how effectively Copilot Chat can fix them.
Using a keyword-based mining approach, the team extracted candidate Python files from GitHub that appeared to be Copilot-generated. Pysmell, a detection tool tailored to Python code smells, then scanned these files. The detected smells were classified by type, and the authors manually verified that every listed code smell indeed originated from Copilot-generated code. A rough sketch of such a mining-and-detection pipeline appears below.
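The paper does not reproduce its mining scripts here, so the snippet below is only a minimal sketch of what a keyword-based filter plus a smell-detection pass might look like. The keyword list, the directory layout, and the way Pysmell is invoked are all assumptions made for illustration, not the authors' actual setup.

```python
import subprocess
from pathlib import Path

# Hypothetical markers suggesting a file was saved as Copilot output.
# The actual keyword list used in the study is not reproduced here.
COPILOT_KEYWORDS = ("github copilot", "generated by copilot", "copilot suggestion")


def looks_copilot_generated(path: Path) -> bool:
    """Keyword-based filter: keep files whose text mentions Copilot."""
    text = path.read_text(encoding="utf-8", errors="ignore").lower()
    return any(keyword in text for keyword in COPILOT_KEYWORDS)


def scan_with_pysmell(path: Path) -> str:
    """Hand the file to a smell detector.

    Pysmell is a research prototype; the command line below is a placeholder
    for however the tool is actually invoked in your environment.
    """
    result = subprocess.run(
        ["python", "pysmell.py", str(path)],  # placeholder invocation
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # "mined_repos" is an assumed local directory of cloned repositories.
    candidates = [p for p in Path("mined_repos").rglob("*.py") if looks_copilot_generated(p)]
    for candidate in candidates:
        print(candidate, scan_with_pysmell(candidate))
```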
Results
The researchers noted that approximately 15% of the evaluated files contained code smells, with "Multiply-Nested Container" being the most prevalent. Copilot-generated Python code is thus not immune to suboptimal patterns that can increase error proneness and hinder maintainability; an illustrative example of the most common smell follows.
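The paper reports this smell by name; the fragment below is an illustrative example (not taken from the paper or from Copilot output) of what a Multiply-Nested Container looks like, and one common way to flatten it.

```python
from collections import namedtuple

# Illustrative "Multiply-Nested Container": a list nested inside a dict inside
# another dict, so every access needs a chain of lookups that is easy to get
# wrong and hard to reason about.
schedule = {
    "team_a": {
        "monday": [("standup", 9), ("review", 14)],
        "tuesday": [("planning", 10)],
    },
}

# Reading a single entry requires knowing the full nesting path.
first_meeting = schedule["team_a"]["monday"][0][0]  # "standup"

# A flatter alternative: one record per meeting keeps the structure a single
# level deep and self-describing.
Meeting = namedtuple("Meeting", ["team", "day", "title", "hour"])
meetings = [
    Meeting("team_a", "monday", "standup", 9),
    Meeting("team_a", "monday", "review", 14),
    Meeting("team_a", "tuesday", "planning", 10),
]
```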
The paper also shed light on Copilot Chat, a beta service positioned to improve code quality through natural-language interaction. Using different prompt structures, the researchers evaluated Copilot Chat's ability to fix the detected smells, and found that a more detailed prompt structure was significantly more effective than a simple one. A hypothetical illustration of the two prompt styles follows.
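The exact prompt templates are defined in the paper; the pair below is only a hypothetical illustration of the difference between a terse request and a detailed one that names the smell and states the desired outcome.

```python
# Hypothetical prompts for Copilot Chat; not the templates from the paper.
simple_prompt = "Fix this code."

detailed_prompt = (
    "This function contains a Multiply-Nested Container code smell: "
    "a dict of dicts of lists. Refactor it to reduce the nesting, "
    "keep the existing behavior, and do not introduce new code smells."
)
```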
Discussion and Implications
The paper indicates that while Copilot Chat shows promise in addressing Python code smells, it can also introduce additional smells during the fixing process. Developers should therefore remain cautious and use detailed, specific prompts to guide Copilot Chat. These insights could inform improvements to automated code generation tools, helping keep code quality high and technical debt low in AI-assisted development environments.
Further exploration of how code smells are handled in other languages, and of the effects of varied prompt structures, could pave the way toward more robust AI-powered coding assistants. This research provides a foundation for continued work on code generation tools, optimizing them for industry-scale use without compromising code quality.