- The paper compares code generated by GPT-4o with human-written code on LeetCode across quality, understandability, resource usage, and runtime performance metrics.
- The study found GPT-4o produced code with significantly fewer code smells and lower cognitive complexity than human submissions, indicating better quality and understandability.
- While GPT-4o code showed faster runtime, it consumed more memory and struggled with problems introduced after its training data cut-off, highlighting limitations in generalization.
Analyzing the Transformative Potential of AI in Software Engineering with LeetCode and ChatGPT
This paper presents a comprehensive examination of the impact of Generative AI (GenAI) on software engineering, using a large-scale dataset derived from LeetCode, a well-regarded platform for coding challenges, and OpenAI's GPT-4o model. It methodically compares code written by humans with code generated by GPT-4o along four axes of software quality: code quality, assessed via code smells; understandability, measured by cognitive complexity; resource utilization, measured by memory usage; and time behavior, assessed through runtime performance.
Key Findings
- GPT-4o's Code Quality Superiority: The paper shows that GPT-4o's generated code exhibits significantly fewer code smells per thousand lines of code (KLOC) than human-written code on LeetCode, suggesting superior code quality. The difference was statistically significant with a moderate effect size, highlighting GPT-4o's ability to produce cleaner code with potentially lower technical debt (the density metric is sketched in the first example after this list).
- Enhanced Code Understandability: Similarly, code generated by GPT-4o scored lower on cognitive complexity, indicating better understandability. This result was also statistically significant, albeit with a smaller practical effect, and suggests that GPT-4o tends to produce code that is easier to read and maintain (the second example after this list illustrates what such a difference looks like).
- Performance Efficiency Insights: While GPT-4o's solutions ran faster than the average user submission on LeetCode, they showed no memory advantage: the generated code ranked higher in memory consumption than the median human solution. In short, GPT-4o's solutions are more efficient in runtime but not in resource utilization.
- Limitations in Generalization: A revealing insight from the paper is that GPT-4o struggled with problems introduced after its training data cut-off, suggesting limited out-of-distribution generalization, an important consideration for deploying such models in dynamic environments.
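To make the smell-density finding concrete, here is a minimal sketch of the smells-per-KLOC normalization; the counts below are hypothetical, and the paper's exact analysis tooling and thresholds are not reproduced here.

```python
# Minimal sketch of the smell-density metric, assuming smell counts come
# from a static analyzer such as SonarQube. All numbers are hypothetical.
def smells_per_kloc(smell_count: int, lines_of_code: int) -> float:
    """Normalize a raw smell count to smells per thousand lines of code."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return smell_count / (lines_of_code / 1000)

# Hypothetical counts: 3 smells in a 40-line human solution vs.
# 1 smell in a 35-line generated solution.
print(smells_per_kloc(3, 40))  # 75.0 smells/KLOC
print(smells_per_kloc(1, 35))  # ~28.6 smells/KLOC
```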
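For the understandability finding, here is a hedged illustration of what a lower cognitive complexity score looks like, using SonarSource-style increments (+1 per break in linear flow, plus a penalty for each level of nesting). The functions are hypothetical, not drawn from the paper's dataset.

```python
def first_even_nested(values):
    # if (+1), for (+1, +1 nesting), if (+1, +2 nesting) -> total 6
    if values:
        for v in values:
            if v % 2 == 0:
                return v
    return None

def first_even_flat(values):
    # guard clause (+1), for (+1), if (+1, +1 nesting) -> total 4
    if not values:
        return None
    for v in values:
        if v % 2 == 0:
            return v
    return None
```

Both functions behave identically; the flat version simply reads in a straight line, which is exactly the property cognitive complexity rewards.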
Implications for AI in Software Development
The results of this paper underscore the nuanced advantages and limitations of AI models in software engineering:
- Practical Utility and Quality: GPT-4o's ability to produce code with fewer smells and better readability demonstrates real value as a software engineering tool, helping professionals write higher-quality code without extensive refactoring.
- Runtime vs. Memory Trade-offs: The strong runtime performance paired with suboptimal memory usage points to a need for further refinement of AI training, ideally optimizing models to balance both performance dimensions (the measurement sketch after this list shows how such a trade-off surfaces in practice).
- Generalization Capabilities: The difficulty with problems introduced after the training cut-off date highlights the need to improve AI models' robustness on unseen or novel problems, an area ripe for future research and technological development.
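A minimal way to observe this trade-off locally is to time a solution while tracking its peak allocation. Note that the paper relies on LeetCode's own runtime and memory ranks, so the standalone harness below is an assumption for illustration, not the paper's method; the two prime-counting functions are likewise hypothetical stand-ins for a fast-but-memory-hungry submission versus a lean-but-slow one.

```python
import time
import tracemalloc

def count_primes_sieve(n: int) -> int:
    """Sieve of Eratosthenes: fast, but allocates an O(n) table."""
    if n < 2:
        return 0
    sieve = bytearray([1]) * (n + 1)
    sieve[0] = sieve[1] = 0
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, n + 1, i):
                sieve[j] = 0
    return sum(sieve)

def count_primes_trial(n: int) -> int:
    """Trial division: O(1) extra memory, but far slower for large n."""
    def is_prime(k: int) -> bool:
        if k < 2:
            return False
        i = 2
        while i * i <= k:
            if k % i == 0:
                return False
            i += 1
        return True
    return sum(1 for k in range(2, n + 1) if is_prime(k))

# Measure each solution's wall-clock time and peak Python allocation.
for fn in (count_primes_sieve, count_primes_trial):
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(200_000)
    elapsed_ms = (time.perf_counter() - start) * 1000
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"{fn.__name__}: {result} primes, "
          f"{elapsed_ms:.1f} ms, peak {peak / 1024:.1f} KiB")
```

The sieve finishes far sooner but holds a table of roughly 200 KB at its peak, while trial division stays near-zero on memory at a large runtime cost, mirroring the kind of runtime-versus-memory tension the paper observes.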
Future Directions
The paper sets a benchmark for the study of AI's impact on software engineering, paving the way for multiple avenues of future research. Extending the analysis to other programming languages and LLM versions would show whether these results hold more broadly. Integrating such findings with user feedback on platforms like LeetCode could help fine-tune AI models and maximize their practical applicability. Comparing different AI models across varied problem sets is another intriguing direction for deepening our understanding of AI's role in reimagining software development paradigms.