An Empirical Study on the Potential of LLMs in Automated Software Refactoring (2411.04444v1)

Published 7 Nov 2024 in cs.SE

Abstract: Recent advances in LLMs make it potentially feasible to automatically refactor source code with LLMs. However, it remains unclear how well LLMs perform compared to human experts in conducting refactorings automatically and accurately. To fill this gap, in this paper, we conduct an empirical study to investigate the potential of LLMs in automated software refactoring, focusing on the identification of refactoring opportunities and the recommendation of refactoring solutions. We first construct a high-quality refactoring dataset comprising 180 real-world refactorings from 20 projects, and conduct the empirical study on the dataset. With the to-be-refactored Java documents as input, ChatGPT and Gemini identified only 28 and 7, respectively, out of the 180 refactoring opportunities. However, explaining the expected refactoring subcategories and narrowing the search space in the prompts substantially increased the success rate of ChatGPT from 15.6% to 86.7%. Concerning the recommendation of refactoring solutions, ChatGPT recommended 176 refactoring solutions for the 180 refactorings, and 63.6% of the recommended solutions were comparable to (or even better than) those constructed by human experts. However, 13 of the 176 solutions suggested by ChatGPT and 9 of the 137 solutions suggested by Gemini were unsafe in that they either changed the functionality of the source code or introduced syntax errors, which indicates the risk of LLM-based refactoring. To this end, we propose a detect-and-reapply tactic, called RefactoringMirror, to avoid such unsafe refactorings. By reapplying the identified refactorings to the original code using thoroughly tested refactoring engines, we can effectively mitigate the risks associated with LLM-based automated refactoring while still leveraging the LLMs' intelligence to obtain valuable refactoring recommendations.

Summary

  • The paper demonstrates that prompts naming the expected refactoring type raise ChatGPT's identification rate for refactoring opportunities from 15.6% to 52.2%, and to 86.7% when subcategories are explained and the search space is narrowed, with extract method cases identified most reliably.
  • The paper shows that over 60% of the refactoring solutions suggested by the LLMs match or exceed human-crafted ones, though about 7% are unsafe because they change functionality or introduce syntax errors.
  • The paper introduces the RefactoringMirror tactic, which detects the refactorings applied by an LLM and reapplies them to the original code with thoroughly tested refactoring engines, eliminating the detected unsafe edits.

An Empirical Study on the Potential of LLMs in Automated Software Refactoring

Refactoring, a cornerstone of maintaining and enhancing software quality, traditionally involves improving readability, maintainability, and reusability without altering functionality. Numerous tools and methodologies have emerged to support this task, yet they often require significant developer input, particularly in identifying opportunities and applying solutions. This paper presents an empirical study analyzing the application of LLMs, such as ChatGPT and Gemini, to automated software refactoring and comparing their performance against that of human experts.

Key contributions of this research include constructing a high-quality dataset of 180 real-world refactorings across 20 software projects to systematically quantify the capability of LLMs in both identifying refactoring opportunities and recommending solutions.

Identification of Refactoring Opportunities

The paper examines the limitations and capabilities of LLMs in detecting refactoring opportunities under varying levels of prompt specificity. With general prompts, ChatGPT and Gemini identified only a modest share of the 180 actual refactoring opportunities: 28 (15.6%) and 7 (3.9%), respectively. Notably, ChatGPT was markedly better than Gemini at recognizing extract method refactorings, underscoring that model performance varies with the refactoring type.
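
To make the extract method task concrete, below is a minimal, hypothetical Java example of the kind of opportunity the models were asked to find; it is illustrative only and not drawn from the paper's dataset.

```java
import java.util.List;

// Hypothetical example of an "extract method" opportunity; not from the paper's dataset.
class Order {
    private final List<String> items;
    Order(List<String> items) { this.items = items; }
    List<String> getItems() { return items; }
}

class OrderService {
    // Before refactoring: validation is inlined in the workflow method.
    void processOrderBefore(Order order) {
        if (order == null || order.getItems().isEmpty()) {
            throw new IllegalArgumentException("invalid order");
        }
        // ... shipping and billing logic ...
    }

    // After refactoring: the validation block is extracted into a named method.
    // This is the transformation an LLM would be expected to identify and recommend.
    void processOrderAfter(Order order) {
        validateOrder(order);
        // ... shipping and billing logic ...
    }

    private void validateOrder(Order order) {
        if (order == null || order.getItems().isEmpty()) {
            throw new IllegalArgumentException("invalid order");
        }
    }
}
```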

Prompts naming the explicit refactoring type substantially bolstered the identification success rate of both models: ChatGPT's rate rose to 52.2% and Gemini's to 21.1%, and explaining the expected refactoring subcategories while narrowing the search space raised ChatGPT's rate further, to 86.7%. This underscores the crucial role of detailed guidance in leveraging LLMs effectively. Despite these gains, inline refactorings remained particularly challenging for both models, suggesting room for improvement. The paper also reports a negative correlation between source code size and success rate, highlighting the importance of concise inputs for optimal LLM performance.
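
The sketch below illustrates this escalating prompt specificity as three tiers of prompt construction. The wording of each prompt is assumed for illustration and is not quoted from the paper.

```java
// Illustrative prompt tiers; the wording is assumed, not quoted from the paper.
final class RefactoringPrompts {
    // Tier 1: generic prompt (ChatGPT found 15.6% of opportunities).
    static String generic(String javaSource) {
        return "Identify refactoring opportunities in the following Java code:\n" + javaSource;
    }

    // Tier 2: explicit refactoring type (ChatGPT's success rate rose to 52.2%).
    static String withType(String javaSource, String refactoringType) {
        return "Identify " + refactoringType + " opportunities in the following Java code:\n" + javaSource;
    }

    // Tier 3: explained subcategory plus narrowed search space (86.7%).
    static String withSubcategory(String methodSource, String subcategoryExplanation) {
        return "The following method may benefit from a refactoring of this kind: "
                + subcategoryExplanation
                + "\nExamine only this method and point out the opportunity:\n" + methodSource;
    }
}
```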

Recommendation of Refactoring Solutions

When the LLMs were tasked with recommending refactoring solutions, a notable success emerged: over 60% of the solutions proposed by both models were judged comparable to, or even better than, those crafted by human experts. A critical concern arose around safety, however: roughly 7% of the solutions from each model were unsafe, predominantly because they altered program semantics or introduced syntax errors, emphasizing the need for rigorous validation before such recommendations are applied in practice.
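
The following hypothetical Java snippet illustrates how an LLM-suggested extract method can silently change behavior; the example is constructed for illustration and is not one of the paper's reported unsafe cases.

```java
// Hypothetical illustration of an unsafe LLM-suggested extract method; not from the paper.
class DiscountCalculator {
    // Original: the discount is applied to the local variable before returning.
    double checkoutOriginal(double total, boolean isMember) {
        if (isMember) {
            total = total * 0.9; // 10% member discount
        }
        return total;
    }

    // Unsafe refactoring: the extracted method mutates only its own parameter copy
    // and its return value is discarded, so the discount is silently lost --
    // the functionality has changed.
    double checkoutRefactored(double total, boolean isMember) {
        applyDiscount(total, isMember); // return value is discarded
        return total;                   // always the undiscounted total
    }

    private double applyDiscount(double total, boolean isMember) {
        if (isMember) {
            total = total * 0.9;
        }
        return total;
    }
}
```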

The Detect-and-Reapply Tactic

To mitigate the risks posed by unsafe recommendations, the authors propose the "RefactoringMirror" tactic: detecting the refactorings an LLM has applied and then reapplying them through established refactoring engines such as those built into IntelliJ IDEA. This method rectified all of the buggy refactorings previously introduced by the LLMs, though it remains constrained by the limitations of current refactoring detection and execution tools.
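
A minimal sketch of the detect-and-reapply loop appears below. The RefactoringDetector and RefactoringEngine interfaces are hypothetical stand-ins for a refactoring detection tool and an IDE refactoring engine; the paper's actual tooling and APIs differ.

```java
import java.util.List;

// Minimal sketch of the detect-and-reapply idea; all interfaces are hypothetical.
interface Refactoring { }

interface RefactoringDetector {
    // Infers which refactorings the LLM applied by comparing the two versions.
    List<Refactoring> detect(String originalSource, String llmRefactoredSource);
}

interface RefactoringEngine {
    // Reapplies a detected refactoring using a thoroughly tested,
    // behavior-preserving implementation.
    String apply(String source, Refactoring refactoring);
}

final class RefactoringMirror {
    private final RefactoringDetector detector;
    private final RefactoringEngine engine;

    RefactoringMirror(RefactoringDetector detector, RefactoringEngine engine) {
        this.detector = detector;
        this.engine = engine;
    }

    // Discard the LLM's edited text; keep only the refactorings it identified,
    // and replay them safely on the original source.
    String mirror(String originalSource, String llmRefactoredSource) {
        String result = originalSource;
        for (Refactoring r : detector.detect(originalSource, llmRefactoredSource)) {
            result = engine.apply(result, r);
        }
        return result;
    }
}
```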

Implications and Future Directions

This paper positions LLMs as promising aids in automated refactoring, highlighting what they can achieve given detailed prompts and post-processing. Nevertheless, it underscores that LLMs should supplement, rather than replace, human expertise, owing to persistent issues with reliability and safety. Future research should focus on improving LLMs' understanding of broader refactoring contexts and on reducing the incidence of semantic alterations.

Further exploration might integrate more sophisticated pre- and post-processing techniques and refined prompt engineering to better align LLMs with complex coding tasks. While advancing knowledge of how to leverage LLMs for code refactoring, this research also serves as a reminder of the nuanced interplay between artificial intelligence and the inherently creative practice of software engineering.
