- The paper demonstrates that prompts naming the expected refactoring type can raise ChatGPT's identification rate for refactoring opportunities from 15.6% to 52.2% (and Gemini's from 3.9% to 21.1%), with ChatGPT particularly strong on extract method cases.
- The paper shows that over 60% of the refactoring solutions suggested by LLMs match or exceed human-crafted approaches, though about 7% are unsafe due to semantic errors.
- The paper introduces the RefactoringMirror tactic, which detects the refactorings implied by LLM outputs and reapplies them through established refactoring engines, correcting the unsafe edits the LLMs had introduced.
An Empirical Study on Evaluating the Potential of LLMs in Automated Software Refactoring
Refactoring, a cornerstone of maintaining and enhancing software quality, traditionally involves improving readability, maintainability, and reusability without altering functionality. Numerous tools and methodologies have emerged to support this task, yet they often require significant developer input, particularly in identifying opportunities and applying solutions. This paper presents an empirical study analyzing the application of LLMs, such as ChatGPT and Gemini, to automated software refactoring and comparing their performance against human experts.
Key contributions of this research include constructing a high-quality dataset of 180 real-world refactorings across 20 software projects to systematically quantify the capability of LLMs in both identifying refactoring opportunities and recommending solutions.
Identification of Refactoring Opportunities
The paper elucidates the limitations and capabilities of LLMs in detecting refactoring opportunities through varying levels of prompt specificity. With general prompts, GPT and Gemini identified only a modest share of the actual refactoring opportunities (15.6% and 3.9%, respectively). Notably, GPT proved stronger than Gemini at recognizing extract method refactorings, underscoring that performance varies with the refactoring type.
Introducing prompts with explicit refactoring types significantly bolstered the identification success rate for both models—GPT's rate rose to 52.2% while Gemini's increased to 21.1%. This indicates the crucial role of detailed guidance in leveraging LLMs effectively. Despite these enhancements, inline refactorings remained particularly challenging for both models, suggesting room for algorithmic improvements. Furthermore, the paper reports a negative correlation between the size of source code and success rates, underscoring the importance of concise inputs for optimal LLM performance.
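To make the two prompt levels concrete, the sketch below contrasts a general prompt with one that names the expected refactoring type. The wording is illustrative only and is not reproduced from the paper's prompt templates.

```java
// Hypothetical sketch of the two prompt levels discussed above; the exact
// wording used in the paper is an assumption, not a quotation.
public class RefactoringPrompts {

    // A general prompt that leaves the refactoring type open.
    static String generalPrompt(String javaSource) {
        return "Review the following Java code and list any refactoring "
             + "opportunities you can identify:\n\n" + javaSource;
    }

    // An explicit prompt that names the expected refactoring type,
    // e.g. "extract method", which raised identification rates substantially.
    static String explicitPrompt(String javaSource, String refactoringType) {
        return "The following Java code contains an opportunity for an \""
             + refactoringType + "\" refactoring. Identify where it should "
             + "be applied and explain why:\n\n" + javaSource;
    }

    public static void main(String[] args) {
        String snippet = "void report() { /* long method body */ }";
        System.out.println(generalPrompt(snippet));
        System.out.println(explicitPrompt(snippet, "extract method"));
    }
}
```

In the paper's terms, moving from the first template to the second is what lifted GPT's identification rate from 15.6% to 52.2%.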
Recommendation of Refactoring Solutions
When LLMs were tasked with recommending refactoring solutions, a notable success emerged: over 60% of the solutions proposed by both LLMs were deemed comparable to or better than human-crafted suggestions. Safety, however, remained a critical concern. Approximately 7% of solutions from both models were unsafe, predominantly due to semantic alterations or syntax errors, emphasizing the need for rigorous validation before LLM-suggested refactorings are applied in practice.
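As a hedged illustration (not an example drawn from the paper's dataset), the following Java snippet shows how an extract-method suggestion can introduce such a semantic alteration: the extracted helper drops a null guard, so the refactored code throws where the original did not.

```java
import java.util.List;

// Hypothetical illustration of an unsafe extract-method refactoring.
class Inventory {
    private final int limit = 100;

    // Original code: the null check short-circuits before size() is called.
    void pruneOriginal(List<String> items) {
        if (items != null && items.size() > limit) {
            items.subList(limit, items.size()).clear();
        }
    }

    // Unsafe "refactored" version: the extracted method omits the null
    // guard, so pruneRefactored(null) throws a NullPointerException where
    // the original returned quietly -- a semantic alteration.
    void pruneRefactored(List<String> items) {
        if (exceedsLimit(items)) {
            items.subList(limit, items.size()).clear();
        }
    }

    private boolean exceedsLimit(List<String> items) {
        return items.size() > limit;
    }
}
```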
The Detect-and-Reapply Tactic
To mitigate the risks posed by unsafe recommendations, the authors propose the "RefactoringMirror" tactic: detect the refactorings implied by an LLM's suggestion and then reapply them through established refactoring engines such as IntelliJ IDEA's. This method rectified all buggy refactorings previously introduced by the LLMs, though it remains constrained by the limitations of current refactoring detection and execution tools.
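A minimal sketch of this detect-and-reapply flow is given below. RefactoringDetector, RefactoringEngine, and Refactoring are hypothetical stand-ins for a refactoring-detection tool and an IDE's refactoring engine; they are not APIs from the paper or from IntelliJ IDEA.

```java
import java.util.List;

// Hypothetical interfaces standing in for a detection tool and an IDE engine.
interface RefactoringDetector {
    List<Refactoring> detect(String originalSource, String llmSuggestedSource);
}

interface RefactoringEngine {
    String apply(String source, Refactoring refactoring);
}

record Refactoring(String type, String location) {}

class RefactoringMirror {
    private final RefactoringDetector detector;
    private final RefactoringEngine engine;

    RefactoringMirror(RefactoringDetector detector, RefactoringEngine engine) {
        this.detector = detector;
        this.engine = engine;
    }

    // Instead of trusting the LLM's edited code directly, recover the
    // refactorings it implies and replay them with a tested engine, whose
    // behavior-preservation checks guard against unsafe edits.
    String mirror(String originalSource, String llmSuggestedSource) {
        String result = originalSource;
        for (Refactoring r : detector.detect(originalSource, llmSuggestedSource)) {
            result = engine.apply(result, r);
        }
        return result;
    }
}
```

The key design choice is to treat the LLM's edited code as a specification of which refactorings to perform, rather than as the final result, so that a tested engine carries out the actual transformation.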
Implications and Future Directions
This paper positions LLMs as promising aids in automated refactoring, highlighting their capabilities when given detailed prompts and coupled with post-processing techniques. Nevertheless, it underscores that LLMs should supplement, rather than replace, human expertise, given ongoing issues with reliability and safety. Future research should focus on improving LLM design to deepen their understanding of broader refactoring contexts and to reduce the incidence of semantic alterations.
Further exploration might integrate sophisticated pre- and post-processing techniques and refined prompt engineering to better align LLMs with complex coding tasks. This research, while advancing knowledge in leveraging LLMs for code refactoring, also serves as a reminder of the nuanced interaction between artificial intelligence and the inherently creative field of software engineering.