Together We Go Further: LLMs and IDE Static Analysis for Extract Method Refactoring

Published 27 Jan 2024 in cs.SE (arXiv:2401.15298v2)

Abstract: Long methods that encapsulate multiple responsibilities within a single method are challenging to maintain. Choosing which statements to extract into new methods has been the target of many research tools. Despite steady improvements, these tools often fail to generate refactorings that align with developers' preferences and acceptance criteria. Given that LLMs have been trained on large code corpora, if we harness their familiarity with the way developers form functions, we could suggest refactorings that developers are likely to accept. In this paper, we advance the science and practice of refactoring by synergistically combining the insights of LLMs with the power of IDEs to perform Extract Method (EM). Our formative study on 1752 EM scenarios revealed that LLMs are very effective for giving expert suggestions, yet they are unreliable: up to 76.3% of the suggestions are hallucinations. We designed a novel approach that removes hallucinations from the candidates suggested by LLMs, then further enhances and ranks suggestions based on static analysis techniques from program slicing, and finally leverages the IDE to execute refactorings correctly. We implemented this approach in an IntelliJ IDEA plugin called EM-Assist. We empirically evaluated EM-Assist on a diverse corpus that replicates 1752 actual refactorings from open-source projects. We found that EM-Assist outperforms previous state-of-the-art tools: EM-Assist suggests the developer-performed refactoring in 53.4% of cases, improving over the recall rate of 39.4% for previous best-in-class tools. Furthermore, we conducted firehouse surveys with 16 industrial developers and suggested refactorings on their recent commits. 81.3% of them agreed with the recommendations provided by EM-Assist.

Summary

  • The paper introduces EM-Assist, which integrates LLM-generated suggestions with IDE static analysis to filter, enhance, and rank extract method refactoring proposals.
  • It employs a multi-stage workflow that filters out invalid or impractical suggestions, which account for up to 76.3% of raw LLM output, while boosting Recall@5 by up to 26 percentage points.
  • Experimental results demonstrate that EM-Assist outperforms state-of-the-art tools, achieving a Recall@5 of 53.4% on realistic datasets with 81.3% developer approval.

This paper introduces EM-Assist, an IntelliJ IDEA plugin designed to improve the "Extract Method" (EM) refactoring process by combining the pattern-recognition capabilities of LLMs with the precise static analysis features of Integrated Development Environments (IDEs). The core problem addressed is that while long methods are detrimental to code maintainability, existing automated tools often suggest refactorings based on software metrics that don't align with developers' practical preferences and acceptance criteria.

The authors hypothesize that LLMs, trained on vast codebases, can capture the nuances of how developers structure methods, leading to more acceptable suggestions. However, a formative study revealed that while LLMs (specifically GPT-3.5, GPT-4, PaLM) are prolific generators of EM suggestions, a significant portion (up to 76.3%) are "hallucinations" – either invalid (e.g., leading to compilation errors, ~57.4%) or not useful (e.g., extracting only one line or the entire method body, ~18.9%).

EM-Assist tackles this by implementing a multi-stage workflow:

  1. Generate Suggestions: Prompts an LLM (GPT-3.5 was found most effective) with few-shot examples to generate a diverse set of candidate code fragments to extract from the target method, iterating multiple times with varying "temperature" settings to maximize suggestion diversity (a minimal sketch of this loop follows the list).
  2. Remove Invalid Suggestions: Leverages the IDE's static analysis capabilities (specifically, the IntelliJ Platform's refactoring precondition checks) to filter out suggestions that would result in non-compilable code due to issues like scope violations, incorrect handling of return values or control flow.
  3. Remove Not Useful Suggestions: Filters out suggestions that are too large (e.g., >88% of the original method) or too small (e.g., single lines), as these typically offer little practical benefit for code renovation.
  4. Enhance Suggestions: Applies heuristics based on program slicing and control flow analysis to refine the remaining valid suggestions. For example, it might expand a suggestion to include a relevant variable declaration (reducing parameters) or shrink it to exclude an if condition (improving readability).
  5. Rank Suggestions: Prioritizes the enhanced suggestions using a scoring mechanism that combines "heat" (how often each individual line appears across all suggestions) and "popularity" (how often the exact suggestion recurs across LLM iterations); see the second sketch after this list.
  6. Apply Refactoring: Presents the top-ranked suggestions to the developer. Once a suggestion is chosen, EM-Assist uses the IDE's reliable EM refactoring engine to execute the code transformation safely.
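
The generation step (step 1) is essentially a sampling loop over the LLM. Below is a minimal Kotlin sketch of that loop; `LlmClient`, its `suggest` method, the prompt wording, and the temperature values are hypothetical stand-ins introduced for illustration, not the plugin's actual implementation or a real SDK.

```kotlin
// Minimal sketch of the suggestion-generation loop (step 1).
// LlmClient.suggest is a hypothetical stand-in for a real LLM API call;
// the prompt text and temperature values are illustrative, not the paper's exact settings.

data class Suggestion(val startLine: Int, val endLine: Int)

interface LlmClient {
    // Returns candidate line ranges to extract, parsed from the model's reply.
    fun suggest(prompt: String, temperature: Double): List<Suggestion>
}

fun generateCandidates(
    llm: LlmClient,
    methodSource: String,
    iterations: Int = 10,                               // more iterations -> more diverse candidates
    temperatures: List<Double> = listOf(0.8, 1.0, 1.2)  // higher temperature -> more varied output
): List<Suggestion> {
    val prompt = """
        Suggest line ranges of the following Java method that would make
        good Extract Method candidates.

        $methodSource
    """.trimIndent()

    val candidates = mutableListOf<Suggestion>()
    repeat(iterations) {
        for (t in temperatures) {
            candidates += llm.suggest(prompt, t)        // raw output still contains hallucinations
        }
    }
    return candidates
}
```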
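
A second sketch, reusing the `Suggestion` type above, illustrates the "not useful" filter (step 3) and the heat/popularity ranking (step 5). The size thresholds follow the summary above, but the per-line averaging of heat and the equal weighting of the two scores are simplifying assumptions rather than EM-Assist's exact formula; the invalid-suggestion filter (step 2) is omitted because it relies on the IDE's precondition checks over a live program representation.

```kotlin
// Sketch of the size-based "not useful" filter (step 3) and the
// heat/popularity ranking (step 5). Scoring weights are illustrative assumptions.

fun filterNotUseful(
    candidates: List<Suggestion>,
    methodLength: Int,
    maxFraction: Double = 0.88          // drop fragments covering more than ~88% of the method
): List<Suggestion> =
    candidates.filter { s ->
        val size = s.endLine - s.startLine + 1
        size > 1 && size.toDouble() / methodLength <= maxFraction
    }

fun rank(candidates: List<Suggestion>): List<Suggestion> {
    // Popularity: how often the exact same range was proposed across LLM iterations.
    val popularity = candidates.groupingBy { it }.eachCount()

    // Heat: how often each individual line appears in any candidate.
    val lineHeat = mutableMapOf<Int, Int>()
    for (s in candidates) {
        for (line in s.startLine..s.endLine) {
            lineHeat[line] = (lineHeat[line] ?: 0) + 1
        }
    }

    fun score(s: Suggestion): Double {
        val avgHeat = (s.startLine..s.endLine).sumOf { lineHeat[it] ?: 0 }.toDouble() /
                      (s.endLine - s.startLine + 1)
        val pop = popularity[s]?.toDouble() ?: 0.0
        return avgHeat + pop              // equal weighting is an assumption
    }

    return candidates.distinct().sortedByDescending { score(it) }
}
```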

The evaluation demonstrated:

  • LLM Performance: LLMs are effective generators but require significant filtering; only ~23.7% of raw suggestions were deemed useful. GPT-3.5 provided the best balance of useful suggestions vs. hallucinations.
  • Parameter Tuning: Higher LLM temperature (e.g., 1.2) and more iterations (e.g., 10) combined with EM-Assist's filtering/ranking yielded the best results (Recall@5 of 63% on a standard benchmark). The enhancement and ranking steps significantly boosted recall over raw LLM output (by up to 26 percentage points).
  • Comparison with State-of-the-Art: On a standard benchmark (the "Community Corpus", 122 examples), EM-Assist slightly outperformed previous static-analysis tools (JDeodorant, JExtract, SEMI, LiveRef) and ML-based tools (GEMS, REMS). Crucially, on a larger, more realistic dataset (the "Extended Corpus", 1752 actual developer-performed refactorings), EM-Assist showed a much larger improvement, achieving a Recall@5 of 53.4% compared to 39.4% for the best previous tool (JExtract), indicating better alignment with real-world practice (a small sketch of the Recall@k computation follows this list).
  • Developer Usefulness: Firehouse surveys with 16 industrial developers working on mature projects (IntelliJ IDEA CE, JetBrains Runtime) showed high acceptance: 81.3% found EM-Assist's suggestions useful and potentially applicable to their code.
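
For reference, Recall@5 counts a scenario as a hit when the developer-performed extraction appears among a tool's top five suggestions. The small Kotlin sketch below illustrates that computation; exact equality of line ranges is a simplifying assumption, since the evaluation may also credit close matches.

```kotlin
// Illustrative Recall@k: a scenario is a hit if the developer-performed
// extraction is among the tool's top-k ranked suggestions.
// Exact range equality is a simplifying assumption.

fun recallAtK(
    groundTruth: List<IntRange>,             // one developer-performed extraction per scenario
    rankedSuggestions: List<List<IntRange>>, // tool output per scenario, best first
    k: Int = 5
): Double {
    require(groundTruth.size == rankedSuggestions.size)
    val hits = groundTruth.indices.count { i ->
        groundTruth[i] in rankedSuggestions[i].take(k)
    }
    return hits.toDouble() / groundTruth.size
}

// Roughly 936 hits out of the 1752 Extended Corpus scenarios correspond to
// the reported Recall@5 of 53.4% (936 / 1752 ≈ 0.534).
```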

The paper concludes that synergistically combining LLMs for creative suggestion generation and IDE static analysis for validation and safe execution is a promising approach for refactoring tools. EM-Assist represents a step towards AI assistants that effectively augment developer workflows for code renovation, providing suggestions more aligned with human intuition while ensuring correctness.
