Together We Go Further: LLMs and IDE Static Analysis for Extract Method Refactoring

Published 27 Jan 2024 in cs.SE (arXiv:2401.15298v2)

Abstract: Long methods that encapsulate multiple responsibilities within a single method are challenging to maintain. Choosing which statements to extract into new methods has been the target of many research tools. Despite steady improvements, these tools often fail to generate refactorings that align with developers' preferences and acceptance criteria. Given that LLMs have been trained on large code corpora, if we harness their familiarity with the way developers form functions, we could suggest refactorings that developers are likely to accept. In this paper, we advance the science and practice of refactoring by synergistically combining the insights of LLMs with the power of IDEs to perform Extract Method (EM). Our formative study on 1752 EM scenarios revealed that LLMs are very effective for giving expert suggestions, yet they are unreliable: up to 76.3% of the suggestions are hallucinations. We designed a novel approach that removes hallucinations from the candidates suggested by LLMs, then further enhances and ranks suggestions based on static analysis techniques from program slicing, and finally leverages the IDE to execute refactorings correctly. We implemented this approach in an IntelliJ IDEA plugin called EM-Assist. We empirically evaluated EM-Assist on a diverse corpus that replicates 1752 actual refactorings from open-source projects. We found that EM-Assist outperforms previous state-of-the-art tools: EM-Assist suggests the developer-performed refactoring in 53.4% of cases, improving over the recall rate of 39.4% for previous best-in-class tools. Furthermore, we conducted firehouse surveys with 16 industrial developers and suggested refactorings on their recent commits. 81.3% of them agreed with the recommendations provided by EM-Assist.

Summary

  • The paper introduces EM-Assist, which integrates LLM-generated suggestions with IDE static analysis to filter, enhance, and rank extract method refactoring proposals.
  • It employs a multi-stage workflow that filters out invalid or impractical suggestions, which account for up to 76.3% of raw LLM output, while boosting Recall@5 by up to 26 percentage points.
  • Experimental results demonstrate that EM-Assist outperforms state-of-the-art tools, achieving a Recall@5 of 53.4% on realistic datasets with 81.3% developer approval.

This paper introduces EM-Assist, an IntelliJ IDEA plugin designed to improve the "Extract Method" (EM) refactoring process by combining the pattern-recognition capabilities of LLMs with the precise static analysis features of Integrated Development Environments (IDEs). The core problem addressed is that while long methods are detrimental to code maintainability, existing automated tools often suggest refactorings based on software metrics that don't align with developers' practical preferences and acceptance criteria.

The authors hypothesize that LLMs, trained on vast codebases, can capture the nuances of how developers structure methods, leading to more acceptable suggestions. However, a formative study revealed that while LLMs (specifically GPT-3.5, GPT-4, PaLM) are prolific generators of EM suggestions, a significant portion (up to 76.3%) are "hallucinations" – either invalid (e.g., leading to compilation errors, ~57.4%) or not useful (e.g., extracting only one line or the entire method body, ~18.9%).

EM-Assist tackles this by implementing a multi-stage workflow:

  1. Generate Suggestions: Prompts an LLM (GPT-3.5 was found most effective) with few-shot examples to generate a diverse set of candidate code fragments to extract from the target method, iterating multiple times with varying "temperature" settings to maximize suggestion diversity (a minimal sketch of this loop follows the list).
  2. Remove Invalid Suggestions: Leverages the IDE's static analysis capabilities (specifically, the IntelliJ Platform's refactoring precondition checks) to filter out suggestions that would result in non-compilable code due to issues like scope violations, incorrect handling of return values or control flow.
  3. Remove Not Useful Suggestions: Filters out suggestions that are too large (e.g., >88% of the original method) or too small (e.g., single lines), as these typically offer little practical benefit for code renovation.
  4. Enhance Suggestions: Applies heuristics based on program slicing and control flow analysis to refine the remaining valid suggestions. For example, it might expand a suggestion to include a relevant variable declaration (reducing parameters) or shrink it to exclude an if condition (improving readability).
  5. Rank Suggestions: Prioritizes the enhanced suggestions using a scoring mechanism that combines "heat" (how often each individual line appears across all suggestions) and "popularity" (how often the exact suggestion recurs across LLM iterations); see the second sketch after this list.
  6. Apply Refactoring: Presents the top-ranked suggestions to the developer. Once a suggestion is chosen, EM-Assist uses the IDE's reliable EM refactoring engine to execute the code transformation safely.
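
The generation step (step 1) is essentially a sampling loop over the LLM. Below is a minimal Kotlin sketch of that loop; `LlmClient`, its `suggest` method, the prompt wording, and the temperature values are hypothetical stand-ins introduced for illustration, not the plugin's actual implementation or a real SDK.

```kotlin
// Minimal sketch of the suggestion-generation loop (step 1).
// LlmClient.suggest is a hypothetical stand-in for a real LLM API call;
// the prompt text and temperature values are illustrative, not the paper's exact settings.

data class Suggestion(val startLine: Int, val endLine: Int)

interface LlmClient {
    // Returns candidate line ranges to extract, parsed from the model's reply.
    fun suggest(prompt: String, temperature: Double): List<Suggestion>
}

fun generateCandidates(
    llm: LlmClient,
    methodSource: String,
    iterations: Int = 10,                               // more iterations -> more diverse candidates
    temperatures: List<Double> = listOf(0.8, 1.0, 1.2)  // higher temperature -> more varied output
): List<Suggestion> {
    val prompt = """
        Suggest line ranges of the following Java method that would make
        good Extract Method candidates.

        $methodSource
    """.trimIndent()

    val candidates = mutableListOf<Suggestion>()
    repeat(iterations) {
        for (t in temperatures) {
            candidates += llm.suggest(prompt, t)        // raw output still contains hallucinations
        }
    }
    return candidates
}
```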
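
A second sketch, reusing the `Suggestion` type above, illustrates the "not useful" filter (step 3) and the heat/popularity ranking (step 5). The size thresholds follow the summary above, but the per-line averaging of heat and the equal weighting of the two scores are simplifying assumptions rather than EM-Assist's exact formula; the invalid-suggestion filter (step 2) is omitted because it relies on the IDE's precondition checks over a live program representation.

```kotlin
// Sketch of the size-based "not useful" filter (step 3) and the
// heat/popularity ranking (step 5). Scoring weights are illustrative assumptions.

fun filterNotUseful(
    candidates: List<Suggestion>,
    methodLength: Int,
    maxFraction: Double = 0.88          // drop fragments covering more than ~88% of the method
): List<Suggestion> =
    candidates.filter { s ->
        val size = s.endLine - s.startLine + 1
        size > 1 && size.toDouble() / methodLength <= maxFraction
    }

fun rank(candidates: List<Suggestion>): List<Suggestion> {
    // Popularity: how often the exact same range was proposed across LLM iterations.
    val popularity = candidates.groupingBy { it }.eachCount()

    // Heat: how often each individual line appears in any candidate.
    val lineHeat = mutableMapOf<Int, Int>()
    for (s in candidates) {
        for (line in s.startLine..s.endLine) {
            lineHeat[line] = (lineHeat[line] ?: 0) + 1
        }
    }

    fun score(s: Suggestion): Double {
        val avgHeat = (s.startLine..s.endLine).sumOf { lineHeat[it] ?: 0 }.toDouble() /
                      (s.endLine - s.startLine + 1)
        val pop = popularity[s]?.toDouble() ?: 0.0
        return avgHeat + pop              // equal weighting is an assumption
    }

    return candidates.distinct().sortedByDescending { score(it) }
}
```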

The evaluation demonstrated:

  • LLM Performance: LLMs are effective generators but require significant filtering; only ~23.7% of raw suggestions were deemed useful. GPT-3.5 provided the best balance of useful suggestions vs. hallucinations.
  • Parameter Tuning: Higher LLM temperature (e.g., 1.2) and more iterations (e.g., 10) combined with EM-Assist's filtering/ranking yielded the best results (Recall@5 of 63% on a standard benchmark). The enhancement and ranking steps significantly boosted recall over raw LLM output (by up to 26 percentage points).
  • Comparison with State-of-the-Art: On a standard benchmark (the "Community Corpus", 122 examples), EM-Assist slightly outperformed previous static-analysis tools (JDeodorant, JExtract, SEMI, LiveRef) and ML-based tools (GEMS, REMS). Crucially, on a larger, more realistic dataset (the "Extended Corpus", 1752 actual developer-performed refactorings), EM-Assist showed a much larger improvement, achieving a Recall@5 of 53.4% compared to 39.4% for the best previous tool (JExtract), indicating better alignment with real-world practice (a small sketch of the Recall@k computation follows this list).
  • Developer Usefulness: Firehouse surveys with 16 industrial developers working on mature projects (IntelliJ IDEA CE, JetBrains Runtime) showed high acceptance: 81.3% found EM-Assist's suggestions useful and potentially applicable to their code.
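
For reference, Recall@5 counts a scenario as a hit when the developer-performed extraction appears among a tool's top five suggestions. The small Kotlin sketch below illustrates that computation; exact equality of line ranges is a simplifying assumption, since the evaluation may also credit close matches.

```kotlin
// Illustrative Recall@k: a scenario is a hit if the developer-performed
// extraction is among the tool's top-k ranked suggestions.
// Exact range equality is a simplifying assumption.

fun recallAtK(
    groundTruth: List<IntRange>,             // one developer-performed extraction per scenario
    rankedSuggestions: List<List<IntRange>>, // tool output per scenario, best first
    k: Int = 5
): Double {
    require(groundTruth.size == rankedSuggestions.size)
    val hits = groundTruth.indices.count { i ->
        groundTruth[i] in rankedSuggestions[i].take(k)
    }
    return hits.toDouble() / groundTruth.size
}

// Roughly 936 hits out of the 1752 Extended Corpus scenarios correspond to
// the reported Recall@5 of 53.4% (936 / 1752 ≈ 0.534).
```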

The paper concludes that synergistically combining LLMs for creative suggestion generation and IDE static analysis for validation and safe execution is a promising approach for refactoring tools. EM-Assist represents a step towards AI assistants that effectively augment developer workflows for code renovation, providing suggestions more aligned with human intuition while ensuring correctness.
