LLM-Driven Code Refactoring

Updated 25 January 2026
  • LLM-driven refactoring is an automated approach that uses large language models to perform both syntactic and semantic code transformations.
  • It employs techniques like zero-shot, few-shot, and chain-of-thought prompting combined with static analysis and human oversight to ensure reliable code improvements.
  • Empirical studies show LLMs can match or exceed developer performance, significantly boosting unit test pass rates and reducing code smells.

LLM-Driven Refactoring refers to the automated or semi-automated process of improving source code structure and quality using LLMs such as GPT-4, StarCoder2, and their derivatives. Unlike traditional rule-based tools, LLMs leverage vast corpora of code and natural language to realize both syntactic and semantic code transformations across a range of languages and domains. Recent empirical studies demonstrate that LLMs can match or exceed developer performance on certain refactoring tasks, provided sufficient controls are in place for verification, safety, and robustness.

1. Underlying Principles and Model Architectures

Modern LLMs used for code refactoring—such as StarCoder2-15B-instruct, GPT-4o, and ChatGPT—are transformer architectures trained on large, multi-language code repositories with additional instruction tuning (Cordeiro et al., 2024, Midolo et al., 19 Jan 2026). These models are adept at producing systematic, pattern-based transformations (e.g., renaming, extraction, logic untangling) by leveraging billions of code and comment examples. Their ability to generalize over code idioms and refactoring patterns enables support for both conventional and domain-specific improvements.

Specialization for Refactoring

Certain models, such as StarCoder2, are further specialized with explicit “refactor” instructions. This specialization increases their accuracy when prompted for specific, high-frequency transformation patterns (e.g., Magic Number removal) (Cordeiro et al., 2024). Instruction tuning across multiple programming languages enables robust cross-language code improvements (Tapader et al., 26 Nov 2025, Cordeiro et al., 2024).
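
For intuition, an instruction-tuning example for such a specialization might look like the record below. The field names and wording are illustrative assumptions, not the published training schema for StarCoder2.

```python
# Hypothetical instruction-tuning record for a refactoring-specialized
# code LLM. Field names and prompt wording are illustrative assumptions,
# not the actual StarCoder2 training schema.
record = {
    "instruction": "Refactor: replace the magic number with a named constant.",
    "input": "def area(r):\n    return 3.14159 * r * r\n",
    "output": "PI = 3.14159\n\ndef area(r):\n    return PI * r * r\n",
}
```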

2. Prompt Engineering Strategies

Prompt design critically influences LLM-driven refactoring outcomes.

| Prompting Strategy | Description | Typical Impact/Use |
| --- | --- | --- |
| Zero-Shot | Minimal instruction, no examples | Baseline; pass rates low for complex refactorings (Cordeiro et al., 2024, Liu et al., 2024) |
| Chain-of-Thought (CoT) | Instructions plus candidate refactorings and definitions | Increases test pass rates and smell reduction; promotes diversity (Cordeiro et al., 2024) |
| One-Shot/Few-Shot | Includes one or more human-crafted before/after examples | Significant gains in correctness; mitigates hallucinations (Shirafuji et al., 2023, Tapader et al., 26 Nov 2025) |
| Domain-Taxonomy Input | Injects structured migration or refactoring scenario taxonomies | Boosts precision and recall for domain-specific refactoring (e.g., Qiskit) (Suárez et al., 17 Jun 2025) |
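
A minimal sketch of how these strategies change prompt assembly is shown below; the prompt wording and function shape are illustrative, not the exact prompts used in the cited studies.

```python
def build_prompt(code, strategy="zero-shot", examples=None):
    """Assemble a refactoring prompt following the strategies above.

    `examples` is a list of (before, after) source pairs used for
    one-/few-shot prompting. Wording is illustrative only.
    """
    parts = ["Refactor the following code to remove the Magic Number smell."]
    if strategy == "cot":
        # Chain-of-thought: supply candidate refactorings and their
        # definitions, and ask the model to reason before editing.
        parts.append(
            "Candidate refactorings: Extract Constant, Extract Method. "
            "Explain which applies and why, then apply it."
        )
    if strategy in ("one-shot", "few-shot"):
        for before, after in examples or []:
            parts.append(f"Example before:\n{before}\nExample after:\n{after}")
    parts.append(f"Code:\n{code}\nRefactored code:")
    return "\n\n".join(parts)
```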

Key prompt engineering findings include:

  • Explicitly specifying the refactoring type in the prompt raises identification success from 15.6% to 86.7% (ChatGPT, Java; Liu et al., 2024).
  • Supplementing prompts with subcategories (e.g., a motivating example such as “code duplication → Extract Method”) and restricting context to the relevant classes/methods yield further improvements.
  • Sampling multiple generations (pass@5) increases functional correctness, e.g., unit test pass rates rise by 28.8% (Cordeiro et al., 2024); a selection-loop sketch follows this list.
  • CoT and one-shot styles increase code smell reduction and unit test pass rates by several percentage points (Cordeiro et al., 2024, Tapader et al., 26 Nov 2025).
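
Below is a minimal sketch of pass@k-style selection behind a unit-test gate, assuming a pytest-based project; the `generate` callable stands in for whatever model client is used.

```python
import pathlib
import subprocess
import tempfile

def first_passing_candidate(generate, prompt, tests_source, k=5):
    """Sample up to k refactoring candidates and keep the first whose
    unit tests pass -- a sketch of pass@5-style selection behind a test
    gate. `generate` is any callable mapping a prompt to a source
    string; the LLM client behind it is left abstract here.
    """
    for _ in range(k):
        candidate = generate(prompt)
        with tempfile.TemporaryDirectory() as tmp:
            root = pathlib.Path(tmp)
            (root / "module.py").write_text(candidate)
            (root / "test_module.py").write_text(tests_source)
            # Accept the candidate only if the existing tests still pass.
            result = subprocess.run(["pytest", "-q"], cwd=root,
                                    capture_output=True)
            if result.returncode == 0:
                return candidate
    return None  # no candidate passed; defer to human review
```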

3. Refactoring Capabilities, Metrics, and Empirical Results

LLM-driven refactoring spans a broad set of code transformations: Extract Method, Inline Method, Move Method, Rename Variable, Replace Magic Number, and many more. Empirical studies measure both correctness and quality improvements.
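
As a concrete illustration of one such transformation, the sketch below shows Extract Method applied to a small Python function (an invented example, not drawn from the cited studies):

```python
# Before: load_config mixes parsing and validation in one body.
def load_config_before(path):
    text = open(path).read()
    pairs = [line.split("=", 1) for line in text.splitlines() if "=" in line]
    config = {key.strip(): value.strip() for key, value in pairs}
    if "host" not in config or "port" not in config:
        raise ValueError("missing required keys")
    return config

# After Extract Method: the parsing block becomes a named helper,
# leaving load_config to express validation at one level of abstraction.
def parse_pairs(text):
    pairs = [line.split("=", 1) for line in text.splitlines() if "=" in line]
    return {key.strip(): value.strip() for key, value in pairs}

def load_config(path):
    config = parse_pairs(open(path).read())
    if "host" not in config or "port" not in config:
        raise ValueError("missing required keys")
    return config
```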

Key Empirical Metrics

Quantitative Performance

| Study | Language(s) | LLM | Core Result Type | Key Results |
| --- | --- | --- | --- | --- |
| (Cordeiro et al., 2024) | Java | StarCoder2 | Systematic smell reduction | SRR: 44.4% (LLM) vs. 24.3% (developers), Δ = +20.1 pp |
| (Liu et al., 2024) | Java | ChatGPT/Gemini | Opportunity identification | Type-aware prompt: 86.7% success (ChatGPT), ↑71.1 pp |
| (Shirafuji et al., 2023) | Python | GPT-3.5 | Complexity/length reduction | 17.35% lower CC, 25.84% fewer LOC, >95% functionally correct |
| (Tapader et al., 26 Nov 2025) | Multi-language | GPT-3.5 (fine-tuned) | Compilability, correctness | Java: 99.99% (10-shot), 94.78% compilability |
| (Xu et al., 18 Mar 2025) | Java | GPT+/multi-agent | Method-level (multi-agent RAG) | 82.8% compile+pass vs. 8.7% baseline |
| (Batole et al., 26 Mar 2025) | Java | GPT-4o/MM-assist | Move Method (IDE + embedding RAG) | Recall@1: 67% (LLM+IDE) vs. 21–40% (prior rules) |
| (Pomian et al., 2024) | Java/Kotlin | GPT-3.5 | Extract Method (IDE plugin) | Recall@5: 53.4% (LLM) vs. 39.4% (static analysis) |
| (Midolo et al., 19 Jan 2026) | Python | GPT-4o | Class-level refactoring | 84.4% test pass, reduced cognitive complexity, –2.4% read |
| (Oueslati et al., 5 Nov 2025) | Java | GPT-4o/StarCoder2 | Multi-agent (planning, tool calls) | 90% unit test pass, SRR 52.5%, QMOOD gain (reusability) |
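
For orientation, SRR is naturally read as the fraction of pre-existing smells eliminated by the refactoring. The sketch below works under that assumption; the cited studies may weight smells or use different detectors.

```python
def smell_removal_rate(smells_before, smells_after):
    """Fraction of pre-existing smells absent after refactoring.

    One plausible reading of SRR as reported above; the cited studies
    may weight smells or use different detectors.
    """
    if not smells_before:
        return 0.0
    return len(smells_before - smells_after) / len(smells_before)

# Example: 9 smells before; 5 removed, 1 newly introduced.
before = {f"smell_{i}" for i in range(9)}
after = (before - {f"smell_{i}" for i in range(5)}) | {"smell_new"}
print(f"SRR = {smell_removal_rate(before, after):.1%}")  # SRR = 55.6%
```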

Notably, LLMs consistently outperform or match developers on systematic, localized refactorings—Magic Number elimination, Long Statement splitting, Extract Method, and automated idiomatization (Cordeiro et al., 2024, Zhang et al., 2024). Conversely, they underperform on context-dependent, architectural, or multi-module refactorings where cross-class reasoning or domain logic is required (Cordeiro et al., 2024, Robredo et al., 9 Sep 2025, Oueslati et al., 5 Nov 2025). LLM hallucinations (unsafe or incorrect edits) occur in 6–8% of unfiltered outputs (Liu et al., 2024, Cordeiro et al., 2024).
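
A cheap static gate of the kind used to filter such hallucinations before any test run might look as follows; this is a sketch with invented checks, not the filters from the cited studies.

```python
import ast

def passes_static_gate(original, candidate):
    """Pre-test filter against hallucinated edits: the candidate must
    parse, and must not silently drop any public function defined in
    the original. A sketch only, and deliberately conservative -- it
    also rejects legitimate renames of public functions.
    """
    try:
        orig_tree = ast.parse(original)
        cand_tree = ast.parse(candidate)
    except SyntaxError:
        return False

    def public_functions(tree):
        return {node.name for node in ast.walk(tree)
                if isinstance(node, ast.FunctionDef)
                and not node.name.startswith("_")}

    return public_functions(orig_tree) <= public_functions(cand_tree)
```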

4. Multi-Agent and Hybrid Architectures

Multi-agent LLM systems (e.g., RefAgent, MANTRA) modularize refactoring into pipelined sub-tasks—planning, generation, compilation, testing, and self-reflection—handled by specialized agents coordinating via structured handoffs (Oueslati et al., 5 Nov 2025, Xu et al., 18 Mar 2025, Siddeeq et al., 24 Jun 2025). This decouples local transformations from global codebase validation and provides robust error recovery via feedback loops (e.g., up to 20 iterations of compile/test/fix cycles in RefAgent) (Oueslati et al., 5 Nov 2025).
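
A single-threaded sketch of this compile/test/fix loop is shown below, collapsing the specialized agents into one `generate` callable; that interface is an assumption, since the actual systems coordinate several agents and tool calls.

```python
import pathlib
import subprocess
import tempfile

def run_tests(candidate, tests_source):
    """Write the candidate next to its tests and run pytest."""
    with tempfile.TemporaryDirectory() as tmp:
        root = pathlib.Path(tmp)
        (root / "module.py").write_text(candidate)
        (root / "test_module.py").write_text(tests_source)
        result = subprocess.run(["pytest", "-q"], cwd=root,
                                capture_output=True, text=True)
        return result.returncode == 0, result.stdout

def refactor_with_repair(code, tests_source, generate, max_iters=20):
    """Sketch of the generate -> test -> self-reflect loop behind
    agentic pipelines such as RefAgent (which runs up to 20
    compile/test/fix iterations). `generate` abstracts over the
    planning/generation agents; its interface is an assumption here.
    """
    candidate = generate(f"Refactor this code:\n{code}")
    for _ in range(max_iters):
        ok, log = run_tests(candidate, tests_source)
        if ok:
            return candidate
        # Self-reflection step: feed the failure log back for a repair.
        candidate = generate(
            f"Your refactoring failed these tests:\n{log}\nFix it:\n{candidate}"
        )
    return None  # iteration budget exhausted; escalate to a human
```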

Hybrid Design Characteristics

5. Best Practices and Limitations

Principal Limitations

6. Domains, Extensions, and Future Directions

LLM-driven refactoring extends beyond general-purpose programming. Demonstrated applications include:

  • Unit Test Code Quality: LLM+DSL-driven frameworks (e.g., UTRefactor) achieve 89% test-smell reduction across six Java projects, far exceeding prior tools (Gao et al., 2024).
  • Quantum Code Migration: Taxonomy-guided LLM prompting supports complex migrations (e.g., Qiskit v0.45→0.46), with higher precision/recall for API change identification than non-taxonomy prompts (Suárez et al., 17 Jun 2025, Suárez et al., 8 Jun 2025).
  • Python Idiomatization: Hybrid LLM+analytic rule systems (RIdiom) outperform purely neural or rule-based baselines (>90% F1 on idiom transformation) (Zhang et al., 2024); an illustrative idiom rewrite follows this list.
  • Energy-Aware HPC Refactoring: Iterative, agentic LLM pipelines (LASSI-EE) produce ∼47% energy reduction on GPU scientific kernels (Dearing et al., 4 May 2025).
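
As noted above, a typical idiom rewrite of the kind such systems target looks like this (an invented example, not drawn from the RIdiom paper):

```python
# Non-idiomatic accumulation loop, a common target for idiomatization.
squares = []
for x in range(10):
    if x % 2 == 0:
        squares.append(x * x)

# The list-comprehension idiom the refactoring would produce.
squares = [x * x for x in range(10) if x % 2 == 0]
```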

Anticipated directions include:

7. Summary

LLM-driven refactoring is a rapidly maturing paradigm for automated code restructuring that leverages the pattern-recognition and generative capabilities of modern LLMs. When paired with context-aware prompts, hybrid retrieval/static-analysis systems, agent-based toolchains, and rigorous verification, LLMs can achieve or surpass developer performance on systematic, maintainability-driven transformations. However, safe deployment at scale requires verification scaffolding, context management, and human oversight, especially for complex, architecture-level or high-stakes refactoring scenarios. Current research continues to expand the domain reach and capabilities of LLM-driven refactoring, with ongoing work in multi-language support, explainable recommendations, and integration with established developer workflows (Cordeiro et al., 2024, Liu et al., 2024, Tapader et al., 26 Nov 2025, Xu et al., 18 Mar 2025, Oueslati et al., 5 Nov 2025, Midolo et al., 19 Jan 2026, Pomian et al., 2024, Gao et al., 2024, Zhang et al., 2024, Suárez et al., 17 Jun 2025, Shirafuji et al., 2023, Khairnar et al., 12 Aug 2025, Dearing et al., 4 May 2025, Batole et al., 26 Mar 2025, Siddeeq et al., 24 Jun 2025, Robredo et al., 9 Sep 2025).
