LLM-Driven Code Refactoring
- LLM-driven refactoring is an automated approach that uses large language models to perform both syntactic and semantic code transformations.
- It employs techniques like zero-shot, few-shot, and chain-of-thought prompting combined with static analysis and human oversight to ensure reliable code improvements.
- Empirical studies show LLMs can match or exceed developer performance on certain refactoring tasks, achieving high unit test pass rates and substantially reducing code smells.
LLM-Driven Refactoring refers to the automated or semi-automated process of improving source code structure and quality using LLMs such as GPT-4, StarCoder2, and their derivatives. Unlike traditional rule-based tools, LLMs leverage vast corpora of code and natural language to realize both syntactic and semantic code transformations across a range of languages and domains. Recent empirical studies demonstrate that LLMs can match or exceed developer performance on certain refactoring tasks, provided sufficient controls are in place for verification, safety, and robustness.
1. Underlying Principles and Model Architectures
Modern LLMs used for code refactoring—such as StarCoder2-15B-instruct, GPT-4o, and ChatGPT—are transformer architectures trained on large, multi-language code repositories with additional instruction tuning (Cordeiro et al., 2024, Midolo et al., 19 Jan 2026). These models are adept at producing systematic, pattern-based transformations (e.g., renaming, extraction, logic untangling) by leveraging billions of code and comment examples. Their ability to generalize over code idioms and refactoring patterns enables support for both conventional and domain-specific improvements.
Specialization for Refactoring
Certain models, such as StarCoder2, are further specialized with explicit “refactor” instructions. This specialization increases their accuracy when prompted for specific, high-frequency transformation patterns (e.g., Magic Number removal) (Cordeiro et al., 2024). Instruction tuning across multiple programming languages enables robust cross-language code improvements (Tapader et al., 26 Nov 2025, Cordeiro et al., 2024).
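As a concrete illustration, the hypothetical before/after pair below shows a Magic Number removal of the kind these instruction-tuned models are prompted to produce; the function and constant names are illustrative, not drawn from the cited studies.

```python
# Before: 0.85 is a "magic number" whose meaning is implicit.
def is_similar(score):
    return score > 0.85

# After: the refactoring introduces a named constant, making intent
# explicit (hypothetical example; names are not from the cited studies).
SIMILARITY_THRESHOLD = 0.85

def is_similar(score):
    return score > SIMILARITY_THRESHOLD
```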
2. Prompt Engineering Strategies
Prompt design critically influences LLM-driven refactoring outcomes.
| Prompting Strategy | Description | Typical Impact/Use |
|---|---|---|
| Zero-Shot | Minimal instruction, no examples | Baseline; pass rates low for complex refactorings (Cordeiro et al., 2024, Liu et al., 2024) |
| Chain-of-Thought (CoT) | Instructions plus candidate refactorings and definitions | Increases test pass rates and smell reduction; promotes diversity (Cordeiro et al., 2024) |
| One-Shot/Few-Shot | Includes one or more human-crafted before/after examples | Significant gains in correctness; mitigates hallucinations (Shirafuji et al., 2023, Tapader et al., 26 Nov 2025) |
| Domain-Taxonomy Input | Injects structured migration or refactoring scenario taxonomies | Boosts precision and recall for domain-specific refactoring (e.g., Qiskit) (Suárez et al., 17 Jun 2025) |
Key prompt engineering findings include (a prompt-assembly sketch follows the list):
- Explicitly specifying the refactoring type in the prompt raises identification success from 15.6% to 86.7% (ChatGPT on Java; Liu et al., 2024).
- Supplementing the prompt with a subcategory motivation (e.g., “code duplication → extract method”) and restricting the context to the relevant classes and methods yield further improvements.
- Sampling multiple generations (pass@5) increases functional correctness for LLM outputs, e.g., unit test pass rates up by 28.8% (Cordeiro et al., 2024).
- CoT and one-shot styles increase code smell reduction and unit test pass rates by several percentage points (Cordeiro et al., 2024, Tapader et al., 26 Nov 2025).
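A minimal sketch of how these findings might be combined when assembling a prompt: an explicit refactoring type, a subcategory motivation, a one-shot before/after example, and a restricted code context. The template and helper names here are assumptions for illustration, not an interface from the cited papers.

```python
# One human-crafted before/after example for one-shot prompting
# (illustrative; not taken from the cited studies).
ONE_SHOT_EXAMPLE = """\
# Before
def area(r):
    return 3.14159265 * r * r
# After
import math
def area(r):
    return math.pi * r * r
"""

def build_prompt(refactoring_type: str, motivation: str, code_context: str) -> str:
    """Combine an explicit refactoring type, a subcategory motivation, and
    one before/after example with only the relevant code context."""
    return (
        f"Refactoring type: {refactoring_type}\n"
        f"Motivating subcategory: {motivation}\n"
        f"Example:\n{ONE_SHOT_EXAMPLE}\n"
        f"Refactor the following code accordingly:\n{code_context}"
    )

# In practice, several completions (e.g., pass@5) would be sampled from
# this prompt and filtered against the unit test suite.
prompt = build_prompt(
    refactoring_type="Replace Magic Number with Named Constant",
    motivation="magic number -> symbolic constant",
    code_context="def discount(price):\n    return price * 0.85\n",
)
```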
3. Refactoring Capabilities, Metrics, and Empirical Results
LLM-driven refactoring spans a broad set of code transformations: Extract Method, Inline Method, Move Method, Rename Variable, Replace Magic Number, and many more. Empirical studies measure both correctness and quality improvements.
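For instance, an Extract Method transformation pulls a cohesive block out of a longer routine into its own named function, as in the hypothetical Python pair below (the code is illustrative, not from the cited benchmarks).

```python
# Before: validation logic is tangled into the main routine.
def process_order(order):
    if order["qty"] <= 0 or order["price"] < 0:
        raise ValueError("invalid order")
    return order["qty"] * order["price"]

# After Extract Method: the validation block becomes its own function.
def validate_order(order):
    if order["qty"] <= 0 or order["price"] < 0:
        raise ValueError("invalid order")

def process_order(order):
    validate_order(order)
    return order["qty"] * order["price"]
```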
Key Empirical Metrics
- Unit Test Pass Rate (TPR, pass@k): Fraction of generations that pass a test suite; surrogate for functional preservation.
- Code Smell Reduction Rate (SRR): Percentage drop in code smell count, $\mathrm{SRR} = \frac{S_{\text{pre}} - S_{\text{post}}}{S_{\text{pre}}} \times 100\%$, where $S_{\text{pre}}$ is the pre-refactoring smell count and $S_{\text{post}}$ the post-refactoring count (Cordeiro et al., 2024, Oueslati et al., 5 Nov 2025); see the computation sketch after this list.
- Compilability: Proportion of refactored code that compiles without error (Tapader et al., 26 Nov 2025, Oueslati et al., 5 Nov 2025).
- Cyclomatic and Cognitive Complexity: Standard structural complexity scores (e.g., McCabe’s cyclomatic complexity) (Shirafuji et al., 2023, Cordeiro et al., 2024, Tapader et al., 26 Nov 2025).
- Edit Distance/Similarity: Structural distance (e.g., Levenshtein) and similarity after transformation (Tapader et al., 26 Nov 2025).
- Tool-Based Quality Metrics: Pylint, Flake8, SonarCloud, HLint, DesigniteJava, and others for code standards and maintainability (Midolo et al., 19 Jan 2026, Oueslati et al., 5 Nov 2025, Zhang et al., 2024).
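The sketch below shows how two of these metrics can be computed: SRR from before/after smell counts (per the formula above), and pass@k via the standard unbiased combinatorial estimator over n generations of which c pass. The function names are ours; the combinatorial estimator is the commonly used definition, though individual studies may define pass@k differently.

```python
from math import comb

def smell_reduction_rate(smells_before: int, smells_after: int) -> float:
    """SRR = (S_pre - S_post) / S_pre * 100, as defined above."""
    return (smells_before - smells_after) / smells_before * 100.0

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn from n generations of which c pass the tests, is correct."""
    if n - c < k:
        return 1.0  # every k-subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(smell_reduction_rate(37, 20))  # 45.94...: ~46% of smells removed
print(pass_at_k(10, 4, 5))           # 0.976...: pass@5 with 4/10 passing
```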
Quantitative Performance
| Study | Language(s) | LLM | Core Result Type | Key Results |
|---|---|---|---|---|
| (Cordeiro et al., 2024) | Java | StarCoder2 | Systematic smell reduction | SRR: 44.4% (LLM) vs. 24.3% (developers), Δ=+20.1pp |
| (Liu et al., 2024) | Java | ChatGPT/Gemini | Opportunity identification | Type-aware prompt: 86.7% success (ChatGPT), ↑71.1pp |
| (Shirafuji et al., 2023) | Python | GPT-3.5 | Complexity/length reduction | 17.35% lower CC, 25.84% fewer LOC, >95% functionally correct |
| (Tapader et al., 26 Nov 2025) | Multilang | GPT-3.5-ft | Compilability, correctness | Java: 99.99% (10-shot), 94.78% compilability |
| (Xu et al., 18 Mar 2025) | Java | GPT+/multiagent | Method-level (multiagent RAG) | 82.8% compile+pass vs. 8.7% baseline |
| (Batole et al., 26 Mar 2025) | Java | GPT-4o/MM-assist | Move Method (IDE+embedding RAG) | Recall@1: 67% (LLM+IDE) vs. 21–40% (prior rules) |
| (Pomian et al., 2024) | Java/Kotlin | GPT-3.5 | Extract Method (IDE plugin) | Recall@5: 53.4% (LLM) vs. 39.4% (static-analysis) |
| (Midolo et al., 19 Jan 2026) | Python | GPT-4o | Class-level refactoring | 84.4% test pass, reduced cognitive complexity, –2.4% readability |
| (Oueslati et al., 5 Nov 2025) | Java | GPT-4o/StarCoder2 | Multi-agent (planning, tool-calls) | 90% unit test pass, SRR 52.5%, QMOOD gain (reusability) |
Notably, LLMs consistently outperform or match developers on systematic, localized refactorings—Magic Number elimination, Long Statement splitting, Extract Method, and automated idiomatization (Cordeiro et al., 2024, Zhang et al., 2024). Conversely, they underperform on context-dependent, architectural, or multi-module refactorings where cross-class reasoning or domain logic is required (Cordeiro et al., 2024, Robredo et al., 9 Sep 2025, Oueslati et al., 5 Nov 2025). LLM hallucinations (unsafe or incorrect edits) occur in 6–8% of unfiltered outputs (Liu et al., 2024, Cordeiro et al., 2024).
4. Multi-Agent and Hybrid Architectures
Multi-agent LLM systems (e.g., RefAgent, MANTRA) modularize refactoring into pipelined sub-tasks—planning, generation, compilation, testing, and self-reflection—handled by specialized agents coordinating via structured handoffs (Oueslati et al., 5 Nov 2025, Xu et al., 18 Mar 2025, Siddeeq et al., 24 Jun 2025). This decouples local transformations from global codebase validation and provides robust error recovery via feedback loops (e.g., up to 20 iterations of compile/test/fix cycles in RefAgent) (Oueslati et al., 5 Nov 2025).
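A minimal sketch of such a compile/test/fix loop follows, with hypothetical `generate`, `compiles`, and `run_tests` callables standing in for the LLM call and the build/test toolchain; RefAgent's actual agent interfaces are not reproduced here.

```python
MAX_ITERATIONS = 20  # e.g., RefAgent allows up to 20 compile/test/fix cycles

def refactor_with_reflection(code, generate, compiles, run_tests):
    """Iteratively regenerate a refactoring, feeding compile/test errors
    back into the next prompt until a candidate passes or the budget is
    exhausted. `generate(code, feedback)` is a stand-in for the LLM call."""
    feedback = None
    for _ in range(MAX_ITERATIONS):
        candidate = generate(code, feedback)
        ok, error = compiles(candidate)
        if not ok:
            feedback = f"Compilation failed:\n{error}"
            continue
        ok, error = run_tests(candidate)
        if ok:
            return candidate  # functionally validated refactoring
        feedback = f"Tests failed:\n{error}"
    return None  # budget exhausted; route to human review
```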
Hybrid Design Characteristics
- Contextual Retrieval-Augmented Generation (RAG): Incorporates database or embedding search to retrieve real-world, contextually similar refactoring examples for in-context learning (Xu et al., 18 Mar 2025, Batole et al., 26 Mar 2025); a retrieval sketch follows this list.
- Static Analysis and IDE Integration: Executes static checks (e.g., IntelliJ refactoring preconditions) to filter hallucinations and ensure mechanical feasibility (Batole et al., 26 Mar 2025, Pomian et al., 2024).
- Self-Reflection Loops: Iterative re-prompting on compile/test errors raises functional correctness over naive LLM output by 40–65 percentage points (Oueslati et al., 5 Nov 2025).
- Human-in-the-Loop Controls: Teams are advised to combine LLM suggestions with human review or override, especially for high-risk or architectural modifications (Cordeiro et al., 2024, Oueslati et al., 5 Nov 2025, Robredo et al., 9 Sep 2025).
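Under simple assumptions, the RAG step referenced above can be sketched as follows: past before/after refactorings are embedded, and the nearest neighbors of the target method are spliced into the prompt as in-context examples. The `embed` placeholder and the corpus format are ours, not the pipelines of MANTRA or MM-assist.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; a real system would call an embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

def retrieve_examples(target_code: str, corpus: list, k: int = 3) -> list:
    """Return the k past refactorings (dicts with 'before'/'after' keys)
    most similar to the target method by cosine similarity."""
    q = embed(target_code)

    def cosine(example):
        v = embed(example["before"])
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))

    return sorted(corpus, key=cosine, reverse=True)[:k]

# The retrieved before/after pairs are then inlined into the generation
# agent's prompt as in-context examples.
```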
5. Best Practices and Limitations
Recommended Practices
- Use explicit prompt engineering to specify refactoring type and intent; supply subcategory motivation and minimize code context to focus model attention (Liu et al., 2024, Cordeiro et al., 2024).
- Combine one- or few-shot prompting with multi-proposal generation (pass@3/pass@5); this combination is empirically observed to maximize both correctness and code quality improvement (Shirafuji et al., 2023, Cordeiro et al., 2024, Tapader et al., 26 Nov 2025).
- Integrate static analysis, compilation, and automated test feedback into refactoring pipelines (Pomian et al., 2024, Oueslati et al., 5 Nov 2025).
- Always validate LLM outputs with automated test suites and, where possible, static linters and code smell detectors (Midolo et al., 19 Jan 2026, Zhang et al., 2024); a validation-gate sketch follows this list.
- Deploy LLM pipelines in CI environments with automated refactoring, auto-testing, and human approval loops (Cordeiro et al., 2024, Oueslati et al., 5 Nov 2025).
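A sketch of such a validation gate is shown below, assuming a Python project checked with pytest and pylint; the two command lines are ordinary invocations of those tools, but the gate logic and its placement in CI are illustrative.

```python
import subprocess

def validate_refactoring(repo_dir: str) -> bool:
    """Accept a refactored tree only if the test suite passes and the
    linter reports no errors (illustrative CI gate)."""
    tests = subprocess.run(["pytest", "-q"], cwd=repo_dir)
    if tests.returncode != 0:
        return False  # functional regression: reject the refactoring
    lint = subprocess.run(["pylint", "--errors-only", "src/"], cwd=repo_dir)
    return lint.returncode == 0  # reject on newly introduced lint errors

# A failed gate would route the change to human review rather than
# merging it automatically.
```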
Principal Limitations
- LLM Hallucinations: Unsafe, uncompilable, or semantically altering edits occur in 6–8% of outputs unless filtered (Liu et al., 2024, Cordeiro et al., 2024).
- Context Boundaries: Inability to reason globally across large, multi-file codebases is a key bottleneck; modular or RAG-based architectures partially mitigate this (Batole et al., 26 Mar 2025, Oueslati et al., 5 Nov 2025).
- Overrefactoring: Tendency to modify even trivial or already well-factored code, occasionally worsening readability or introducing subtle bugs (Shirafuji et al., 2023, Midolo et al., 19 Jan 2026).
- Comment/Metadata Loss: LLMs may omit, translate, or drop comments, harming understandability (Shirafuji et al., 2023); a lightweight guard is sketched after this list.
- Scalability Concerns: Large-scale, multi-module refactorings (e.g., package reorganizations) often fail due to limited model context (Cordeiro et al., 2024, Oueslati et al., 5 Nov 2025).
- Semantic Naming: LLM proposals for variable or method names are sometimes non-idiomatic or misleading (Liu et al., 2024).
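As one lightweight guard against the comment-loss failure mode, comment tokens can be counted before and after refactoring and any drop flagged for review; this heuristic is our sketch, not a technique from the cited studies.

```python
import io
import tokenize

def count_comments(source: str) -> int:
    """Count '#' comment tokens in Python source."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in tokens if tok.type == tokenize.COMMENT)

def flag_comment_loss(before: str, after: str) -> bool:
    """Flag the refactoring for review if comments were dropped."""
    return count_comments(after) < count_comments(before)
```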
6. Domains, Extensions, and Future Directions
LLM-driven refactoring extends beyond general-purpose programming. Demonstrated applications include:
- Unit Test Code Quality: LLM+DSL-driven frameworks (e.g., UTRefactor) achieve 89% test-smell reduction across six Java projects, far exceeding prior tools (Gao et al., 2024).
- Quantum Code Migration: Taxonomy-guided LLM prompting supports complex migrations (e.g., Qiskit v0.45→0.46), with higher precision/recall for API change identification than non-taxonomy prompts (Suárez et al., 17 Jun 2025, Suárez et al., 8 Jun 2025).
- Python Idiomatization: Hybrid LLM+analytic rule systems (RIdiom) outperform purely neural or rule-based baselines (>90% F1 on idiom transformation) (Zhang et al., 2024).
- Energy-Aware HPC Refactoring: Iterative, agentic LLM pipelines (LASSI-EE) produce ∼47% energy reduction on GPU scientific kernels (Dearing et al., 4 May 2025).
Anticipated directions include:
- Extending agent-based frameworks to new languages and deeper refactoring types (Oueslati et al., 5 Nov 2025, Siddeeq et al., 24 Jun 2025).
- Human-agent collaborative workflows for higher-level, domain-specific, or architectural refactorings (Oueslati et al., 5 Nov 2025, Robredo et al., 9 Sep 2025).
- Improved natural-language rationales to help developers weigh tradeoffs in LLM-suggested transformations (Zhang et al., 2024, Robredo et al., 9 Sep 2025).
- Automated taxonomy extraction and retrieval-augmented context for robust migration/refactoring in evolving libraries (Suárez et al., 17 Jun 2025, Suárez et al., 8 Jun 2025).
7. Summary
LLM-driven refactoring is a rapidly maturing paradigm for automated code restructuring that leverages the pattern-recognition and generative capabilities of modern LLMs. When paired with context-aware prompts, hybrid retrieval/static-analysis systems, agent-based toolchains, and rigorous verification, LLMs can achieve or surpass developer performance on systematic, maintainability-driven transformations. However, safe deployment at scale requires verification scaffolding, context management, and human oversight, especially for complex, architecture-level or high-stakes refactoring scenarios. Current research continues to expand the domain reach and capabilities of LLM-driven refactoring, with ongoing work in multi-language support, explainable recommendations, and integration with established developer workflows (Cordeiro et al., 2024, Liu et al., 2024, Tapader et al., 26 Nov 2025, Xu et al., 18 Mar 2025, Oueslati et al., 5 Nov 2025, Midolo et al., 19 Jan 2026, Pomian et al., 2024, Gao et al., 2024, Zhang et al., 2024, Suárez et al., 17 Jun 2025, Shirafuji et al., 2023, Khairnar et al., 12 Aug 2025, Dearing et al., 4 May 2025, Batole et al., 26 Mar 2025, Siddeeq et al., 24 Jun 2025, Robredo et al., 9 Sep 2025).