LLM Repair: Automated Code Correction
- LLM Repair is a technique that applies LLMs to automatically repair non-compilable code using compiler logs and exemplar patches.
- It leverages contextual prompt synthesis—combining source files, error logs, and representative fixes—to generate precise patches for CI environments.
- Evaluations show that CodeLlama achieves a 63% repair success rate with repairs often completed within 8 minutes, dramatically reducing manual debugging time.
LLM Repair encompasses a spectrum of techniques that enable models—especially code-generating LLMs—to correct faults in software artifacts, repair their own internal knowledge, or facilitate the automatic remediation of code errors in software engineering workflows. In recent research, "LLM Repair" primarily refers to (1) leveraging LLMs as autonomous or assisted agents for fixing code-level errors in source code, often in the absence of test cases, and (2) applying targeted updates to the model's own parameters to remediate systematic failures in generated outputs. The field bridges traditional Automated Program Repair (APR) methodologies and emerging LLM-centric paradigms, with direct applications to continuous integration (CI), industrial codebases, and embedded system development.
1. Automated Repair in the Absence of Test Cases
One of the central contributions in LLM Repair is the introduction and evaluation of architectures that repair non-compilable code—specifically compilation errors—when conventional tests cannot be run, as typified by the "Shadow Job" pipeline for industrial embedded systems (Fu et al., 15 Oct 2025). Unlike test-based APR (which proposes and validates patches against regression oracles), LLM-driven compilation-error repair must operate using only static compiler diagnostics and historical fix patterns.
Core Pipeline Components
- Trigger Mechanism: Upon CI build failure at the compilation stage, a parallel "Shadow Job" is invoked. The system scrapes logs to extract error diagnostics structured as error blocks (file names, line numbers, error codes, descriptions); a parsing sketch follows this list.
- Prompt Synthesis: Prompts are constructed per failed source file, incorporating up to four context elements: the full file (I0), its error log (I1), an erroneous snippet (I2; 1-3 lines around the error), and a single representative human-written fix for this error category (I3) mined from historical commits.
- LLM Patch Generation: A state-of-the-art LLM receives the prompt and generates a single-file patch hypothesized to correct the error. No direct multi-file or architectural changes are attempted in the base pipeline.
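The paper's Shadow Job implementation is not reproduced here; the following is a minimal Python sketch of the log-scraping step, assuming GCC/Clang-style `file:line:column: error: message` diagnostics. The `ErrorBlock` type and `parse_error_blocks` helper are illustrative names, not taken from the pipeline itself.

```python
import re
from dataclasses import dataclass

# Matches GCC/Clang-style diagnostics, e.g.
#   src/driver/can_if.cpp:142:5: error: 'CanFrame' was not declared in this scope
DIAGNOSTIC_RE = re.compile(
    r"^(?P<file>[^:\s]+):(?P<line>\d+):(?:(?P<col>\d+):)?\s*error:\s*(?P<message>.+)$"
)

@dataclass
class ErrorBlock:
    """One compilation error extracted from the CI build log."""
    file: str
    line: int
    message: str

def parse_error_blocks(build_log: str) -> list[ErrorBlock]:
    """Scrape a raw build log into structured error blocks (file, line, description)."""
    blocks = []
    for raw in build_log.splitlines():
        match = DIAGNOSTIC_RE.match(raw.strip())
        if match:
            blocks.append(ErrorBlock(
                file=match.group("file"),
                line=int(match.group("line")),
                message=match.group("message"),
            ))
    return blocks
```

In practice the extracted blocks would also carry the error code and category label used to retrieve the representative historical fix (I3).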
Prompt Variant Optimization
Empirical evaluation demonstrates the importance of prompt construction. The most effective configuration combines the full file, error log, minimal failed snippet, and an archetypal fix, achieving substantially higher success rates than prompts with less or more context.
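The exact prompt templates are not published in the source; the sketch below shows one plausible way to assemble the best-performing combination of I0–I3 for a failed file. The template wording and the `build_repair_prompt` name are assumptions made for illustration, not the paper's verbatim prompt.

```python
def build_repair_prompt(full_file: str, error_log: str,
                        failed_snippet: str, exemplar_fix: str) -> str:
    """Assemble a repair prompt from the four context elements:
    I0 = full source file, I1 = compiler error log,
    I2 = 1-3 line erroneous snippet, I3 = one representative historical fix.
    """
    return (
        "You are repairing a C/C++ file that fails to compile.\n\n"
        f"### Source file (I0)\n{full_file}\n\n"
        f"### Compiler error log (I1)\n{error_log}\n\n"
        f"### Failing snippet (I2)\n{failed_snippet}\n\n"
        f"### Example fix for this error category (I3)\n{exemplar_fix}\n\n"
        "Return the corrected version of the failing snippet only."
    )
```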
2. Dataset, Models, and Experimental Setup
The LLM Repair paradigm relies on longitudinal industrial codebases and extensive CI artifact collection:
- Dataset: Over 40,000 CI build failures were collected from a year’s CI trace of a large embedded C/C++ product. A random sample of 1,000 unique compilation errors served as the experimental corpus.
- LLM Selection: Four LLMs (~7B parameters each) were benchmarked: two code-specialized models (CodeT5+, CodeLlama) and two general-purpose models (Falcon, Bloom). Pretraining on code is shown to significantly influence repair efficacy.
Error Taxonomy
Compilation failures encompass:
- Missing headers
- Undefined/unknown symbols
- Static-check violations
- Incorrect API usage
- Enum-switch incompleteness
- Namespace or type clashes
- "Missing include" and single-line syntax errors
3. Metrics and Quantitative Results
Principal Evaluation Metrics
- Repair Success Rate: the fraction of sampled compilation errors for which the generated patch compiles and is accepted by the CI re-run.
- Reasonable Fix Ratio: the fraction of successful patches that human reviewers judge identical to, or a plausible alternative to, the developer's actual fix.
- Repair Latency: wall-clock time-to-fix, from initial failure detection to successful CI re-run acceptance (a notational sketch follows this list).
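Using the notation that reappears in the summary table of Section 7 (N_fixed, T_debug, R_success, R_reasonable), one consistent formalization of these metrics is sketched below; the symbols N_total (size of the 1,000-error sample) and N_reasonable (patches judged reasonable by reviewers) are names assumed for this sketch only.

$$
R_{\text{success}} = \frac{N_{\text{fixed}}}{N_{\text{total}}}, \qquad
R_{\text{reasonable}} = \frac{N_{\text{reasonable}}}{N_{\text{fixed}}}, \qquad
T_{\text{debug}} = t_{\text{CI accept}} - t_{\text{failure detected}}
$$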
Headline Outcomes
| Model | Repair Success (%) | Reasonable Fix Ratio (%) | Time-to-Repair |
|---|---|---|---|
| CodeLlama | 63 | 83 | 64% within 8 min |
| CodeT5+ | 58 | — | — |
| Falcon | 43 | — | — |
| Bloom | 39 | — | — |
| Human developer | ~100 | — | typically hours |
- An LLM-equipped CI pipeline resolves up to 63% of compilation errors that were previously beyond the reach of automated repair.
- Qualitative Patch Review: Among successful patches by CodeLlama, 17% were identical or semantically equivalent to human fixes, 66% plausible but distinct, and 17% implausible.
- Latency: More than 60% of successful repairs complete within 8 minutes; the majority of manual resolutions take multiple hours.
Task Class Scope
LLMs excel in resolving errors confined to a few lines or requiring limited local context. Cross-file dependency changes, deep semantic corrections, and architectural restructurings remain largely out of scope for current configurations.
4. Comparative Analysis and Design Factors
Model Type
Models pretrained on code (CodeT5+, CodeLlama) consistently surpass general LLMs (Falcon, Bloom) by approximately 20 percentage points in repair success, supporting the necessity of code-centric pretraining for this domain.
Prompt Content
Prompt composition exhibits a nontrivial trade-off:
- Excessively verbose prompts (e.g., entire source file without guided context) distract the model.
- Overly terse prompts (e.g., error log only) deprive the LLM of the requisite context.
- The optimal configuration is prompt variant #6, which includes the raw file, error log, precise snippet, and a single exemplary fix.
Limitations and Failure Modes
- Multi-file or build-configuration errors (e.g., missing includes in other files) are not addressed.
- Logical bugs unobservable until runtime are out of reach (test cases not available).
- Multi-hunk or fundamentally refactored solutions (spanning CI phases) are not reliably generated.
- Patch rank selection is not addressed; the system applies the candidate directly to the error block.
5. Implications for Workflow and Deployment
Advantages Over Manual Debugging
LLM repair slashes debugging turnaround from hours per error to minutes per error in the majority of cases, suggesting substantial productivity gains in CI pipelines even for large, safety-critical codebases. By leveraging compiler logs and historical fix patterns alone, the system automates initial repair attempts that would otherwise block continuous delivery.
Practical Deployment Considerations
- Computational Footprint: The LLM inference costs and response times (≤8 min in most cases) align with industrial CI cadence.
- Integration: The Shadow Job framework can be layered alongside any existing CI system, requiring only log scraping and prompt generation.
- Patch Validation: Only compilation passes are available as an oracle; downstream functional errors require human-in-the-loop or subsequent test-phase validation.
- Iterative/Chained Repair: Up to five "iterations" are permitted before falling back to manual repair; a loop sketch follows this list.
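As an illustration of that retry budget, the following minimal Python sketch wires the pieces together; `generate_patch`, `apply_patch`, and `compiles` are hypothetical callables standing in for LLM inference, patch application, and the compilation-only oracle, not interfaces defined by the paper.

```python
from typing import Callable, Optional

MAX_ITERATIONS = 5  # the pipeline permits up to five attempts before manual fallback

def shadow_job_repair(
    prompt: str,
    source_file: str,
    generate_patch: Callable[[str], str],    # hypothetical: wraps LLM inference on the prompt
    apply_patch: Callable[[str, str], str],  # hypothetical: applies a candidate patch to the file
    compiles: Callable[[str], bool],         # hypothetical: compilation-only oracle
) -> Optional[str]:
    """Attempt up to MAX_ITERATIONS LLM repairs; compilation is the only validation oracle.
    Returns the repaired file contents, or None to signal fallback to manual repair."""
    for _ in range(MAX_ITERATIONS):
        candidate = apply_patch(source_file, generate_patch(prompt))
        if compiles(candidate):
            return candidate
    return None  # retry budget exhausted: escalate to a developer
```

A fuller loop would regenerate the prompt from the fresh compiler log between attempts, which is exactly the chained-repair direction discussed below.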
Prospective Enhancements
- Chaining prompts in a stepwise/iterative fashion (i.e., chain-of-thought or self-refinement) could resolve complex, multi-step failures.
- Integrating lightweight static analysis could serve as a pre-check to filter dangerous or incorrect LLM-generated patches.
- Fine-tuning models on internal bug/fix corpora is likely to produce further gains.
- Generalizing to multi-file repair and dynamic test validation will be essential for system-wide adoption.
6. Broader Context in Automated Program Repair
LLM repair for codebases lacking test suites represents a significant expansion in the scope of APR research. Previously, the dependence on test-based oracles precluded automated fix generation for non-compilable software artifacts, especially in embedded systems and hardware–software co-development scenarios. The empirical demonstration of ∼63% automated repair, with ∼83% of accepted patches deemed reasonable, establishes feasibility for downstream tooling and encourages further research into artifact- and context-limited repair.
Moreover, this approach aligns with emerging trends in agentic and context-aware code synthesis, highlighting the move toward automated CI mediation, especially where software artifacts are large, safety critical, or otherwise challenging for traditional APR pipelines.
7. Summary Table: Pipeline Phases and Metrics
| Phase | Mechanism | Output / Metric |
|---|---|---|
| Error Detection | CI log parsing | Error block (file, log, snippet) |
| Prompt Construction | Contextualization + Exemplar | LLM prompt (I0–I3 elements) |
| Patch Generation | LLM inference | Candidate patch |
| Validation | Compilation only | N_fixed, T_debug |
| Evaluation | Human + CI comparison | R_success, R_reasonable |
This framework provides a template for scalable, real-world LLM-powered repair of software artifacts in CI environments where tests are unavailable or incomplete, and establishes rigorous benchmarks for prompt design, code-specialized modeling, and deployment on operational codebases.