LLM Repair: Automated Code Correction
- LLM Repair is a technique that applies LLMs to automatically repair non-compilable code using compiler logs and exemplar patches.
- It leverages contextual prompt synthesis—combining source files, error logs, and representative fixes—to generate precise patches for CI environments.
- Evaluations show that CodeLlama achieves a 63% repair success rate with repairs often completed within 8 minutes, dramatically reducing manual debugging time.
LLM Repair encompasses a spectrum of techniques that enable models—especially code-generating LLMs—to correct faults in software artifacts, repair their own internal knowledge, or facilitate the automatic remediation of code errors in software engineering workflows. In recent research, "LLM Repair" primarily refers to (1) leveraging LLMs as autonomous or assisted agents for fixing code-level errors in source code, often in the absence of test cases, and (2) applying targeted updates to the model's own parameters to remediate systematic failures in generated outputs. The field bridges traditional Automated Program Repair (APR) methodologies and emerging LLM-centric paradigms, with direct applications to continuous integration (CI), industrial codebases, and embedded system development.
1. Automated Repair in the Absence of Test Cases
One of the central contributions in LLM Repair is the introduction and evaluation of architectures that repair non-compilable code—specifically compilation errors—when conventional tests cannot be run, as typified by the "Shadow Job" pipeline for industrial embedded systems (Fu et al., 15 Oct 2025). Unlike test-based APR (which proposes and validates patches against regression oracles), LLM-driven compilation-error repair must operate using only static compiler diagnostics and historical fix patterns.
Core Pipeline Components
- Trigger Mechanism: Upon CI build failure at the compilation stage, a parallel "Shadow Job" is invoked. The system scrapes logs to extract error diagnostics structured as error blocks (file names, line numbers, error codes, descriptions); a parsing sketch follows this list.
- Prompt Synthesis: Prompts are constructed per failed source file, incorporating up to four context elements: the full file (I0), its error log (I1), an erroneous snippet (I2; 1-3 lines around the error), and a single representative human-written fix for this error category (I3) mined from historical commits.
- LLM Patch Generation: A state-of-the-art LLM receives the prompt and generates a single-file patch hypothesized to correct the error. No direct multi-file or architectural changes are attempted in the base pipeline.
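The paper's Shadow Job implementation is not reproduced here; the following is a minimal Python sketch of the log-scraping step, assuming GCC/Clang-style `file:line:column: error: message` diagnostics. The `ErrorBlock` type and `parse_error_blocks` helper are illustrative names, not taken from the pipeline itself.

```python
import re
from dataclasses import dataclass

# Matches GCC/Clang-style diagnostics, e.g.
#   src/driver/can_if.cpp:142:5: error: 'CanFrame' was not declared in this scope
DIAGNOSTIC_RE = re.compile(
    r"^(?P<file>[^:\s]+):(?P<line>\d+):(?:(?P<col>\d+):)?\s*error:\s*(?P<message>.+)$"
)

@dataclass
class ErrorBlock:
    """One compilation error extracted from the CI build log."""
    file: str
    line: int
    message: str

def parse_error_blocks(build_log: str) -> list[ErrorBlock]:
    """Scrape a raw build log into structured error blocks (file, line, description)."""
    blocks = []
    for raw in build_log.splitlines():
        match = DIAGNOSTIC_RE.match(raw.strip())
        if match:
            blocks.append(ErrorBlock(
                file=match.group("file"),
                line=int(match.group("line")),
                message=match.group("message"),
            ))
    return blocks
```

In practice the extracted blocks would also carry the error code and category label used to retrieve the representative historical fix (I3).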
Prompt Variant Optimization
Empirical evaluation demonstrates the importance of prompt construction. The most effective configuration combines the full file, error log, minimal failed snippet, and an archetypal fix, achieving substantially higher success rates than prompts with less or more context.
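The exact prompt templates are not published in the source; the sketch below shows one plausible way to assemble the best-performing combination of I0–I3 for a failed file. The template wording and the `build_repair_prompt` name are assumptions made for illustration, not the paper's verbatim prompt.

```python
def build_repair_prompt(full_file: str, error_log: str,
                        failed_snippet: str, exemplar_fix: str) -> str:
    """Assemble a repair prompt from the four context elements:
    I0 = full source file, I1 = compiler error log,
    I2 = 1-3 line erroneous snippet, I3 = one representative historical fix.
    """
    return (
        "You are repairing a C/C++ file that fails to compile.\n\n"
        f"### Source file (I0)\n{full_file}\n\n"
        f"### Compiler error log (I1)\n{error_log}\n\n"
        f"### Failing snippet (I2)\n{failed_snippet}\n\n"
        f"### Example fix for this error category (I3)\n{exemplar_fix}\n\n"
        "Return the corrected version of the failing snippet only."
    )
```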
2. Dataset, Models, and Experimental Setup
The LLM Repair paradigm relies on longitudinal industrial codebases and extensive CI artifact collection:
- Dataset: Over 40,000 CI build failures were collected from a year’s CI trace of a large embedded C/C++ product. A random sample of 1,000 unique compilation errors served as the experimental corpus.
- LLM Selection: Four LLMs (~7B parameters each) were benchmarked: two code-specialized models (CodeT5+, CodeLlama) and two general-purpose models (Falcon, Bloom). Pretraining on code is shown to significantly influence repair efficacy.
Error Taxonomy
Compilation failures encompass:
- Missing headers
- Undefined/unknown symbols
- Static-check violations
- Incorrect API usage
- Enum-switch incompleteness
- Namespace or type clashes
- "Missing include" and single-line syntax errors
3. Metrics and Quantitative Results
Principal Evaluation Metrics
- Repair Success Rate: the fraction of sampled compilation errors for which the generated patch compiles and is accepted by the CI re-run.
- Reasonable Fix Ratio: the fraction of successful patches that human reviewers judge identical to, or a plausible alternative to, the developer's actual fix.
- Repair Latency: wall-clock time-to-fix, from initial failure detection to successful CI re-run acceptance (a notational sketch follows this list).
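Using the notation that reappears in the summary table of Section 7 (N_fixed, T_debug, R_success, R_reasonable), one consistent formalization of these metrics is sketched below; the symbols N_total (size of the 1,000-error sample) and N_reasonable (patches judged reasonable by reviewers) are names assumed for this sketch only.

$$
R_{\text{success}} = \frac{N_{\text{fixed}}}{N_{\text{total}}}, \qquad
R_{\text{reasonable}} = \frac{N_{\text{reasonable}}}{N_{\text{fixed}}}, \qquad
T_{\text{debug}} = t_{\text{CI accept}} - t_{\text{failure detected}}
$$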
Headline Outcomes
| Model | Repair Success (%) | Reasonable Fix Ratio (%) | Time-to-Repair |
|---|---|---|---|
| CodeLlama | 63 | 83 | 64% within 8 min |
| CodeT5+ | 58 | — | — |
| Falcon | 43 | — | — |
| Bloom | 39 | — | — |
| Human developer | ~100 | — | typically hours |
- An LLM-equipped CI pipeline resolves up to 63% of compilation errors that were previously beyond the reach of automated repair.
- Qualitative Patch Review: Among successful patches by CodeLlama, 17% were identical or semantically equivalent to human fixes, 66% plausible but distinct, and 17% implausible.
- Latency: More than 60% of successful repairs complete within 8 minutes; the majority of manual resolutions take multiple hours.
Task Class Scope
LLMs excel in resolving errors confined to a few lines or requiring limited local context. Cross-file dependency changes, deep semantic corrections, and architectural restructurings remain largely out of scope for current configurations.
4. Comparative Analysis and Design Factors
Model Type
Models pretrained on code (CodeT5+, CodeLlama) consistently surpass general LLMs (Falcon, Bloom) by approximately 20 percentage points in repair success, supporting the necessity of code-centric pretraining for this domain.
Prompt Content
Prompt composition exhibits a nontrivial trade-off:
- Excessively verbose prompts (e.g., entire source file without guided context) distract the model.
- Overly terse prompts (e.g., error log only) deprive the LLM of the requisite context.
- The optimal configuration is prompt variant #6, which includes the raw file, error log, precise snippet, and a single exemplary fix.
Limitations and Failure Modes
- Multi-file or build-configuration errors (e.g., missing includes in other files) are not addressed.
- Logical bugs unobservable until runtime are out of reach (test cases not available).
- Multi-hunk or fundamentally refactored solutions (spanning CI phases) are not reliably generated.
- Patch rank selection is not addressed; the system applies the candidate directly to the error block.
5. Implications for Workflow and Deployment
Advantages Over Manual Debugging
LLM repair slashes debugging turnaround from hours per error to minutes per error in the majority of cases, suggesting substantial productivity gains in CI pipelines even for large, safety-critical codebases. By leveraging compiler logs and historical fix patterns alone, the system automates initial repair attempts that would otherwise block continuous delivery.
Practical Deployment Considerations
- Computational Footprint: The LLM inference costs and response times (≤8 min in most cases) align with industrial CI cadence.
- Integration: The Shadow Job framework can be layered alongside any existing CI system, requiring only log scraping and prompt generation.
- Patch Validation: Only compilation passes are available as an oracle; downstream functional errors require human-in-the-loop or subsequent test-phase validation.
- Iterative/Chained Repair: Up to five "iterations" are permitted before falling back to manual repair; a loop sketch follows this list.
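As an illustration of that retry budget, the following minimal Python sketch wires the pieces together; `generate_patch`, `apply_patch`, and `compiles` are hypothetical callables standing in for LLM inference, patch application, and the compilation-only oracle, not interfaces defined by the paper.

```python
from typing import Callable, Optional

MAX_ITERATIONS = 5  # the pipeline permits up to five attempts before manual fallback

def shadow_job_repair(
    prompt: str,
    source_file: str,
    generate_patch: Callable[[str], str],    # hypothetical: wraps LLM inference on the prompt
    apply_patch: Callable[[str, str], str],  # hypothetical: applies a candidate patch to the file
    compiles: Callable[[str], bool],         # hypothetical: compilation-only oracle
) -> Optional[str]:
    """Attempt up to MAX_ITERATIONS LLM repairs; compilation is the only validation oracle.
    Returns the repaired file contents, or None to signal fallback to manual repair."""
    for _ in range(MAX_ITERATIONS):
        candidate = apply_patch(source_file, generate_patch(prompt))
        if compiles(candidate):
            return candidate
    return None  # retry budget exhausted: escalate to a developer
```

A fuller loop would regenerate the prompt from the fresh compiler log between attempts, which is exactly the chained-repair direction discussed below.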
Prospective Enhancements
- Chaining prompts in a stepwise/iterative fashion (i.e., chain-of-thought or self-refinement) could resolve complex, multi-step failures.
- Integrating lightweight static analysis could serve as a pre-check to filter dangerous or incorrect LLM-generated patches.
- Fine-tuning models on internal bug/fix corpora is likely to produce further gains.
- Generalizing to multi-file repair and dynamic test validation will be essential for system-wide adoption.
6. Broader Context in Automated Program Repair
LLM repair for codebases lacking test suites represents a significant expansion in the scope of APR research. Previously, the dependence on test-based oracles precluded automated fix generation for non-compilable software artifacts, especially in embedded systems and hardware–software co-development scenarios. The empirical demonstration of ∼63% automated repair, with ∼83% of accepted patches deemed reasonable, establishes feasibility for downstream tooling and encourages further research into artifact- and context-limited repair.
Moreover, this approach aligns with emerging trends in agentic and context-aware code synthesis, highlighting the move toward automated CI mediation, especially where software artifacts are large, safety critical, or otherwise challenging for traditional APR pipelines.
7. Summary Table: Pipeline Phases and Metrics
| Phase | Mechanism | Output / Metric |
|---|---|---|
| Error Detection | CI log parsing | Error block (file, log, snippet) |
| Prompt Construction | Contextualization + Exemplar | LLM prompt (I0–I3 elements) |
| Patch Generation | LLM inference | Candidate patch |
| Validation | Compilation only | N_fixed, T_debug |
| Evaluation | Human + CI comparison | R_success, R_reasonable |
This framework provides a template for scalable, real-world LLM-powered repair of software artifacts in CI environments where tests are unavailable or incomplete, and establishes rigorous benchmarks for prompt design, code-specialized modeling, and deployment on operational codebases.