RepairAgent: Autonomous Software Repair
- RepairAgent is an autonomous software engineering system that leverages LLMs and specialized agents to plan, synthesize, and validate code repairs.
- It integrates iterative context gathering, fault localization, and dynamic feedback from test suites and symbolic validators to refine patch generation.
- Architectural paradigms span single-agent, multi-agent, and hybrid neuro-symbolic approaches, each enhancing robustness and generalizability in automated repair.
A RepairAgent is an autonomous or collaborative software engineering system, typically built around LLMs and supporting toolchains, whose core capability is the planning, synthesis, and validation of program repairs or patch suggestions. These agents leverage multi-step reasoning, dynamically orchestrated tool use, and execution feedback to localize faults, propose fixes, and verify them against one or more oracles such as test suites, formal specifications, or structured critiques. RepairAgent systems contrast with classical program repair pipelines by tightly interleaving action planning, contextual information gathering, and iterative patch refinement—often adopting explicit multi-agent or neuro-symbolic architectures for improved robustness and generalizability.
1. Architectural Paradigms and Agent Decomposition
RepairAgent frameworks can be classified by their internal structure and division of responsibilities among sub-agents:
- Single-agent (monolithic) architectures wrap an LLM in a closed-loop (e.g., ReAct, FSM) where the entire repair process—information gathering, fault localization, patch synthesis, and testing—is controlled by a single policy and prompt context (Bouzenia et al., 2024, Bouzenia et al., 23 Jun 2025).
- Multi-agent approaches decompose the workflow into specialized agents with clear role partitioning: for example, CodeR’s five-agent graph (Manager, Reproducer, Fault Localizer, RepairAgent[Editor], ReviewAgent), RAMP’s four collaborating agents (Feedback Integrator, Test Designer, Programmer, Test Executor), and AIR’s tripartite system (Context, Maintenance, Editor) (Chen et al., 2024, Akbarpour et al., 6 Nov 2025, Kaliutau, 9 Dec 2025).
- Hybrid neuro-symbolic systems couple an LLM agent loop with explicit static analysis, test execution, and external oracles (e.g., symbolic validators, code knowledge graphs, program state monitors) to ground neural patch generation in symbolic feedback (Maddila et al., 24 Jul 2025, Liu et al., 2024, Kaliutau, 9 Dec 2025).
Notably, RAMP for Ruby exemplifies a lightweight, feedback-centered multi-agent formulation optimized for rapid convergence, while MarsCode Agent and AIR illustrate deep integration of semantic program representations and planner-driven agent orchestration.
2. Repair Loop Dynamics and Feedback Integration
RepairAgents universally implement an iterative reasoning-and-action loop. At each iteration:
- The agent (or one of its sub-agents) observes the current artifact state—bug report, code, failed tests, execution traces.
- It synthesizes an evidence-informed next action: gather further context, generate tests, propose edits, or validate with oracles.
- The system applies or executes the chosen action, collects the outcome, and incorporates it into the evolving prompt or graph state.
- Termination occurs when a stopping criterion is met: all tests pass, oracles succeed, or iteration/cost budgets are exhausted.
This dynamic is formalized in various frameworks via finite state machines (FSMs) (Bouzenia et al., 2024), plan-execute graphs (Chen et al., 2024), or recursive tree-of-thoughts search (Luo et al., 25 Nov 2025). Central to efficacy is the real-time injection of execution feedback (test verdicts, error traces, state diffs) and the corresponding realignment of the agent's reasoning (refined hypotheses, self-reflections, patch adjustments), as observed in RAMP's iterative Reflector loop and AdverIntent-Agent’s adversarial feedback cycles (Akbarpour et al., 6 Nov 2025, Ye et al., 19 May 2025).
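The observe–act–validate loop described above can be sketched generically. In this sketch, `propose_patch` and `run_tests` are hypothetical stand-ins for the LLM policy and the test oracle; real systems add richer actions (context gathering, test generation) and richer feedback (traces, state diffs).

```python
def repair_loop(propose_patch, run_tests, max_iters=10):
    """Generic closed repair loop: propose a patch, validate it against
    an oracle, and feed the verdict back into the next proposal."""
    feedback = None
    for i in range(max_iters):
        patch = propose_patch(feedback)   # act: synthesize a candidate
        verdict = run_tests(patch)        # validate against the oracle
        if verdict["all_pass"]:           # stopping criterion: tests pass
            return {"patch": patch, "iters": i + 1, "status": "fixed"}
        feedback = verdict                # inject execution feedback
    return {"patch": None, "iters": max_iters, "status": "budget_exhausted"}
```

The iteration/cost budget (`max_iters`) implements the second stopping criterion: the loop terminates even when no plausible patch is found.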
3. Information Gathering, Fault Localization, and Context Management
Robust repair requires identifying relevant context and precisely localizing the fault:
- Code retrieval and fault localization: Spectrum-based fault localization (SBFL, Ochiai), code knowledge graphs (CKG), blame heuristics, and RL-guided data provenance tracing are prevalent for narrowing down probable bug regions (Chen et al., 2024, Kaliutau, 9 Dec 2025, Shi et al., 2 Nov 2025). SBFL algorithms compute suspiciousness scores; for example, Ochiai assigns each program element $e$ the score $\mathrm{susp}(e) = \frac{\mathrm{fail}(e)}{\sqrt{\mathrm{totalfail} \cdot (\mathrm{fail}(e) + \mathrm{pass}(e))}}$, where $\mathrm{fail}(e)$ and $\mathrm{pass}(e)$ count the failing and passing tests that execute $e$, and $\mathrm{totalfail}$ is the total number of failing tests.
- Prompt construction and context assembly: Prompts typically combine non-historical metadata (bug description, failing tests, localization lines) and history-derived context (blame diffs, function history) to guide LLM synthesis (Shi et al., 2 Nov 2025).
- Avoiding the "Semantic Trap": DTG-based representations in AIR enable causal rather than merely semantically similar retrieval, ensuring that only code with direct data lineage to the buggy state is traversed (Kaliutau, 9 Dec 2025).
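The Ochiai score described above is straightforward to compute from per-test coverage. The sketch below assumes a simple coverage representation (sets of executed elements per test); this is a textbook implementation of the formula, not any particular tool's code.

```python
import math

def ochiai(coverage, test_results):
    """Ochiai suspiciousness per program element.

    coverage:     {test_name: set of elements executed by that test}
    test_results: {test_name: True if the test passed, False if it failed}
    """
    total_fail = sum(1 for ok in test_results.values() if not ok)
    elements = set().union(*coverage.values())
    scores = {}
    for e in elements:
        ef = sum(1 for t, elems in coverage.items()
                 if e in elems and not test_results[t])  # failing tests covering e
        ep = sum(1 for t, elems in coverage.items()
                 if e in elems and test_results[t])      # passing tests covering e
        denom = math.sqrt(total_fail * (ef + ep))
        scores[e] = ef / denom if denom else 0.0
    return scores
```

Elements executed only by failing tests receive the maximal score of 1.0, which is why SBFL output is typically consumed as a ranked list of suspects rather than a hard verdict.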
Table 1: Core Context Sources Employed by RepairAgents
| Method/Agent | Contextual Signal | Mechanism |
|---|---|---|
| RAMP | Sample I/O, error traces | Self-generated tests, reflection |
| MarsCode, SemAgent | Dynamic traces, SBFL | Execution feedback, entity extraction |
| HAFixAgent | Blame, historic diffs | Prompt history injection |
| AIR | Data transformation graph | RL-guided causal tracing |
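Combining contextual signals like those in Table 1 into a prompt can be sketched as below. The section labels and parameter names are purely illustrative; published systems use their own prompt formats.

```python
def build_repair_prompt(bug_report, failing_tests, suspicious_lines,
                        blame_diffs=None, error_trace=None):
    """Assemble non-historical metadata (bug report, failing tests,
    SBFL output) plus optional history-derived context (blame diffs)
    and dynamic signals (error traces) into one prompt string."""
    sections = [
        ("Bug report", bug_report),
        ("Failing tests", "\n".join(failing_tests)),
        ("Suspicious lines (SBFL)", "\n".join(suspicious_lines)),
    ]
    if blame_diffs:
        sections.append(("Blame history", "\n".join(blame_diffs)))
    if error_trace:
        sections.append(("Error trace", error_trace))
    sections.append(("Task", "Propose a minimal patch that makes all tests pass."))
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)
```

Keeping optional signals behind flags reflects how agents assemble context incrementally: history or trace sections are only injected once the corresponding tool has actually been run.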
4. Patch Generation, Review, and Validation Protocols
Patch synthesis in RepairAgent systems is informed by the available context and utilizes targeted LLM prompt strategies:
- Chain-of-Thought (CoT) and Structured CoT: Agents prompt LLMs with multi-step or decomposed chain prompts integrating specification comprehension, root cause analysis, and code synthesis (Akbarpour et al., 6 Nov 2025, Ye et al., 19 May 2025).
- Patch validation and selection: Multi-stage filtering—compilation, test execution (pass@k, Top@n), regression detection, and symbolic checks—ensures plausible and semantically coherent repairs (Bouzenia et al., 2024, Ye et al., 19 May 2025, Maddila et al., 24 Jul 2025).
- Reviewer/re-ranking stages: Reviewer agents and classifiers (e.g., LLM-as-a-Judge) assess patch quality, relevance, and non-regression before submission to humans or deployment (Maddila et al., 24 Jul 2025, Chen et al., 2024, Joos et al., 15 Sep 2025).
Formally, patch scoring may be expressed as a weighted combination such as $S(p) = w_{\mathrm{orig}} \cdot \mathrm{pass}_{\mathrm{orig}}(p) + w_{\mathrm{adv}} \cdot \mathrm{pass}_{\mathrm{adv}}(p)$, where the weights $w_{\mathrm{orig}}, w_{\mathrm{adv}}$ prioritize passing original and adversarial tests (Ye et al., 19 May 2025).
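The filter-then-rank protocol can be sketched as follows. The staging (build filter, then weighted scoring over original and adversarial pass rates) follows the description above, but the callables and the weight values are assumptions for illustration, not values from any cited paper.

```python
def validate_and_score(patches, compiles, orig_pass_rate, adv_pass_rate,
                       w_orig=0.7, w_adv=0.3):
    """Filter candidate patches, then rank survivors by weighted score.

    compiles, orig_pass_rate, adv_pass_rate are callables standing in
    for the build system, the original test suite, and an adversarial
    test suite; each pass-rate callable returns a value in [0, 1].
    """
    survivors = [p for p in patches if compiles(p)]           # stage 1: build
    scored = [(w_orig * orig_pass_rate(p) + w_adv * adv_pass_rate(p), p)
              for p in survivors]                             # stage 2: score
    scored.sort(key=lambda sp: sp[0], reverse=True)           # stage 3: rank
    return [p for _, p in scored]
```

A reviewer agent or LLM-as-a-Judge stage would then consume this ranked list, assessing the top candidates for relevance and non-regression before submission.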
5. Evaluation Benchmarks and Performance Metrics
RepairAgent effectiveness is benchmarked via community-recognized datasets:
- Tasks: Defects4J (Java), SWE-bench and SWE-bench-Lite (Python), AGENTISSUE-BENCH (agent systems), XCodeEval (Ruby), RTL-Repair (hardware) (Bouzenia et al., 2024, Chen et al., 2024, Rahardja et al., 27 May 2025, Akbarpour et al., 6 Nov 2025, Luo et al., 25 Nov 2025).
- Metrics: pass@k, Top@n, plausible and correct fix rate, localization accuracy (file/function), regression reduction (RR), resource consumption (tokens, wallclock time), and semantic equivalence to ground truth (Rondon et al., 13 Jan 2025, Rahardja et al., 27 May 2025, Nashid et al., 14 Nov 2025).
- Reported Results: For example, RAMP achieves 67% pass@1 on XCodeEval Ruby, MarsCode Agent attains a 34% solve rate and 88.3% file-level localization accuracy on SWE-bench Lite, and AIR attains 87.1% on SWE-Verified (Akbarpour et al., 6 Nov 2025, Liu et al., 2024, Kaliutau, 9 Dec 2025).
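The pass@k metric reported above is conventionally computed with the unbiased combinatorial estimator: given n generated candidates of which c are correct, it estimates the probability that at least one of k sampled candidates is correct.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total candidates generated, c: candidates that are correct,
    k: number of candidates sampled without replacement.
    """
    if n - c < k:
        return 1.0  # every size-k sample must contain a correct candidate
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to the plain success fraction c / n, which is why single-attempt solve rates and pass@1 coincide.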
Table 2: Selected Published Solve Rates (pass@1)
| Agent/System | Benchmark | Solve Rate |
|---|---|---|
| RepairAgent | Defects4J | 164/835 (19.6%) |
| MarsCode Agent | SWE-bench Lite (Python) | 34.0% |
| RAMP | XCodeEval (Ruby, subset) | 67.0% |
| AIR | SWE-Verified | 87.1% |
Agentic systems often outperform prompt-only and single-agent baselines; multi-agent partitioning, history-aware context, and explicit feedback loops are the major contributors to these gains.
6. Insights, Limitations, and Domain Generality
Behavioral analyses highlight several recurring themes in RepairAgent research:
- Positive motifs: Alternating cycles of fix generation, test execution, and exploration are diagnostic of successful repairs (Bouzenia et al., 23 Jun 2025).
- Failure modes: Unproductive loops (repetitive patch-test cycles), thought–action misalignment, and premature termination without validation are strongly failure-correlated (Bouzenia et al., 23 Jun 2025, Nashid et al., 14 Nov 2025).
- Practicality and extensibility: Lightweight agents (e.g., RAMP, CodeCureAgent) demonstrate rapid convergence and extensibility to new languages with minimal retuning (Akbarpour et al., 6 Nov 2025, Joos et al., 15 Sep 2025).
- Domain coverage and hard cases: RepairAgent success varies by bug type (e.g., easier for single-file or wrong-answer bugs, harder for time/memory resource limits, multi-hunk, or agent-system issues) (Akbarpour et al., 6 Nov 2025, Nashid et al., 14 Nov 2025, Rahardja et al., 27 May 2025).
7. Open Challenges and Future Directions
Strategic priorities for future RepairAgent research include:
- Improving test and oracle reliability: High false-negative rates in self-generated tests slow convergence but can be mitigated by more precise test generation, self-critique, or hybrid oracles (Akbarpour et al., 6 Nov 2025).
- Scaling to complex project structures: Extensions beyond single-file tasks require cross-file harnesses, architectural reasoning, and potentially more advanced retrieval or RL-based navigation (Kaliutau, 9 Dec 2025, Shi et al., 2 Nov 2025).
- Semantic and symbolic integration: Causal context management (DTGs), agentic review, and neuro-symbolic loops are emerging routes for enhanced generality and trustworthiness (Kaliutau, 9 Dec 2025, Maddila et al., 24 Jul 2025).
- Real-world deployment: Realistic production deployments demand cost-aware iteration control, model-agnostic toolsets, robust reviewer integration, and human-in-the-loop patch vetting (Maddila et al., 24 Jul 2025, Kaliutau, 9 Dec 2025).
- Agent-system self-repair: Repairing LLM-agent systems themselves is notably harder due to volatile external resources, semantic complexity, and nondeterminism of LLM outputs—current resolution rates are markedly lower than for traditional APR (Rahardja et al., 27 May 2025).
In summary, RepairAgent frameworks operationalize automated debugging and patching as a sequence (or coordination) of reasoning, action, and validation steps, grounded in dynamic feedback and refined by explicit agentic structures. These systems demonstrate accelerating effectiveness on challenging software repair tasks and are at the core of next-generation autonomous software maintenance pipelines (Bouzenia et al., 2024, Chen et al., 2024, Kaliutau, 9 Dec 2025, Akbarpour et al., 6 Nov 2025).