Agent-Based Program Repair
- Agent-Based Program Repair is a modular approach that uses specialized agents to localize faults, synthesize patches, and validate fixes iteratively.
- Multi-agent systems coordinate distinct roles such as reasoning, test generation, and patch validation to improve repair precision and efficiency.
- Integrating symbolic, semantic, and historical signals reduces regression risks and enhances repair accuracy across diverse codebases and benchmarks.
Agent-Based Program Repair encompasses a family of approaches in which autonomous or collaborative software agents, typically powered by LLMs, orchestrate the end-to-end task of localizing, synthesizing, and validating program repairs. Unlike traditional automated program repair (APR) techniques that rely on hardcoded control loops or single-shot neural patch generation, agent-based APR dynamically plans and executes a diverse sequence of tool invocations, leverages explicit reasoning and feedback from execution traces or human-derived signals, and increasingly structures repair as a multi-agent or modular pipeline. These systems now set the state of the art on repair benchmarks in industry and academia, scaling from competitive-programming bugs to repository-scale, multi-hunk defects.
1. Core Architectures of Agent-Based Program Repair
Agent-based APR frameworks are typically constructed as modular pipelines, where each agent specializes in a distinct aspect of the repair process, and agents interact through well-defined interfaces. Prominent designs include both single-LLM “autonomous” agents and multi-agent systems, with canonical roles such as:
- Reasoning/Reflection Agent: Performs fault localization and error hypothesis generation, optionally leveraging adversarial critique or repository history (Ye et al., 19 May 2025, Akbarpour et al., 6 Nov 2025, Shi et al., 2 Nov 2025).
- Test Generation Agent: Synthesizes targeted or adversarial test suites to expose faults and filter overfitting patches (Ye et al., 19 May 2025, Akbarpour et al., 6 Nov 2025).
- Patch/Programmer Agent: Produces candidate repairs using prompting schemes such as chain-of-thought or structure-aware reasoning (Akbarpour et al., 6 Nov 2025, Bouzenia et al., 2024).
- Execution/Validation Agent: Applies patches, rebuilds, and validates against regression and correctness tests (Bouzenia et al., 2024, Nashid et al., 14 Nov 2025).
- Context or Navigation Agents: Traverse code repositories or data-flow graphs, incorporating techniques such as Data Transformation Graphs (DTGs), code property graphs (CPGs), or Maple MCP servers (Kaliutau, 9 Dec 2025, Nashid et al., 14 Nov 2025).
- Selector/Judge Agents: Rank, select, or filter candidate patches based on plausibility, test coverage, or alignment with inferable intents (Ye et al., 19 May 2025, Kaliutau, 9 Dec 2025, Cheng et al., 27 Jan 2026).
Control logic is typically realized via a ReAct-style loop (Thought → Action → Observation...) or a finite-state machine (FSM). Agents either share conversational history and memory buffers or communicate through structured message-passing (e.g., JSON fields encapsulating context, traces, candidate solutions) (Akbarpour et al., 6 Nov 2025, Bouzenia et al., 2024).
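To make the control flow concrete, below is a minimal sketch of such a ReAct-style loop with structured JSON message-passing. The tool names, message fields, and the `llm` placeholder are illustrative assumptions, not the interface of any cited system:

```python
import json
from dataclasses import dataclass, field

def llm(prompt: str) -> str:
    """Placeholder for a chat-model call; expected to return a JSON-encoded action."""
    raise NotImplementedError

# Illustrative tool registry; real agents expose file search, editors, test runners, etc.
TOOLS = {
    "read_file": lambda path: open(path).read(),
    "run_tests": lambda cmd: "stub: test-runner output",
}

@dataclass
class AgentState:
    history: list = field(default_factory=list)  # shared conversational memory

def react_loop(state: AgentState, bug_report: str, max_steps: int = 20) -> None:
    observation = bug_report
    for _ in range(max_steps):
        # Thought -> Action: the LLM sees the shared history plus the latest
        # observation and replies with a structured message such as
        # {"thought": "...", "action": "run_tests", "args": {"cmd": "pytest -x"}}
        prompt = json.dumps({"history": state.history, "observation": observation})
        msg = json.loads(llm(prompt))
        state.history.append(msg)
        if msg["action"] == "finish":  # terminal state of the FSM
            return
        # Action -> Observation: execute the chosen tool and feed the result back
        observation = TOOLS[msg["action"]](**msg["args"])
```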
2. Iterative, Feedback-Driven Repair Loops
At the center of agent-based APR is an iterative repair loop. Writing $h_t$ for the current failure hypothesis, $T_t$ for the evolving test suite, $p_t$ for the candidate patch, and $o_t$ for validation feedback at iteration $t$, the loop may be formalized schematically as:

$$
h_{t+1} = \mathrm{Reflect}(o_t),\qquad
T_{t+1} = T_t \cup \mathrm{GenTests}(h_{t+1}),\qquad
p_{t+1} = \mathrm{Repair}(h_{t+1}, T_{t+1}),\qquad
o_{t+1} = \mathrm{Validate}(p_{t+1}, T_{t+1})
$$

iterating until $o_{t+1}$ signals success or the iteration budget is exhausted. Here, agents iteratively:
- Reflect or hypothesize about observed failures (e.g., based on execution traces/diagnostics).
- Generate or update a targeted test suite, balancing basic and edge-case coverage.
- Synthesize a candidate repair.
- Validate the patch against the generated and/or benchmark test suites.
- Update guidance (reflection) or context based on failed attempts.
The loop continues for a bounded number of iterations or until an early-stop criterion is met (e.g., the current candidate passes all tests, achieving pass@1). Evidence shows most systems converge in fewer than five iterations on standard competitive-programming-style tasks (Akbarpour et al., 6 Nov 2025).
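A minimal sketch of this bounded loop follows, mirroring the formalization above. The `reflect`, `gen_tests`, `repair`, and `validate` helpers are hypothetical wrappers around the corresponding agents, assumed here for illustration:

```python
from typing import List, Optional, Tuple

# Hypothetical agent wrappers; in a real system each would call an LLM or executor.
def reflect(feedback: str) -> str: ...
def gen_tests(hypothesis: str) -> List[str]: ...
def repair(hypothesis: str, tests: List[str]) -> str: ...
def validate(patch: str, tests: List[str]) -> Tuple[bool, str]: ...

MAX_ITERS = 5  # most reported systems converge within five iterations

def repair_loop(bug_report: str, benchmark_tests: List[str]) -> Optional[str]:
    """One bounded reflect -> test -> patch -> validate loop over a single bug."""
    feedback, tests = bug_report, list(benchmark_tests)
    for _ in range(MAX_ITERS):
        hypothesis = reflect(feedback)             # hypothesize about the failure
        tests += gen_tests(hypothesis)             # add basic + edge-case coverage
        patch = repair(hypothesis, tests)          # synthesize a candidate fix
        passed, feedback = validate(patch, tests)  # run generated + benchmark suites
        if passed:                                 # early stop once everything passes
            return patch
    return None  # iteration budget exhausted without a plausible patch
```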
3. Agent Coordination and Modularity
Multi-agent decomposition enables several architectural advantages:
- Modularity and Extensibility: Each agent (reflection, test generation, repair, validation) is independently replaceable or tunable. For example, “RAMP” achieves language-agnostic operation by swapping language-specific executors and test formats (Akbarpour et al., 6 Nov 2025).
- Complementary Heuristics: Agents can combine different historical or semantic heuristics to improve repair performance on diverse bug types, e.g., combining file-level diffs with blame-derived function context (Shi et al., 2 Nov 2025).
- Adversarial Reasoning: Systems like AdverIntent-Agent systematically instantiate multiple intent hypotheses, then adversarially generate test sets to maximize the probability of capturing developer intent, yielding increased repair and fault localization precision (Ye et al., 19 May 2025).
- Adaptive Control: Frameworks such as SIADAFIX employ fast/slow “thinking” agents, dynamically selecting among workflows (single-round, iterative, selector-driven) based on pre-classified issue complexity (Cao et al., 17 Oct 2025).
This distributed design supports rapid adaptation to new programming languages, benchmarks, and repair settings.
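As an illustration of this modularity, the sketch below isolates the language-specific executor behind a small interface so that the remaining agents stay language-agnostic. The `Executor` protocol and the test commands are assumptions in the spirit of RAMP, not its published interfaces:

```python
import subprocess
from typing import Protocol, Tuple

class Executor(Protocol):
    """The only language-specific component: runs the project's test suite."""
    def run_tests(self, workdir: str) -> Tuple[bool, str]: ...

class RubyExecutor:
    def run_tests(self, workdir: str) -> Tuple[bool, str]:
        # Hypothetical entry point; the actual test format may differ per system.
        p = subprocess.run(["ruby", "run_tests.rb"], cwd=workdir,
                           capture_output=True, text=True)
        return p.returncode == 0, p.stdout + p.stderr

class PythonExecutor:
    def run_tests(self, workdir: str) -> Tuple[bool, str]:
        p = subprocess.run(["pytest", "-x"], cwd=workdir,
                           capture_output=True, text=True)
        return p.returncode == 0, p.stdout + p.stderr

def validate_patch(executor: Executor, workdir: str) -> Tuple[bool, str]:
    # Reflection, test-generation, and patch agents call this without knowing
    # which language they are repairing; swapping the executor retargets the pipeline.
    return executor.run_tests(workdir)
```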
4. Integration of Symbolic, Semantic, and Historical Signals
Recent agents integrate symbolic reasoning (e.g., static analysis, coverage data, blame history) with LLM-based planning:
- Neuro-Symbolic Loops: Agents incorporate static analysis, test execution, and symbolic code or data-flow invariants as first-class feedback to the LLM's reasoning process, an approach empirically shown to increase solve rates in enterprise and production environments (Maddila et al., 24 Jul 2025).
- History-Aware Repair: HAFixAgent injects blame-derived context (e.g., last-modified commit, function-level diff) into the repair prompt, resulting in over 200% improvement in plausible fixes versus baseline agentic systems on Defects4J, with no significant step-count or cost penalty (Shi et al., 2 Nov 2025).
- Semantic and Dataflow Context: Autonomous Issue Resolver (AIR) replaces standard code property graphs with Data Transformation Graphs (DTGs), modeling data states as nodes and repairs as graph-editing tasks. This yields an 87.1% resolution rate on SWE-bench Verified, substantially outperforming CPG- or file-centric agentic baselines (Kaliutau, 9 Dec 2025).
- Multi-Hunk Reasoning: Studies confirm that multi-hunk, multi-file bugs benefit from semantic reasoning, repository-level context via AST/MCP servers, and explicit modeling of edit/regression trade-offs (Nashid et al., 14 Nov 2025, Rondon et al., 13 Jan 2025).
These integrations constrain the space of candidate edits, reduce regression, and improve agent convergence, especially on large or distributed codebases.
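As a concrete instance of history injection, here is a minimal sketch in the spirit of HAFixAgent's blame-derived context, assuming a local git checkout; the prompt layout is an illustrative assumption:

```python
import subprocess

def blame_context(repo: str, path: str, line: int) -> dict:
    """Collect history signals for a suspicious line: last-touching commit and its diff."""
    out = subprocess.run(
        ["git", "blame", "-L", f"{line},{line}", "--porcelain", path],
        cwd=repo, capture_output=True, text=True, check=True).stdout
    commit = out.split()[0]  # porcelain format: first token is the commit hash
    diff = subprocess.run(
        ["git", "show", "--format=%s", commit, "--", path],
        cwd=repo, capture_output=True, text=True, check=True).stdout
    return {"commit": commit, "diff": diff}

def history_prompt(bug_report: str, ctx: dict) -> str:
    # Inject blame-derived signals alongside the failing-test diagnostics,
    # analogous to last-modified commits and function-level diffs in HAFixAgent.
    return (f"Bug: {bug_report}\n"
            f"Last commit touching the suspicious line: {ctx['commit']}\n"
            f"Historical diff of that file:\n{ctx['diff']}")
```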
5. Test and Patch Generation Paradigms
Agent-based systems innovate in test generation and patch curation:
- On-the-Fly Test Generation: Agents actively synthesize targeted input/output test suites (including edge and performance cases) in each iteration. Ablations demonstrate a nearly 20-point improvement in pass@1 from test generation and self-reflection on Ruby APR tasks (Akbarpour et al., 6 Nov 2025).
- Adversarial and Intent-Linked Tests: Agents generate adversarial test oracles to distinguish between multiple hypothesized program intents, filtering out overfitting and ambiguous patches (Ye et al., 19 May 2025).
- Patch Selection Algorithms: Once multiple candidate patches are synthesized, agents employ selectors that prioritize test-enriched patches (e.g., those including reproduction tests alongside the fix), using lexicographic ranking over test presence and patch complexity (Cheng et al., 27 Jan 2026). Such strategies increase reviewer confidence and yield a better trade-off between plausible fixes and test coverage.
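A minimal sketch of such a lexicographic selector: test-enriched patches rank first, with ties broken by patch complexity. The `Patch` fields and the complexity proxy are assumptions for illustration, not the cited system's schema:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Patch:
    diff: str
    has_reproduction_test: bool  # does the candidate bundle a failing-then-passing test?
    passed_validation: bool      # did it pass regression + generated suites?

def complexity(p: Patch) -> int:
    """Proxy for patch complexity: number of changed lines in the diff."""
    return sum(1 for ln in p.diff.splitlines() if ln.startswith(("+", "-")))

def select(candidates: List[Patch]) -> Optional[Patch]:
    plausible = [p for p in candidates if p.passed_validation]
    if not plausible:
        return None
    # Lexicographic ranking: prefer test-enriched patches, then simpler diffs.
    return min(plausible, key=lambda p: (not p.has_reproduction_test, complexity(p)))
```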
6. Benchmarks, Evaluation, and Empirical Results
Empirical assessment of agent-based APR spans open-source, academic, and enterprise environments, employing metrics such as pass@1, plausible/valid patch rates, regression rates, token/step cost, and reviewer acceptance:
| System/Agent | Benchmark | Domain | Pass@1 / Solve Rate | Regression Rate | Data/Other Notes |
|---|---|---|---|---|---|
| RAMP | XCodeEval | Ruby | 67.0% | N/A | Converges by iter. 5; no fine-tuning; ablating test generation or reflection costs >18 points |
| HAFixAgent | Defects4J v3.0.1 | Java | Up to +212% over RepairAgent | N/A | Comparable step count/cost to baseline; history adds unique fixes; file-level diff preferred |
| SemAgent | SWEBench-Lite | Python | 44.66% | N/A | Outperforms other workflow-based systems, excels at multi-line/edge-case bugs |
| AIR | SWE-bench Verified | Multi | 87.1% | Low (see text) | Uses DTG, multi-agent, RL controller |
| SIADAFIX | SWEBench-Lite | Python | 60.67% | N/A | State-of-the-art open-source agent, adaptive fast/slow workflow |
| Multi-hunk agentic study | Hunk4J | Java | 25.8–93.3% | -1.34 to +2.47 | Codex/Claude Code excel at semantic consistency and regression minimization |
| Passerine | GITS-Eval (Google) | Mixed (5) | 73% (machine bugs) | N/A | Empirical solve/valid rates on large enterprise codebase |
| Cogeneration (BRT+fix) | Internal (Google) | Multi | 34% joint rate | N/A | Freeform cogeneration matches or outperforms fix/BRT only for plausible patches |
Metrics and detailed breakdowns consistently show agent-based designs outperform non-agentic or single-shot LLM variants, particularly when feedback loops, multi-agent roles, and symbolic or semantic context are leveraged (Akbarpour et al., 6 Nov 2025, Shi et al., 2 Nov 2025, Kaliutau, 9 Dec 2025).
7. Future Directions, Challenges, and Limitations
Contemporary research identifies several directions and persistent limitations:
- Scaling Across Languages and Ecosystems: Language-agnostic design (as in RAMP or AIR) generalizes agentic repair to under-studied or polyglot codebases, but tools for C++/Kotlin/etc. remain in early stages (Akbarpour et al., 6 Nov 2025, Kaliutau, 9 Dec 2025).
- History and Trace Integration: While history-augmented agents (e.g., HAFixAgent) are highly effective, handling multi-commit or distributed bugs and scaling contextual injection remain ongoing challenges (Shi et al., 2 Nov 2025).
- Efficient Sampling and Patch Ranking: Trajectory and patch selection sampling is critical for both efficiency and success rates. Freeform cogeneration and test-aware selection strategies are recommended for maximizing plausible joint fixes and test artifacts (Cheng et al., 27 Jan 2026).
- Regression and Overfitting Control: Managing the overfitting of patches to incomplete or adversarial test suites is addressed via adversarial test oracles, multi-intent reasoning, and reviewer/judge agents, but manual validation remains necessary at scale (Ye et al., 19 May 2025, Maddila et al., 24 Jul 2025).
- Human Factors and Trust: Agents that synthesize both fixes and reproduction tests, or provide explanations and supporting test oracles, yield higher reviewer trust and greater rates of code landing in production (Maddila et al., 24 Jul 2025, Cheng et al., 27 Jan 2026).
- Limitations: Real-world effectiveness is sensitive to accurate fault localization, language/toolchain support, context window size, and LLM variability/non-determinism. Specific systems may require significant engineering adaptation for new environments (Shi et al., 2 Nov 2025, Kaliutau, 9 Dec 2025).
Agent-based program repair continues to expand in capability, efficiency, and adoption, with current systems pushing toward zero-touch, scalable, semantically robust repair across diverse software ecosystems.