
Agentic Software Issue Resolution

Updated 31 December 2025
  • Agentic software issue resolution is a paradigm that employs autonomous agents, powered by LLMs and multi-agent collaboration, to localize, repair, and validate software issues.
  • It utilizes sequential, decision-driven workflows combined with iterative test validations and graph-based reasoning to overcome local optima in code maintenance.
  • Emerging frameworks integrate techniques like competitive debate, reinforcement learning, and formal verification to enhance reliability and scalability in automated patching.

Agentic software issue resolution refers to the application of autonomous agents, typically powered by LLMs and tool integrations, to automatically localize, repair, and validate fixes for real-world software repository issues (e.g., bugs and feature requests). This paradigm represents a shift towards sequential, decision-driven, feedback-centric workflows in software maintenance—contrasting with earlier, single-pass code generation methods. State-of-the-art frameworks organize agents to perform long-horizon reasoning, iterative exploration of codebases, and multi-stage verification, often leveraging methods such as multi-agent collaboration, specification inference, Monte Carlo Tree Search (MCTS), and competitive debate protocols. The field is characterized by rich taxonomies of issue types and agent errors, rigorous process-centric evaluation, and emerging best practices for scalability and trustworthiness.

1. Formal Problem Definition and Core Workflows

Agentic issue resolution formalizes software maintenance as a sequential decision process over code repositories, represented by a Markov Decision Process (MDP) (Jiang et al., 24 Dec 2025). The state space comprises the repository snapshot, natural-language issue description, and test status: $s = (C, D, T)$. The agent action space includes code navigation, reading, editing, test invocation, and patch submission. Transitions are defined by tool effects and codebase changes, with rewards assigned for outcome-based patch validation (all tests pass) or process-based metrics (correct localization, plausible patch, incremental test successes). The objective is to learn a policy $\pi(a \mid s)$ maximizing the expected return $J(\pi)$ with discount factor $\gamma$.
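
The MDP framing can be made concrete with a minimal sketch; the type names and tool actions below are illustrative assumptions, not an API prescribed by the cited work:

```python
from dataclasses import dataclass
from enum import Enum, auto

@dataclass(frozen=True)
class State:
    """s = (C, D, T): repository snapshot, issue description, test status."""
    repo_snapshot: str             # C: e.g., a commit hash or tree digest
    issue_description: str         # D: natural-language issue text
    test_status: tuple             # T: per-test pass/fail outcomes

class Action(Enum):
    """Tool-level action space available to the agent."""
    NAVIGATE = auto()   # move within the repository structure
    READ = auto()       # open a file or code region
    EDIT = auto()       # apply a code change
    RUN_TESTS = auto()  # invoke the test suite
    SUBMIT = auto()     # submit the candidate patch

def reward(state: State, submitted: bool) -> float:
    """Outcome-based reward: 1.0 only if all tests pass at submission.
    Process-based shaping (correct localization, incremental test
    successes) would add intermediate terms here."""
    return 1.0 if submitted and all(state.test_status) else 0.0
```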

Canonical workflows proceed through: repository preprocessing (structure or static graph extraction); issue localization; patch repair; automated patch validation (test suites); and patch selection. Advanced frameworks integrate iterative refinement, multi-agent voting, explainability generation, and agentic reinforcement learning—enabling deeper reasoning and robustness compared to prior prompt-based approaches.
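
As a schematic of this workflow (a sketch only: the stage functions are caller-supplied placeholders, not a real framework API), the pipeline with its feedback loop might look like:

```python
from typing import Any, Callable

def resolve_issue(
    repo: Any,
    issue: str,
    localize: Callable,   # (repo, issue) -> suspect files/functions
    repair: Callable,     # (repo, issue, locations) -> candidate patch
    validate: Callable,   # (repo, patch) -> (all_passed, test_report)
    max_rounds: int = 3,
) -> list:
    """Schematic pipeline: localize -> repair -> validate, with
    iterative, feedback-driven refinement; all candidates are kept
    so a final patch-selection stage (e.g., voting) can choose."""
    candidates = []
    for _ in range(max_rounds):
        locations = localize(repo, issue)
        patch = repair(repo, issue, locations)
        passed, report = validate(repo, patch)
        candidates.append((patch, passed, report))
        if passed:
            break
        # Feedback-centric step: fold the failure report back into context.
        issue = f"{issue}\n\nPrior attempt failed:\n{report}"
    return candidates
```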

2. Framework Architectures and Representative Systems

Recent agentic frameworks are distinguished by their orchestration strategies, multi-agent pipelines, and integration with analysis tools:

  • SWE-Debate adopts a competitive multi-agent debate structure. Dependency-graph builders generate fault propagation chains, agents engage in three rounds of competitive reasoning (chain voting, plan proposal, plan synthesis), and fixes are generated via MCTS seeded by the consensus plan. This protocol enables agents to overcome local optima and collaboratively converge on repository-level solutions (Li et al., 31 Jul 2025).
  • SWE-Exp introduces persistent experience banks: agent repair trajectories (successes/failures) are distilled, embedded, and retrieved to inform subsequent agent reasoning, yielding strategic reuse across issues aligned by semantic similarity (Chen et al., 31 Jul 2025).
  • TDFlow decomposes repair into specialized sub-agents (test generation, patch proposing, debugging, revision), explicitly separating concerns and mitigating context overload while maximizing pass rates under test-driven constraints (Han et al., 27 Oct 2025).
  • OpenHands, SWE-Agent, AutoCodeRover, and Mini SWE Agent embody archetypal orchestrations: ReAct-style loops, multi-agent delegation, phase-based rigid pipelines, and minimal shell-driven cycles. Energy efficiency studies show that architecture strongly determines computational overhead, with passive, repetitive looping or unfiltered context feeds leading to substantial waste when used with small LMs (Tripathy et al., 10 Dec 2025).
  • Process-centric analysis (Graphectory) formalizes each agent trajectory as a directed multigraph connecting actions, context levels, and observed outcomes, enabling metrics-based evaluation of reasoning depth, exploration, and inefficiencies (Liu et al., 2 Dec 2025); a sketch of this encoding follows the list.
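
To illustrate the process-centric view from the last bullet, an agent trajectory can be encoded as a directed multigraph and summarized with simple structural metrics. This is a toy sketch using networkx; the actual Graphectory schema (context levels, observation annotations) is richer:

```python
import networkx as nx

# Toy agent trajectory: a sequence of (action, target) steps.
trajectory = [
    ("search", "repo_root"), ("read", "utils.py"),
    ("edit", "utils.py"), ("run_tests", "test_utils"),
    ("read", "utils.py"),   # revisiting a file creates a loop in the graph
    ("edit", "utils.py"), ("run_tests", "test_utils"),
]

g = nx.MultiDiGraph()
prev = None
for action, target in trajectory:
    node = f"{action}:{target}"
    g.add_node(node, action=action)
    if prev is not None:
        g.add_edge(prev, node)   # parallel edges record repeated transitions
    prev = node

# Simple process-centric metrics in the spirit of Graphectory:
print("nodes:", g.number_of_nodes())   # distinct action/context states
print("edges:", g.number_of_edges())   # total transitions taken
loops = list(nx.simple_cycles(nx.DiGraph(g)))
print("loops:", loops)                 # repeated exploration patterns
```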

3. Issue Localization and Error Taxonomies

Localization is central to effective agentic resolution. Methods range from single-pass embedding retrieval and reranking (SweRankMulti) to agentic, iterative “search–reason–reformulate–aggregate” loops (SweRankAgent, sketched at the end of this section), which offer superior results in multilingual and complex codebases (Reddy et al., 23 Dec 2025). Issue types in agentic systems span the following categories:

| Category | Example Subtypes | Prevalence |
|----------|------------------|------------|
| LLM Operation Issues | Model access configuration, token misconfiguration | 31.8% |
| Tool Issues | Dependency, implementation, misuse | 18.4% |
| Utility Issues | Logging/UI, configuration, dependencies | 20.9% |
| Memory Issues | Initialization, content, dependency errors | 14.4% |
| Workflow Issues | Deadlocks, infinite loops, step-order errors | 6.9% |
| LLM Provider Issues | Incompatibilities, unsupported models, parameters | 7.5% |

Benchmarks such as AGENTISSUE-BENCH rigorously capture these categories and expose low (<13%) correct resolution rates on LLM-based agent systems, especially for non-traditional fault types (LLM, memory, workflow) (Rahardja et al., 27 May 2025).

Agentic trace error taxonomies partition failures into reasoning errors (hallucinations, misinterpretation, decision, output), system execution errors (configuration, API, resource), and planning/coordination errors (context, resource, task management) (Deshpande et al., 13 May 2025). Structured traces permit fine-grained evaluation beyond outcome-based tests, facilitating the development of self-debugging agents.
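
A minimal sketch of the iterative “search–reason–reformulate–aggregate” localization loop described above (the stage functions are hypothetical placeholders, not the SweRankAgent interface):

```python
from typing import Callable

def agentic_localize(
    issue: str,
    search: Callable,       # query -> candidate code locations
    reason: Callable,       # (issue, hits) -> [(location, relevance_score)]
    reformulate: Callable,  # (issue, hits) -> rewritten search query
    max_iters: int = 4,
    stop_score: float = 0.9,
) -> list:
    """Iterative localization: search, score candidates, reformulate the
    query, and aggregate the best score seen per location across iterations."""
    query, ranked = issue, {}
    for _ in range(max_iters):
        hits = search(query)
        for loc, score in reason(issue, hits):
            ranked[loc] = max(ranked.get(loc, 0.0), score)  # aggregate step
        if ranked and max(ranked.values()) >= stop_score:
            break  # confident localization found
        query = reformulate(issue, hits)
    return sorted(ranked.items(), key=lambda kv: -kv[1])
```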

4. Debate, Collaboration, and Experience-driven Reasoning

Multi-agent debate protocols and experience banking address central shortcomings in conventional agentic systems: local optima and the absence of knowledge transfer.

  • SWE-Debate’s competitive debate induces diversity in fault localization by generating and ranking multiple propagation traces over the code dependency graph. Subsequent rounds synthesize and refine fix plans before seeding MCTS exploration (Li et al., 31 Jul 2025).
  • SWE-Exp captures high-level diagnostic perspectives and patching patterns, leveraging them via embedding retrieval in subsequent repair attempts. Ablation studies demonstrate 3–6% absolute gains in state-of-the-art resolution rates, mainly through high-level comprehension reuse (Chen et al., 31 Jul 2025).
  • StepFly extends these agentic principles to incident management, using a DAG-structured troubleshooting guide, plugin extraction, and parallel execution to achieve high automation rates (Mao et al., 11 Oct 2025).

Competitive debate and experience integration enable agents to overcome local search traps, address distributed or cross-cutting faults, and adapt to structurally diverse codebases.
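
A minimal sketch of experience banking in the SWE-Exp spirit, assuming a caller-supplied text-embedding function (this is not the paper's exact design):

```python
import numpy as np

class ExperienceBank:
    """Stores distilled repair experiences with embeddings and retrieves
    the most similar ones for a new issue via cosine similarity."""

    def __init__(self, embed):
        self.embed = embed  # text -> 1-D numpy vector (any sentence encoder)
        self.entries = []   # list of (experience_text, unit_vector)

    def add(self, experience: str) -> None:
        v = np.asarray(self.embed(experience), dtype=float)
        self.entries.append((experience, v / np.linalg.norm(v)))

    def retrieve(self, issue: str, k: int = 3) -> list:
        q = np.asarray(self.embed(issue), dtype=float)
        q = q / np.linalg.norm(q)
        scored = sorted(self.entries, key=lambda e: -float(q @ e[1]))
        return [text for text, _ in scored[:k]]  # top-k prior experiences
```

Retrieved experiences are then prepended to the repair agent's context, which is where the reported comprehension-reuse gains arise.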

5. Verification, Validation, and Trustworthiness

With the rapid growth of automated patching, agentic systems increasingly integrate formal and statistical verification modules:

  • Agents synthesize formal specifications via intent inference (e.g., pre/post conditions inferred from natural-language issue descriptions) (Roychoudhury, 24 Aug 2025).
  • Multi-tiered verification: pass rates, specification conformance, and mutation scores provide layered assurance (Roychoudhury, 24 Aug 2025).
  • LLM-as-a-Judge modules filter patches against semantic or code-style violations before human review, achieving up to 86.7% precision on patch rejection (Maddila et al., 24 Jul 2025).
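
The tiers above compose naturally into a verification gate; a sketch with caller-supplied checkers (the threshold and stage functions are illustrative assumptions):

```python
from typing import Callable, Tuple

def verify_patch(
    patch: str,
    run_tests: Callable,       # patch -> bool (all tests pass)
    check_spec: Callable,      # patch -> bool (meets inferred pre/post conditions)
    mutation_score: Callable,  # patch -> float in [0, 1]
    judge: Callable,           # patch -> bool (LLM-as-a-judge accepts)
    min_mutation: float = 0.6,
) -> Tuple[bool, str]:
    """Multi-tiered gate: a patch must clear test execution, specification
    conformance, mutation analysis, and an LLM-as-a-judge filter before
    it is forwarded to human review."""
    if not run_tests(patch):
        return False, "failing tests"
    if not check_spec(patch):
        return False, "violates inferred specification"
    if mutation_score(patch) < min_mutation:
        return False, "weak mutation score: tests under-constrain the fix"
    if not judge(patch):
        return False, "rejected by LLM-as-a-judge (semantic/style violation)"
    return True, "passed all tiers; forward to human review"
```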

Trust frameworks track patch success rates and post-deployment defect rates, and switch deployment modes (autonomous vs. assisted) based on running confidence scores (a minimal sketch follows). Transparent reasoning (intent summaries, AST differencing, explanation generation) is emphasized for team adoption.
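
A toy sketch of confidence-driven mode switching (the window size and threshold are invented for illustration):

```python
def deployment_mode(
    recent_outcomes: list,             # booleans: did each recent patch hold up?
    window: int = 50,
    autonomous_threshold: float = 0.85,
) -> str:
    """Run autonomously only while the running success rate over the most
    recent patches stays above a confidence threshold; otherwise fall
    back to assisted (human-in-the-loop) mode."""
    recent = recent_outcomes[-window:]
    if not recent:
        return "assisted"  # no track record yet: stay conservative
    confidence = sum(recent) / len(recent)
    return "autonomous" if confidence >= autonomous_threshold else "assisted"
```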

6. Evaluation Benchmarks, Metrics, and Empirical Findings

Agentic resolution is empirically benchmarked using curated datasets (SWE-bench Verified, AGENTISSUE-BENCH, StepFly DAGs, GITS-Eval, TRAIL traces) and metrics:

  • Pass@k: ratio of issues resolved within k patches.
  • Localization accuracy: file and function-level edit correspondence.
  • Process-centric: node counts, edge counts, loop lengths in agent trajectories (Liu et al., 2 Dec 2025).
  • Non-functional: cyclomatic complexity, code duplication, code smells, reliability/security metrics (Chen et al., 2024).
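
Under the Pass@k definition above, the metric reduces to an "any of the first k candidates validated" count; a minimal sketch (unbiased estimators exist when patches are sampled, but the simple reading below matches the definition given here):

```python
def pass_at_k(results, k: int) -> float:
    """results[i][j] is True if the j-th candidate patch for issue i
    passed validation; Pass@k is the fraction of issues resolved by
    at least one of the first k candidates."""
    resolved = sum(any(per_issue[:k]) for per_issue in results)
    return resolved / len(results)

# Example: 3 issues with up to 3 candidate patches each.
runs = [[False, True, False], [False, False, False], [True]]
print(pass_at_k(runs, 1))  # 0.333...: only issue 3 is resolved on the first try
print(pass_at_k(runs, 2))  # 0.666...: issue 1 is resolved by its second patch
```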

Findings indicate that no single agent dominates in resolution rate, and different agents often solve complementary subsets of benchmark issues. Over-modification and redundant exploration remain significant risks, especially in complex codebases. Process-centric metrics correlate with issue complexity and agent success, implicating exploration depth and validation thoroughness (Liu et al., 2 Dec 2025). High pass rates (e.g., TDFlow’s 94.3% under gold tests) depend critically on test quality rather than agent creativity (Han et al., 27 Oct 2025).

7. Limitations, Open Challenges, and Future Directions

Current limitations include local solution traps, missing dynamic (runtime) dependencies, context loss in long codebases, insufficient modeling of agent memory and conversational state, and non-determinism in LLM outputs. Several research directions follow from these gaps.

The field is transitioning toward agentic systems with persistent memory, competitive collaboration, self-debugging, and continuous verification, as well as scalable best practices for benchmark curation, auditability, and domain adaptation (Jiang et al., 24 Dec 2025, Chen et al., 2024).
