SE-Agent: Autonomous Software Engineering

Updated 9 May 2026
  • SE-Agent is an LLM-driven autonomous system that automates software engineering tasks such as bug fixes, feature additions, and code reviews.
  • It employs a multi-round workflow using file edits, shell commands, and test executions to validate changes within source repositories.
  • Empirical benchmarks and studies highlight challenges in localization accuracy and agent-specific issues, underscoring the need for robust multi-agent protocols.

A Software Engineering Agent (SE-Agent) is an autonomous system—typically orchestrated by LLMs—designed to perform complete software engineering tasks such as bug fixing, feature implementation, or code review without human intervention. SE-Agents operate by invoking a repertoire of environment actions (e.g., file edits, shell commands, tool invocations, test execution) under symbolic control of LLMs, interfacing directly with source repositories and development environments. This framework, emerging as a key paradigm in SE 3.0, subsumes single-agent and multi-agent architectures and is widely studied as both an automation and augmentation tool for modern software engineering workflows (Rahardja et al., 27 May 2025).

1. Formal Definition, Core Architecture, and Exemplars

An SE-Agent is formally defined as an LLM-based autonomous assistant with the express purpose of resolving user-reported software issues—typically bug fixes or feature requests—via a sequence of actions over a software project (Rahardja et al., 27 May 2025). The typical operational interface includes mechanisms for codebase navigation, search, edit, build/test, and validation.

Prominent SE-Agent frameworks include Agentless, AutoCodeRover, and SWE-agent (Rahardja et al., 27 May 2025). Each encapsulates a multi-round workflow:

  • Issue is described as natural language input.
  • Agent navigates the repository, identifies a candidate fix or feature addition, edits files, and applies changes.
  • Candidate solution is tested by running the project’s test suite.
  • Confirmed solutions are committed as patches or pull requests.

Agents interface with platforms via abstractions such as “Agent–Computer Interface,” which mediate tool and repository APIs. The architecture is typically LLM-driven, with tool-calling abilities delegated to plugins or internal modules, and the execution loop is autonomous unless manually interrupted.
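
The multi-round workflow described above can be sketched as a small control loop. This is a minimal illustration rather than the implementation of any named framework: the in-memory `Repo` mapping, `resolve_issue`, `stub_policy`, and `stub_tests` are all hypothetical names, and the LLM call is replaced by a stub so the example runs without external services.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Repo = Dict[str, str]  # hypothetical in-memory repository: path -> file contents


@dataclass
class Attempt:
    round_index: int
    patch: Repo          # files the agent rewrote in this round
    tests_passed: bool


def resolve_issue(
    issue: str,
    repo: Repo,
    propose_patch: Callable[[str, Repo, List[Attempt]], Repo],
    run_tests: Callable[[Repo], bool],
    max_rounds: int = 3,
) -> Tuple[Optional[Repo], List[Attempt]]:
    """Propose a patch, apply it to a copy of the repo, test it, and repeat."""
    history: List[Attempt] = []
    for round_index in range(max_rounds):
        patch = propose_patch(issue, repo, history)   # navigate + edit (stubbed)
        candidate = {**repo, **patch}                 # apply changes to a copy
        passed = run_tests(candidate)                 # run the test suite
        history.append(Attempt(round_index, patch, passed))
        if passed:
            return patch, history                     # ready to submit as a patch
    return None, history                              # round budget exhausted


def stub_policy(issue: str, repo: Repo, history: List[Attempt]) -> Repo:
    """Stand-in for the LLM: wrong on the first round, right on the second."""
    if not history:
        return {"calc.py": "def add(a, b):\n    return a * b\n"}
    return {"calc.py": "def add(a, b):\n    return a + b\n"}


def stub_tests(repo: Repo) -> bool:
    """Stand-in for the project's test suite."""
    namespace: dict = {}
    exec(repo["calc.py"], namespace)
    return namespace["add"](2, 3) == 5
```

Passing the attempt history back into the policy is one simple way to let a failed test run inform the next round; real agents typically feed back richer signals such as test logs and tool outputs.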

2. Agent-Issue Taxonomy, Benchmarking, and Evaluation

A critical line of research examines the failure modes that arise when SE-Agents are deployed to maintain LLM-driven agent systems, a “dual SE” challenge. An empirical study (Rahardja et al., 27 May 2025) constructed a dataset of 201 real-world issues from 16 actively maintained agent repositories (e.g., MetaGPT, AutoGen, GPT-engineer) and derived a 6-class taxonomy:

  1. Incompatibility with LLM Providers: dependency mismatches, unsupported models, API parameter inaccuracies.
  2. Tool-Related: errors in external or internal tools, misuse of tool interfaces.
  3. Memory-Related: faulty workspace or DB resets, missing contents, memory module bugs.
  4. LLM Operation: API misconfigurations, token-limit errors, context truncation, prompt handling.
  5. Workflow Issues: deadlocks, unexpected control flow.
  6. Utility Issues: errors in non-LLM components (UI, logging), dependency/circular-import issues.

To benchmark real-world difficulty, AgentIssue-Bench was developed as a reproducible library of 50 tasks, each packaged in a Docker container with the buggy commit, a descriptive issue report, a failure-triggering test, the developer patch, and a full environment capture (Rahardja et al., 27 May 2025). The composition reflects the original 201-issue distribution. Metrics for evaluation include:

  • File/function-level localization accuracy (does the agent patch the same locations as the ground-truth patch?).
  • Plausible resolution rate: $R_\text{plausible} = \frac{N_{\text{tests pass}}}{N_{\text{total}}}\times 100\%$
  • Correct resolution rate: $R_\text{correct} = \frac{N_{\text{semantic match}}}{N_{\text{total}}}\times 100\%$
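
The two resolution rates are plain ratios over the number of benchmark tasks, which the following sketch makes concrete. The `TaskOutcome` record and `resolution_rates` function are hypothetical names introduced for illustration; how a "semantic match" with the developer patch is judged is outside the scope of the sketch and is simply taken as a boolean input.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class TaskOutcome:
    task_id: str
    tests_pass: bool       # patch makes the failure-triggering test pass
    semantic_match: bool   # patch judged equivalent to the developer patch


def resolution_rates(outcomes: Iterable[TaskOutcome]) -> Tuple[float, float]:
    """Return (plausible %, correct %) over all tasks."""
    outcomes = list(outcomes)
    total = len(outcomes)
    if total == 0:
        raise ValueError("at least one task outcome is required")
    n_plausible = sum(o.tests_pass for o in outcomes)
    n_correct = sum(o.semantic_match for o in outcomes)
    return 100.0 * n_plausible / total, 100.0 * n_correct / total
```

For example, with 50 tasks of which 6 patches pass the tests and 3 of those match the developer patch, the function returns a plausible rate of 12% and a correct rate of 6%. The gap between the two numbers is the share of patches that pass tests without actually fixing the issue as the developers did.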

State-of-the-art agents achieved only 3.33%–12.67% correct resolution, with file-level localization rates below 26% (Rahardja et al., 27 May 2025). This performance is sharply lower than on traditional software benchmarks (SWE-bench, 23.2%–50.8%), signifying unique and unresolved challenges endemic to agentic systems.

3. Multi-Agent Protocols and Structuring Principles

Multi-agent SE-Agents—those coordinating teams of LLMs with explicit roles (e.g., coder, reviewer, auditor)—require principled protocols for robust operation (Mao et al., 14 Oct 2025). The SEMAP methodology introduces a protocol-driven framework grounded in the following structuring principles:

  • Behavioral Contracts: Each agent’s role specifies preconditions (input artifacts) and postconditions (expected outputs), checked before and after invocation. For example, the “Coder” contract is

$C_{\text{Coder}} = \bigl(\text{Coder},\;\{\texttt{task\_spec},\;\texttt{plan}\},\;\{\texttt{code\_artifact}\}\bigr).$

  • Structured Messaging: All inter-agent communication is typed, schema-validated, and partitioned (e.g., TaskAssignment, TaskResult, ReviewRequest) as JSON payloads, enabling modularity and theoretical correctness guarantees.
  • Lifecycle-Guided Execution: Agent teams operate as finite state machines, with explicit transitions (e.g., Implementing → Reviewing → Completed/Failed) triggered by verifiers at each phase.
  • Embedded Verification: Automatic and external verifiers test for syntax, semantic correctness, coverage, and review completeness at each transition point. Any failure triggers an explicit repair loop.

Empirical studies report up to a 69.6% reduction in failure counts for function-level development and up to 47.4% for vulnerability detection compared to baseline multi-agent protocols (Mao et al., 14 Oct 2025).
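
Three of these principles (behavioral contracts, lifecycle-guided execution, and embedded verification) can be illustrated together in a short sketch. It is not the SEMAP implementation: `Contract`, `invoke`, `run_lifecycle`, `syntax_verifier`, and the `max_repairs` budget are hypothetical names and choices, the agents are plain callables standing in for LLM calls, and structured messaging is left out to keep the example small. Only the Coder contract fields and the state names come from the text above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, FrozenSet, Tuple

Artifacts = Dict[str, str]  # artifact name -> content


@dataclass(frozen=True)
class Contract:
    role: str
    preconditions: FrozenSet[str]   # artifacts required before invocation
    postconditions: FrozenSet[str]  # artifacts the agent must produce


CODER = Contract("Coder", frozenset({"task_spec", "plan"}), frozenset({"code_artifact"}))


class ContractViolation(Exception):
    pass


def invoke(contract: Contract, agent: Callable[[Artifacts], Artifacts],
           artifacts: Artifacts) -> Artifacts:
    """Check preconditions, run the agent, then check postconditions."""
    missing = contract.preconditions - set(artifacts)
    if missing:
        raise ContractViolation(f"{contract.role} is missing inputs: {sorted(missing)}")
    produced = agent(artifacts)
    absent = contract.postconditions - set(produced)
    if absent:
        raise ContractViolation(f"{contract.role} did not produce: {sorted(absent)}")
    return {**artifacts, **produced}


class State(Enum):
    IMPLEMENTING = "Implementing"
    REVIEWING = "Reviewing"
    COMPLETED = "Completed"
    FAILED = "Failed"


def syntax_verifier(code: str) -> bool:
    """Embedded verifier: accept only code that parses as Python."""
    try:
        compile(code, "<code_artifact>", "exec")
    except SyntaxError:
        return False
    return True


def run_lifecycle(coder: Callable[[Artifacts], Artifacts],
                  verifier: Callable[[str], bool],
                  reviewer: Callable[[str], bool],
                  artifacts: Artifacts,
                  max_repairs: int = 2) -> Tuple[State, Artifacts]:
    """Drive Implementing -> Reviewing -> Completed/Failed with repair loops."""
    state, repairs = State.IMPLEMENTING, 0
    while state not in (State.COMPLETED, State.FAILED):
        if state is State.IMPLEMENTING:
            artifacts = invoke(CODER, coder, artifacts)
            if verifier(artifacts["code_artifact"]):
                state = State.REVIEWING
            elif repairs < max_repairs:
                repairs += 1                  # stay in Implementing and retry
            else:
                state = State.FAILED
        else:  # State.REVIEWING
            if reviewer(artifacts["code_artifact"]):
                state = State.COMPLETED
            elif repairs < max_repairs:
                repairs += 1
                state = State.IMPLEMENTING    # review failure loops back for repair
            else:
                state = State.FAILED
    return state, artifacts
```

Bounding the repair loop with a budget is a design choice of this sketch; it guarantees termination when an agent keeps producing artifacts that fail verification.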

4. Self-Evolving and Stepwise Optimization Paradigms

Contemporary research has introduced self-evolutionary paradigms to optimize SE-Agent reasoning and action trajectories. In “SE-Agent: Self-Evolution Trajectory Optimization” (Lin et al., 4 Aug 2025), the agent’s multi-step problem-solving process is explicitly modeled as a pool of trajectories $\mathcal{T}_0 \to \mathcal{T}_1 \to \cdots$, evolving through:

  • Revision: Each candidate trajectory is revised via reflection and targeted mutation to fix identified weaknesses.
  • Recombination: High-performing trajectory segments are recombined (crossover, transfer, restructuring) to form new solution hypotheses, leveraging cross-trajectory inspiration.
  • Refinement: Multi-dimensional rewards score all candidates, and a selection operator retains the most diverse and high-value solutions.

Unlike classical Monte Carlo Tree Search, this paradigm expands the search space combinatorially and harnesses cross-trajectory variation, achieving up to 55% relative improvement in first-pass resolution on SWE-bench Verified (Lin et al., 4 Aug 2025). Ablations reveal that all three evolutionary stages (revision, recombination, and refinement) are crucial for optimal performance.
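
The revise, recombine, and refine cycle can be illustrated on a toy problem. The sketch below is an assumption-heavy simplification and not the method of Lin et al.: trajectories are fixed-length tuples of action names, revision is a random single-step mutation instead of LLM reflection, recombination is one-point crossover, the reward is a single caller-supplied scalar rather than a multi-dimensional score, and diversity is approximated by discarding duplicates. All function names and the `ACTIONS` vocabulary are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Trajectory = Tuple[str, ...]
ACTIONS = ("search", "open", "edit", "test")  # hypothetical action vocabulary


def revise(traj: Trajectory, rng: random.Random) -> Trajectory:
    """Revision: mutate one step (stand-in for reflection-guided repair)."""
    i = rng.randrange(len(traj))
    return traj[:i] + (rng.choice(ACTIONS),) + traj[i + 1:]


def recombine(a: Trajectory, b: Trajectory, rng: random.Random) -> Trajectory:
    """Recombination: one-point crossover of two equal-length trajectories."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def refine(candidates: List[Trajectory],
           reward: Callable[[Trajectory], float],
           keep: int) -> List[Trajectory]:
    """Refinement: drop duplicates, then keep the highest-reward candidates."""
    unique = list(dict.fromkeys(candidates))  # order-preserving de-duplication
    return sorted(unique, key=reward, reverse=True)[:keep]


def evolve(pool: List[Trajectory],
           reward: Callable[[Trajectory], float],
           generations: int = 5,
           seed: int = 0) -> List[Trajectory]:
    """Evolve the pool T_0 -> T_1 -> ... for a fixed number of generations."""
    rng = random.Random(seed)
    keep = len(pool)
    for _ in range(generations):
        revised = [revise(t, rng) for t in pool]
        partners = pool[1:] + pool[:1]          # pair each trajectory with a neighbour
        recombined = [recombine(a, b, rng) for a, b in zip(pool, partners)]
        pool = refine(pool + revised + recombined, reward, keep)
    return pool
```

Because the current pool competes with its own offspring during refinement, the best reward in the pool never decreases from one generation to the next in this sketch.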

5. Industrial Evidence, Trust, and Human-Agent Collaboration

Empirical studies using large-scale datasets such as AIDev (Li et al., 20 Jul 2025) demonstrate that in industrial contexts, SE-Agents function as “AI teammates” (autonomous coding agents) able to independently create, review, and submit pull requests. These agents (e.g., OpenAI Codex, Devin, Copilot, Cursor, Claude Code) share persistent memory, planning, and handoff mechanisms. Across 456,535 PRs in 61,000 repositories, agentic PRs accelerate submission rates dramatically—one case recorded a developer submitting as many agentic PRs in three days as in the previous three years.

Key metrics from AIDev include:

  • Acceptance Rate: Humans 76.8%, OpenAI Codex 65.3%, Devin 48.9%, Copilot 38.2%, Cursor 51.4%, Claude Code 52.5%.
  • Resolution Time: 50% of Copilot jobs complete within 12.8 min and 75% within 18.5 min; human baselines are roughly 5 days for major issues.
  • Code Complexity: Agentic PRs modify complexity less frequently (9.1%) compared to humans (23.3%).

A “trust and utility gap” persists: despite strong synthetic benchmark results, agent-generated code is accepted less frequently in real-world settings, often due to code style, maintainability, and project norms unmodeled by agents (Li et al., 20 Jul 2025). These findings highlight the importance of integrating accountability, style enforcement, and context-aware modeling in SE-Agent design.

6. Challenges, Limitations, and Research Directions

SE-Agents face major obstacles when addressing agent-specific issue classes:

  • Poor localization accuracy (file-level <26%, function-level <19%; Rahardja et al., 27 May 2025).
  • Low correct resolution for agent-specific failures (e.g., LLM provider incompatibility, memory mechanism faults, complex operation errors).
  • Overfitting to conventional “utility” bugs while failing to resolve deeper system-level failures.
  • LLM nondeterminism and unstable external service interfaces complicate reproducible evaluation.

Recommended research directions include:

  • Incorporation of explicit agent system models (e.g., memory schema, invocation protocols).
  • Enhanced prompt and workflow engineering.
  • Integration of dynamic traces (tool outputs, memory mutations) and agent debug logs for improved localization and repair strategies.
  • Continuous expansion of agent-specific benchmarks (e.g., AgentIssue-Bench) and synthesis of multi-agent debugging scenarios.
  • Hybridization of static analysis (type checks, interface contracts) with LLM-based reasoning, as well as test-case generation tailored to agent workflows (Rahardja et al., 27 May 2025).

A plausible implication is that future SE-Agents will require not only larger and more aligned LLMs but also deeper integration of software architectural knowledge, structured collaboration protocols, and transparent, auditable decision-making.

7. Design Patterns, Quality Attributes, and Methodological Taxonomy

Recent meta-analyses (Cai et al., 11 Nov 2025) identify core design patterns and rationales in LLM-based SE multi-agent systems. The most common patterns are:

  • Role-Based Cooperation (46.8%): specialized agents (planner, coder, tester) coordinated by a team leader.
  • Self-Reflection (36.2%): each agent critiques and refines its own output, often with iterative LLM loops.
  • Retrieval-Augmented Generation (10.6%) and Tool-Agent Registry (14.9%) support contextual grounding and controlled action execution.

Functional Suitability (especially correctness), Performance Efficiency, and Maintainability emerge as the dominant quality attributes guiding design. The most frequent rationale is Improving the Quality of Generated Code, followed by Simulating Human Processes of SE. Empirical evidence shows that applying these design principles can yield up to 30% improvements in test-pass rates and more than 2× reductions in error rates.

Methodologically, these findings are supported by systematic reviews covering 94 source papers, using ISO/IEC 25010 quality frameworks and detailed pattern coding (Cai et al., 11 Nov 2025).


In summary, the field of SE-Agents is marked by complex system integration, rigorously benchmarked empirical evaluation, evolving multi-agent protocols, and emerging best practices for robust, trustworthy, and efficient software engineering automation. Addressing the technical and socio-technical challenges of SE-Agent deployment will require convergence of advances in LLM alignment, system-level modeling, protocol engineering, and human-agent collaboration (Rahardja et al., 27 May 2025, Mao et al., 14 Oct 2025, Li et al., 20 Jul 2025, Lin et al., 4 Aug 2025, Cai et al., 11 Nov 2025).
