LLM-Based Agents in Software Engineering

Updated 1 December 2025
  • LLM-based agents are systems that integrate language models with planning, memory, and tool invocation to handle multi-step software engineering tasks.
  • They employ role-based cooperation, self-reflection, and modular pipelines to decompose and execute processes across the entire software development lifecycle.
  • Research identifies challenges in requirement grounding, memory management, and coordination overhead, offering insights for future advancements in automated software engineering.

LLM-based agents in software engineering are autonomous or semi-autonomous systems that integrate one or more LLMs with explicit modules for planning, memory, tool invocation, and perception, enabling them to perform complex, multi-step SE tasks across the software development lifecycle (SDLC). These agents extend the capabilities of standalone LLMs beyond mere prompt-response interactions by maintaining state, decomposing tasks, reasoning about tool outputs, invoking external resources, and—via modular or multi-agent orchestration—addressing requirements, implementation, testing, documentation, and maintenance. Their architectures range from simple single-agent pipelines to sophisticated multi-agent platforms aligned with agile roles, employing reflection, tool integration, hierarchical memory, and human-in-the-loop oversight. This article surveys foundational definitions, agent architectures, operational mechanisms, evaluation methodologies, practical deployments, and emerging research directions, providing a comprehensive technical reference on LLM-based agents in software engineering.

1. Core Definitions and Formal Models

An LLM-based agent in software engineering is formally defined as a tuple comprising at minimum: a perception module to encode environmental observations as LLM-ready context, a memory module (semantic, episodic, procedural) to persist relevant information and serve as intermediate context, and an action policy that may trigger both internal reasoning and external tool invocations to maximize a task-specific utility function (e.g., pass rate, requirement satisfaction) (Wang et al., 13 Sep 2024). The agent’s policy is typically parameterized as $\pi_\theta : \mathcal{I} \times \mathcal{M} \rightarrow \mathcal{A}$, where $\mathcal{I}$ is the input embedding, $\mathcal{M}$ is memory, and $\mathcal{A}$ is the set of possible actions.
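
A minimal sketch of this tuple in Python, with illustrative names for the perception, memory, and policy components (none are drawn from a specific cited framework):

```python
# Sketch of the agent tuple: perception, memory, and an action policy
# pi_theta : I x M -> A. All names here are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Memory:
    semantic: List[str] = field(default_factory=list)    # facts about the codebase
    episodic: List[str] = field(default_factory=list)    # past observations/actions
    procedural: List[str] = field(default_factory=list)  # learned SOPs / workflows

@dataclass
class Agent:
    perceive: Callable[[str], str]        # encodes raw observations as LLM-ready context
    policy: Callable[[str, Memory], str]  # pi_theta: maps (input, memory) to an action
    memory: Memory = field(default_factory=Memory)

    def step(self, observation: str) -> str:
        context = self.perceive(observation)
        action = self.policy(context, self.memory)
        self.memory.episodic.append(f"{context} -> {action}")  # persist the transition
        return action
```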

In multi-agent systems, additional orchestration logic governs role allocation, communication topology (centralized, decentralized, hierarchical, or nested), and collaborative/competitive paradigms (He et al., 7 Apr 2024, Cai et al., 11 Nov 2025). An LLM-based MAS can be abstracted as a set of such agent tuples organized under an orchestration platform that manages state transitions, message routing, and coordination, often employing reflection and “rethink” steps after every action.
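
As a rough illustration, this orchestration layer reduces to a message router plus a per-action reflection step. The sketch below assumes agents modeled as plain prompt-to-response callables and a centralized (star) topology; the "rethink" prompt is a stand-in for framework-specific reflection logic:

```python
# Hypothetical centralized orchestrator with a "rethink" step after
# every action. Agents are plain callables (prompt -> response).
from typing import Callable, Dict, List, Tuple

AgentFn = Callable[[str], str]

class Orchestrator:
    def __init__(self, agents: Dict[str, AgentFn]):
        self.agents = agents                       # role name -> agent (star topology)
        self.log: List[Tuple[str, str, str]] = []  # shared state / message log

    def dispatch(self, role: str, message: str) -> str:
        action = self.agents[role](message)
        # Reflection: ask the same agent to critique its own action.
        critique = self.agents[role](f"Rethink and critique: {action}")
        self.log.append((role, action, critique))
        return action
```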

Agent types can be grouped as:

  • Single-agent: Monolithic pipelines (plan–generate–validate).
  • Multi-agent: Modular pipelines with agents for planning, coding, testing, reviewing, etc.; typically assigned explicit SE roles.

LLM-based agents must satisfy criteria for autonomy: internal planning, persistent memory, multi-turn interaction, tool integration, decision-making, option selection, and adaptive behavior (Jin et al., 5 Aug 2024).

2. Agent Architectures and Workflow Taxonomy

LLM-based agent architectures in SE range from elementary prompt-based systems to highly modular, multi-agent frameworks. The principal design patterns differ chiefly in their orchestration logic, which can be:

  • Pipelined (e.g., Waterfall, Agile): Stages are executed in sequence, with artifacts handed off deterministically (requirements → design → code → test → review) (Lin et al., 23 Mar 2024, Tawosi et al., 3 Oct 2025).
  • Debate/Consensus: Parallel agents propose solutions, then coordinate via voting or critique to select the best outcome (Cai et al., 11 Nov 2025); see the sketch after this list.
  • Layered/Hierarchical: Agents are deployed or escalated dynamically based on task complexity or failure recovery requirements (Cai et al., 11 Nov 2025).
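
In its simplest form, the debate/consensus pattern reduces to sampling parallel proposals and voting. The following sketch assumes exact-match voting over agent outputs, a deliberate simplification of critique-based coordination:

```python
# Minimal debate/consensus sketch: parallel proposals, naive majority
# vote. Exact-match voting is an illustrative simplification.
from collections import Counter
from typing import Callable, List

def debate(agents: List[Callable[[str], str]], task: str) -> str:
    proposals = [agent(task) for agent in agents]   # parallel proposals
    winner, _ = Counter(proposals).most_common(1)[0]
    return winner
```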

Representative frameworks include ChatDev, MetaGPT, ALMAS, HULA, FlowGen, and RefAgent, many of which implement role-based and self-reflective patterns (He et al., 7 Apr 2024, Lin et al., 23 Mar 2024, Takerngsaksiri et al., 19 Nov 2024, Tawosi et al., 3 Oct 2025, Oueslati et al., 5 Nov 2025).

3. Operational Mechanisms: Planning, Memory, Reasoning, and Tool Use

Planning and Decomposition: Agents decompose high-level goals into sub-tasks using chain-of-thought, structured prompts, or learned SOPs, then map sub-tasks to specialized agents (or self-steps in single-agent systems) (Wang et al., 13 Sep 2024, Tawosi et al., 3 Oct 2025, Lin et al., 23 Mar 2024). Planning can be single-path (linear) or multi-path (branch, merge, filter).
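
A single-path (linear) plan-then-execute loop might look like the following sketch, where the planner output format (`role: body` sub-task strings) and the worker roles are assumptions for illustration:

```python
# Sketch of single-path planning: a planner decomposes a goal into
# sub-tasks, each routed to a specialized worker agent.
from typing import Callable, Dict, List

def plan_and_execute(planner: Callable[[str], List[str]],
                     workers: Dict[str, Callable[[str], str]],
                     goal: str) -> List[str]:
    results: List[str] = []
    for subtask in planner(goal):            # e.g., ["design: ...", "code: ...", "test: ..."]
        role, _, body = subtask.partition(": ")
        results.append(workers[role](body))  # map sub-task to its specialist agent
    return results
```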

Memory Augmentation: Agents persist semantic, episodic, and procedural information across steps, combining in-context buffers with external stores and dynamic summarization to retain state beyond the context window (Wang et al., 13 Sep 2024, Qiu et al., 17 Nov 2025).
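
A hedged sketch of one such scheme: a bounded episodic buffer whose oldest entries are folded into a semantic store via an LLM summarizer (the summarizer interface and window size are assumptions):

```python
# Sketch of memory augmentation under a context budget: recent events
# stay verbatim; older ones are compressed into the semantic store.
from typing import Callable, List

class HierarchicalMemory:
    def __init__(self, summarize: Callable[[List[str]], str], window: int = 20):
        self.summarize = summarize
        self.window = window
        self.episodic: List[str] = []   # raw recent events
        self.semantic: List[str] = []   # compressed long-range knowledge

    def record(self, event: str) -> None:
        self.episodic.append(event)
        if len(self.episodic) > self.window:
            # Fold the oldest half of the buffer into one summary entry.
            half = self.window // 2
            old, self.episodic = self.episodic[:half], self.episodic[half:]
            self.semantic.append(self.summarize(old))

    def context(self) -> str:
        return "\n".join(self.semantic + self.episodic)
```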

Tool Use and Feedback: LLM-based agents invoke tools via structured APIs (e.g., file edit, run tests, compilation, search) and process the feedback (test results, compiler output, log messages) to guide subsequent reasoning. Tool-invoking agents can operate in stateless or memoryful ReAct-style cycles (Qiu et al., 17 Nov 2025, Xia et al., 1 Jul 2024, Team et al., 31 Jul 2025).
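
A minimal memoryful ReAct-style cycle can be sketched as below; the `TOOL:`/`FINAL:` reply protocol and the tool names are illustrative assumptions, not any framework's actual API:

```python
# Sketch of a ReAct-style loop: the LLM emits a tool call or a final
# answer; tool feedback is appended to context for the next turn.
from typing import Callable, Dict

def react_loop(llm: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]],
               task: str, max_steps: int = 10) -> str:
    context = task
    for _ in range(max_steps):
        reply = llm(context)
        if reply.startswith("FINAL:"):
            return reply.removeprefix("FINAL:").strip()
        if reply.startswith("TOOL:"):
            _, name, arg = reply.split(" ", 2)  # assumed format: "TOOL: <name> <arg>"
            observation = tools[name](arg)      # run tests, edit file, search, ...
            context += f"\n{reply}\nOBSERVATION: {observation}"
    return "budget exhausted"
```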

Self-Reflection and Verification: Agents, especially in multi-agent systems, perform iterative self-evaluation (run/generate/refine) by integrating test failures, code review critiques, or formal verification outputs as feedback for plan updating or patch refinement (Oueslati et al., 5 Nov 2025, Cai et al., 11 Nov 2025).
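
In its simplest single-agent form, this run/generate/refine cycle looks like the sketch below, where `run_tests` and the prompt templates are hypothetical:

```python
# Sketch of the run/generate/refine cycle: generate a patch, run the
# test suite, and feed failures back as the reflection signal.
from typing import Callable, Tuple

def refine_until_green(llm: Callable[[str], str],
                       run_tests: Callable[[str], Tuple[bool, str]],
                       spec: str, max_iters: int = 5) -> str:
    patch = llm(f"Implement: {spec}")
    for _ in range(max_iters):
        passed, report = run_tests(patch)
        if passed:
            return patch
        # Reflection step: include the failure report as feedback.
        patch = llm(f"Spec: {spec}\nPatch: {patch}\nFailures: {report}\nRevise the patch.")
    return patch  # best effort after the budget is spent
```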

Anti-pattern Detection: Research on agent action trajectories reveals common failure modes—repetitive actions without follow-up, absence of intermediate test validation, and premature termination—which can be mitigated by architectural constraints, adaptive prompts, and explicit plan verification (Bouzenia et al., 23 Jun 2025).
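
Two of these failure modes can be checked mechanically over an action trajectory; the toy detector below assumes string-labeled actions such as `run_tests`:

```python
# Toy detector for two trajectory anti-patterns: repeated actions
# without follow-up, and no intermediate test validation.
from typing import List

def flag_anti_patterns(trajectory: List[str], repeat_limit: int = 3) -> List[str]:
    issues, streak = [], 1
    for prev, cur in zip(trajectory, trajectory[1:]):
        streak = streak + 1 if cur == prev else 1
        if streak == repeat_limit:
            issues.append(f"repeated action without follow-up: {cur}")
    if "run_tests" not in trajectory:
        issues.append("no intermediate test validation")
    return issues
```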

4. Application Domains and Benchmarking

LLM-based agents are deployed across a spectrum of SE tasks:

| SE Task | Example Agent System | Key Evaluation Metric |
| --- | --- | --- |
| Code Generation | CodeAgent, ALMAS | Pass@k, Correctness, Syntactic Validity |
| Program Repair & Issue Fixing | RepairAgent, Trae Agent | Pass@1, Correct Patch Rate, Test Coverage |
| Refactoring | RefAgent | Code Smell Reduction, Test Pass Rate |
| Requirements Engineering | MARE, HULA | Precision, Recall, Task Fulfillment |
| Full SDLC Automation | ChatDev, MetaGPT, ALMAS | Fulfillment Rate, End-to-End Pass Rate |
| Project Management/Agile | CogniSim/CognitiveAgent | Backlog Reduction, Delivery Quality |

Benchmarking frameworks include the SWE-bench family (Lite, Verified), E2EDevBench, AGENTISSUE-BENCH, CodeAgentBench, LoCoBench-Agent, and various HumanEval/MBPP-derived datasets for function-level tasks (Xia et al., 1 Jul 2024, Zeng et al., 6 Nov 2025, Rahardja et al., 27 May 2025, Qiu et al., 17 Nov 2025, Lin et al., 23 Mar 2024).

Advanced evaluation involves not only test-pass rates and syntactic correctness but also functional coverage across original and agent-created tests, LLM-based requirement verification, code-smell density via static analysis, and cross-session memory consistency (Qiu et al., 17 Nov 2025, Oueslati et al., 5 Nov 2025, Zeng et al., 6 Nov 2025).
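
For reference, Pass@k is usually computed with the standard unbiased estimator over n samples per task of which c pass:

```python
# Unbiased pass@k estimator: pass@k = 1 - C(n-c, k) / C(n, k),
# where n samples per task were drawn and c of them passed.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # fewer failures than k: some sampled set must contain a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples, 30 correct -> pass@10
print(pass_at_k(200, 30, 10))  # ~0.81
```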

5. Empirical Results, Architectural Tradeoffs, and Observed Best Practices

Recent large-scale studies indicate:

  • Multi-agent, role-divided pipelines outperform monolithic or single-turn approaches on complex SE tasks, yielding higher correctness (e.g., RefAgent's 90% Java unit-test pass rate and 52.5% code smell reduction (Oueslati et al., 5 Nov 2025)).
  • Autonomous systems typically solve 30–60% of non-trivial GitHub issue tasks at repository scale; bottlenecks include requirement comprehension, error propagation from upstream planning, and insufficient self-verification (Zeng et al., 6 Nov 2025, Xia et al., 1 Jul 2024, Team et al., 31 Jul 2025).
  • Simple, modular designs such as Agentless achieve comparable performance to stateful agents on benchmarks like SWE-bench Lite with far lower computational cost, when the task allows rigid decomposition (Xia et al., 1 Jul 2024).
  • Long-context benchmarks (LoCoBench-Agent) demonstrate that agent comprehension degrades minimally at very large context windows, yet significant comprehension–efficiency tradeoffs and memory retention challenges persist (Qiu et al., 17 Nov 2025).
  • Best practices include separating planning from execution, explicitly aligning agent “thoughts” with actions, maintaining experience/trajectory buffers for meta-learning, embedding critique or reflection after each iteration, and, where possible, hybridizing tool use with retrieval-augmented LLM prompting (Bouzenia et al., 23 Jun 2025, Oueslati et al., 5 Nov 2025).

6. Challenges, Limitations, and Future Directions

Major unresolved technical challenges include:

  • Requirement grounding: Failure rates are dominated by requirement omission and misinterpretation; enhanced requirement engineering modules, structured templates, and iterative coverage validation are needed (Zeng et al., 6 Nov 2025).
  • Memory and context: Context-window limits and weak multi-session retention hinder global reasoning and long-range dependency management. External memory, dynamic summarization, and hierarchical retrieval are open research areas (Tawosi et al., 3 Oct 2025, Qiu et al., 17 Nov 2025).
  • Agent coordination overhead: Multi-agent communication can incur $O(N^2)$ message scaling, introduce deadlocks, and drive up API and computation costs (Dong et al., 31 Jul 2025, Cai et al., 11 Nov 2025); see the scaling sketch after this list.
  • Tool/Environment reliability: SE agents are sensitive to environment nondeterminism, flaky LLM outputs, and fast-evolving toolchains/APIs (Rahardja et al., 27 May 2025).
  • Human–AI synergy and trust: Human-in-the-loop is essential for compliance and high-precision settings, but remains labor-intensive without robust role allocation frameworks (e.g., RACI), audit traces, and explainability (Ronanki, 7 May 2025, Takerngsaksiri et al., 19 Nov 2024).
  • Benchmarking gaps: Lack of unified, high-fidelity, large-scale benchmarks that cover end-to-end SE processes, including artifact traceability, performance, maintainability, and developer revision effort (Zeng et al., 6 Nov 2025, Cai et al., 11 Nov 2025).
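
The coordination-overhead point above can be made concrete with a back-of-envelope message count, comparing fully connected (pairwise) communication to a centralized hub:

```python
# Illustration of coordination overhead: pairwise (fully connected)
# communication grows as O(N^2); a centralized hub grows as O(N).
def pairwise_messages(n_agents: int) -> int:
    return n_agents * (n_agents - 1)  # every agent messages every other

def hub_messages(n_agents: int) -> int:
    return 2 * n_agents               # each agent talks only to the orchestrator

for n in (4, 8, 16):
    print(n, pairwise_messages(n), hub_messages(n))
# 4 -> 12 vs 8; 8 -> 56 vs 16; 16 -> 240 vs 32
```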

Research opportunities include developing dynamically adaptive agent topologies, integrating formal verification and safety constraints, scalable blackboard/tri-tier memory architectures, multi-modal perception (UI, diagram, code property graphs), and agent-centric qualification metrics that extend beyond simplistic pass@k measures (Dong et al., 31 Jul 2025, Guo et al., 10 Oct 2025, Takerngsaksiri et al., 19 Nov 2024).

7. Design Space, Best Practices, and Implications

Analyses of 94+ studies identify key architectural and workflow best practices (Cai et al., 11 Nov 2025):

  • Emphasize functional correctness and maintainability by rigorous specification adherence, role-based decomposition, and continuous validation.
  • Prefer modularity via role-based cooperation, formalized agent interfaces, and hierarchical/adapter patterns for extensibility.
  • Leverage self- and cross-reflection patterns to mitigate hallucinations and enforce semantic consistency.
  • Manage resource–quality tradeoffs through incremental querying, dynamic agent deployment (hierarchical coordination), and judicious activation of human critique.
  • Preserve adaptability via plug-and-play agent architectures, separated tool registries, and reuse of retrieval-augmented generation.
  • Address trust, security, and regulatory concerns by enforcing human accountability in sensitive steps, audit trails, and formal resource management policies.

The emerging consensus is that LLM-based multi-agent systems, when architected with careful modularity, role alignment, and robust memory/tooling backbones, offer a promising solution space for automating large swaths of the SE workflow—contingent upon advances in requirement engineering, context management, and robust, multi-dimensional evaluation (Cai et al., 11 Nov 2025, He et al., 7 Apr 2024).
