Live-SWE-Agent: AI-Driven Code Automation

Updated 31 March 2026

Live-SWE-Agent is an autonomous system integrating LLM-based reasoning, interactive tool interfaces, and multi-turn execution for real-time code repository analysis.
It employs advanced context augmentation, embedding-based retrieval, and ReAct-style workflows to efficiently navigate and synthesize multi-file codebases.
The system self-evolves its toolset and strategies to improve debugging, patch generation, and repository-level problem solving in dynamic environments.

A Live-SWE-Agent is an agentic software engineering system that autonomously addresses complex, multi-file, long-range reasoning tasks in real code repositories under “live” or continuously evolving settings. Such agents leverage LLMs, act through interactive tool interfaces, navigate large codebases, and synthesize structured outputs (e.g., bug fixes, repository-level answers, code edits) through multi-turn, feedback-driven execution loops. Live-SWE-Agents represent a new frontier in AI-powered automation for software engineering, integrating capabilities such as on-the-fly context augmentation, prompt/tool self-evolution, inference-time intervention, scalable infrastructure, and real-world deployment in dynamic environments (Peng et al., 18 Sep 2025, Xia et al., 17 Nov 2025, Wang et al., 9 Jun 2025, Yang et al., 27 Sep 2025, He et al., 1 Mar 2026).

1. System Architectures and Agentic Workflows

Live-SWE-Agent designs instantiate ReAct-style (reasoning and acting) frameworks, enabling iterative planning, code search, retrieval-augmented reasoning, and tool execution in a tight loop. Typical system architectures comprise:

Input and Initial Parsing: The agent receives a natural-language issue or query and parses it into initial “thoughts,” often including a decomposition into subgoals or information needs.
Repository Inspection and File Navigation: Agents use toolkits with commands for directory tree traversal, file viewing, pattern-based search, and function-level chunk retrieval (e.g., GetRepoStructure, ReadFile, SearchContent) (Peng et al., 18 Sep 2025, Yang et al., 2024).
Working Memory and Context Augmentation: The agent maintains a structured working memory $M$, growing with each action/observation. Memory is augmented through static function-chunk RAG (retrieval-augmented generation), overlapping sliding windows, and agentic embedding-based search, allowing evidence accumulation across multiple files and iterations. Embedding-based retrieval employs cosine similarity between query and code chunk embeddings: $s_j = \cos(E(Q), E(c_j))$ (Peng et al., 18 Sep 2025).
Multi-Hop Reasoning and Planning: Agents iteratively think about $M$ and the current query $Q$, select actions via an implicit policy $\pi(a'|\tau, M)$, and execute tool invocations to retrieve additional information, synthesize code, or apply patches.
Synthesis and Finalization: Upon gathering sufficient evidence or hitting a step bound, the agent synthesizes a final answer or patch with a detailed reasoning chain and citations to the supporting code (Peng et al., 18 Sep 2025).
Real-Time Self-Evolution: Advanced agents (e.g., Live-SWE-agent) can continuously extend their own action/tool space during execution by creating or modifying scripts and immediately integrating them into their workflow, driven by runtime reflection prompts (Xia et al., 17 Nov 2025).
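
The embedding-based retrieval step above, $s_j = \cos(E(Q), E(c_j))$, can be illustrated with a minimal Python sketch. The embedding function here is a toy bag-of-words stand-in, not the encoder used by any of the cited systems; the names embed, cosine, and top_k are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for E(.): a bag-of-words count vector (assumption)."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Score every chunk c_j against the query Q and keep the k best."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Hypothetical function-level chunks, written as plain text for the toy encoder.
chunks = [
    "def load config from yaml file",
    "def parse command line arguments",
    "def write config to yaml file",
]
results = top_k("load yaml config", chunks, k=2)
for score, chunk in results:
    print(f"{score:.3f}  {chunk}")
```

A real system would replace embed with a learned code encoder and store the chunk vectors in an index, but the ranking logic stays the same.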

A generalized agent loop can be formally represented as:

\begin{algorithm}[H]
\caption{SWE-QA-Agent Main Loop}
\label{alg:live_swe_agent}
\begin{algorithmic}[1]
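  \Require Query~$Q$, Repository~$R$
  \Ensure Answer~$A$
  \State $M \gets \emptyset$  \Comment{Working memory}
  \State $\tau \gets \textsc{Parse}(Q)$
  \State $M \gets M \cup \{\tau\}$
  \For{$t = 1, \dots, T_{\max}$}
    \State $\tau \gets \textsc{Think}(Q, M)$
    \State $a \sim \pi(\cdot \mid \tau, M)$
    \If{$a = \textsc{GetRepoStructure}$}
      \State $o \gets \textsc{GetRepoStructure}(R)$
    \ElsIf{$a = \textsc{ReadFile}$}
      \State $o \gets \textsc{ReadFile}(R)$
    \ElsIf{$a = \textsc{SearchContent}$}
      \State $o \gets \textsc{SearchContent}(R)$
    \ElsIf{$a = \textsc{Finish}$}
      \State \textbf{break}
    \EndIf
    \State $M \gets M \cup \{(a, o)\}$
    \If{$\textsc{Sufficient}(M, Q)$}
      \State \textbf{break}
    \EndIf
  \EndFor
  \State $A \gets \textsc{Synthesize}(Q, M)$
  \State \Return $A$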
\end{algorithmic}
\end{algorithm}

(Peng et al., 18 Sep 2025)
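
The same think-act-observe loop can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation from the cited work: run_agent, Memory, the scripted policy, and the in-memory repo dictionary are all hypothetical, and the three tool names are borrowed from the toolkit listed above.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Memory:
    """Working memory M: a growing list of (action, observation) pairs."""
    entries: list[tuple[str, str]] = field(default_factory=list)


def run_agent(
    query: str,
    tools: dict[str, Callable[[str], str]],
    policy: Callable[[str, Memory], tuple[str, str]],
    is_sufficient: Callable[[str, Memory], bool],
    synthesize: Callable[[str, Memory], str],
    max_steps: int = 10,
) -> str:
    memory = Memory()
    for _ in range(max_steps):  # step bound
        name, arg = policy(query, memory)  # choose the next action
        if name == "Finish":
            break
        observation = tools[name](arg)  # execute the tool
        memory.entries.append((f"{name}({arg})", observation))
        if is_sufficient(query, memory):  # early exit on enough evidence
            break
    return synthesize(query, memory)


# Toy repository and tools (assumptions for the demo only).
repo = {"app.py": "def main():\n    return 42\n"}
tools = {
    "GetRepoStructure": lambda _arg: "\n".join(sorted(repo)),
    "ReadFile": lambda path: repo.get(path, "<missing>"),
    "SearchContent": lambda pat: "\n".join(p for p, src in repo.items() if pat in src),
}


def scripted_policy(query: str, memory: Memory) -> tuple[str, str]:
    """Stand-in for the LLM policy: follows a fixed three-step plan."""
    plan = [("GetRepoStructure", ""), ("ReadFile", "app.py"), ("Finish", "")]
    return plan[min(len(memory.entries), len(plan) - 1)]


answer = run_agent(
    "What does main return?",
    tools,
    scripted_policy,
    is_sufficient=lambda q, m: False,
    synthesize=lambda q, m: f"{len(m.entries)} observations; last: {m.entries[-1][1]!r}",
)
print(answer)
```

In a real agent the policy and the sufficiency check would both be LLM calls conditioned on the query and the accumulated memory.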

2. Self-Evolution, Specialization, and Reflexivity

A distinguishing characteristic of recent Live-SWE-Agents is their ability to self-evolve not only the codebase they modify but their own agent scaffold at runtime (Xia et al., 17 Nov 2025). The critical elements are:

Dynamic Tool Synthesis: After each tool use, a reflection prompt asks the agent whether the historical trajectory suggests a new or modified tool would be useful. If so, the agent emits a script (e.g., Python file in a tools/ directory), which is instantly registered and becomes available for invocation in future steps.
Online Mutation/Selection: The agent's state $\theta$ (prompt, toolset) mutates by tool creation; only utilized tools persist across the current task. No offline retraining or scaffold search is required, unlike the Darwin-Gödel Machine (DGM) and related frameworks, so test-time adaptation adds no offline infrastructure cost.
Task-Specificity and Adaptation: Tool invention is often tailored to complex patterns detected during execution (e.g., custom code analyzers, binary parsers, AST matchers, cross-language extenders), exposing semantics beyond basic shell commands and greatly improving solve rates for hard tasks (Xia et al., 17 Nov 2025).
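
The tool-synthesis mechanism described above (emit a script into a tools directory, register it at once, keep only the tools that are actually used) can be sketched as follows. This is an illustrative sketch, not Live-SWE-agent's actual code: ToolRegistry, the run entry-point convention, and the example tool are assumptions. Executing model-written code this way is unsafe outside a sandbox.

```python
import importlib.util
import tempfile
from pathlib import Path
from typing import Callable


class ToolRegistry:
    """Hypothetical registry for scripts the agent emits at runtime."""

    def __init__(self, tools_dir: Path) -> None:
        self.tools_dir = tools_dir
        self.tools: dict[str, Callable[..., object]] = {}
        self.calls: dict[str, int] = {}

    def register(self, name: str, source: str) -> None:
        """Write the script to disk, import it, and expose its run() function."""
        path = self.tools_dir / f"{name}.py"
        path.write_text(source)
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        self.tools[name] = module.run  # assumed convention: each tool defines run()
        self.calls[name] = 0

    def invoke(self, name: str, *args: object) -> object:
        self.calls[name] += 1
        return self.tools[name](*args)

    def used_tools(self) -> list[str]:
        """Selection step: only tools that were invoked are kept."""
        return [name for name, count in self.calls.items() if count > 0]


# A script the agent might emit after reflecting on its trajectory (made up).
emitted = '''
import ast

def run(source):
    """Return the names of all functions defined in the given source."""
    tree = ast.parse(source)
    return [n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
'''

with tempfile.TemporaryDirectory() as tmp:
    registry = ToolRegistry(Path(tmp))
    registry.register("list_functions", emitted)
    registry.register("unused_tool", "def run():\n    return None\n")
    found = registry.invoke("list_functions", "def a(): pass\ndef b(): pass\n")
    kept = registry.used_tools()

print(sorted(found), kept)
```

The registry is scoped to a single task here, matching the description that created tools persist only across the current task.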

The impact of self-evolution is quantifiable. For instance, Live-SWE-agent with Claude 4.5 Sonnet achieves a solve rate of 75.4% on SWE-bench Verified and 45.8% on SWE-bench Pro, outperforming both mini-SWE-agent and state-of-the-art handcrafted baselines (Xia et al., 17 Nov 2025).

Agent	Backbone LLM	SWE-bench Verified solve rate (%)	SWE-bench Pro solve rate (%)
mini-SWE-agent	Claude 4.5 Sonnet	70.6	–
Live-SWE-agent	Claude 4.5 Sonnet	75.4	45.8

3. Embedding-Based Retrieval, Multi-Hop Reasoning, and Context Management

Live-SWE-Agents mitigate token/window limitations and accommodate long-range dependencies by employing hybrid context management strategies (Peng et al., 18 Sep 2025, Zhang et al., 29 May 2025):

Function-chunk RAG: Codebases are partitioned into atomic functional units, each embedded for dense retrieval against the query. This enables semantically relevant cross-file and cross-language retrieval, essential for multi-hop question answering and deep reasoning.
Sliding-Window RAG: For large files, context windows are generated with overlap; the most relevant windows are selected via embedding similarity.
Agentic Search Augmentation: Action-triggered searches pool candidate chunks and windows into a vector store, allowing dynamically focused semantic retrieval throughout the agent's trajectory.
Context Budget Partitioning: In multi-agent designs (e.g., SWE-Adept), token budgets are strictly divided between collaborating agents (localization and resolution), guaranteeing that no single stage swamps system resources (He et al., 1 Mar 2026).
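
The sliding-window strategy above can be made concrete with a short sketch that splits a file into overlapping line ranges. The window size and overlap values are illustrative assumptions; the cited works do not fix them here, and sliding_windows is a hypothetical helper.

```python
def sliding_windows(lines: list[str], size: int, overlap: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) line ranges that cover the file with overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    spans: list[tuple[int, int]] = []
    start = 0
    while start < len(lines):
        end = min(start + size, len(lines))
        spans.append((start, end))
        if end == len(lines):  # last window reached the end of the file
            break
        start += step
    return spans


source_lines = [f"line {i}" for i in range(10)]
spans = sliding_windows(source_lines, size=4, overlap=2)
print(spans)  # [(0, 4), (2, 6), (4, 8), (6, 10)]
```

Each window would then be embedded and ranked against the query, so a statement near a window boundary still appears with context on both sides in at least one window.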

This combination allows Live-SWE-Agents to track long causal chains, jointly synthesize across distributed code artifacts, and answer procedural or architectural questions otherwise inaccessible.

4. Evaluation Methodologies and Empirical Performance

Live-SWE-Agents are evaluated on a spectrum of real-world, repository-level benchmarks, including SWE-bench Verified, SWE-bench Pro, SWE-QA, and SWE-bench-Live (Peng et al., 18 Sep 2025, Wang et al., 9 Jun 2025, Yang et al., 27 Sep 2025, He et al., 1 Mar 2026, Xia et al., 17 Nov 2025, Zhang et al., 29 May 2025). Core metrics include:

Resolved Rate (RR): Fraction of instances where the agent's patch or answer fully passes reference tests or satisfies reference criteria.
Pass@k: Probability that at least one out of $k$ generated candidates resolves the task.
Scoring Criteria (for QA): Correctness, completeness, relevance, clarity, and reasoning, often scored 1–5 then aggregated to [0,100].
Efficient Execution: Average latency per action, iteration counts until success, and throughput in live settings are tracked, e.g., 2–4 s per action, ~2 minutes per end-to-end run for SWE-Dev (Wang et al., 9 Jun 2025).
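
The first two metrics above can be computed as in the sketch below. The pass@k function uses the combinatorial estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ that is common in code-generation evaluation, where $n$ candidates are sampled and $c$ of them resolve the task; whether each cited benchmark uses exactly this estimator is an assumption. The function names are hypothetical.

```python
from math import comb


def resolved_rate(outcomes: list[bool]) -> float:
    """Fraction of benchmark instances that were fully resolved."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(outcomes) / len(outcomes)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k candidates drawn from n resolves the task."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failing candidates exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


rr = resolved_rate([True, False, True, True])
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
print(f"RR={rr:.2f}  pass@1={p1:.3f}  pass@5={p5:.3f}")
```

With k = 1 the estimator reduces to c / n, which for a single task is that task's resolved probability.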

Recent results demonstrate substantial improvements over non-agentic or static approaches, especially for complex, multi-file tasks. On SWE-QA, the Live-SWE-Agent framework (SWE-QA-Agent) attains an overall best score of 47.82 (Claude 3.7 Sonnet), versus 36.08 for direct prompting (Peng et al., 18 Sep 2025). For open-source SWE-Dev, scaling interaction rounds from 30 to 75 improves the resolve rate from 34.0% to 36.6% (32B model), approaching closed-source systems (Wang et al., 9 Jun 2025).

5. Limitations, Open Challenges, and Future Research

Despite gains, Live-SWE-Agents face persistent challenges:

Procedural and Cross-File Reasoning: “Where” and “How” queries, which involve intricate data flow and cross-file control transfer, consistently yield lower performance (scores of about 37–38 out of 100) because agents struggle to aggregate and synthesize evidence across disparate code contexts (Peng et al., 18 Sep 2025).
Static vs. Dynamic Understanding: Reliance on static code parsing misses behaviors due to runtime features (dynamic imports, reflection), motivating hybrid static-dynamic integration (Peng et al., 18 Sep 2025, Xia et al., 17 Nov 2025).
Token/Context Swamping: Extremely deep call chains or wide dependency graphs can overwhelm even advanced retrieval-augmented frameworks.
Task and Environment Diversity: Most systems are tuned for Python; extension to polyglot or enterprise-scale codebases remains a significant hurdle (He et al., 1 Mar 2026, Zhang et al., 29 May 2025).
Human-Agent Collaboration: For real-world IDE-embedded agents, trust, explanation calibration, synchronization with developer effort, and controlling scope of autonomous edits persist as unsolved usability bottlenecks (Kumar et al., 14 Jun 2025).
Process-Level Failure Correction: Inefficiencies (redundancy, looping, nontermination) in agentic execution traces are common. Integrating Process Reward Models (PRMs) can raise the resolution rate by 10.6 percentage points (from 40.0% to 50.6% on SWE-bench Verified) by providing taxonomy-guided trajectory corrections at inference time (Gandhi et al., 2 Sep 2025).
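
As a much simpler stand-in for the process-level monitoring described in the last item, the sketch below flags actions that recur too often in a trajectory. It is a counting heuristic, not a Process Reward Model; the function name, the threshold, and the sample trajectory are all assumptions made for illustration.

```python
from collections import Counter


def find_redundant_actions(
    trajectory: list[tuple[str, str]], max_repeats: int = 2
) -> list[tuple[str, str]]:
    """Return (tool, argument) pairs that occur more than max_repeats times."""
    counts = Counter(trajectory)
    return [step for step, n in counts.items() if n > max_repeats]


# Hypothetical trace: the agent keeps re-reading the same file.
trajectory = [
    ("ReadFile", "a.py"),
    ("SearchContent", "foo"),
    ("ReadFile", "a.py"),
    ("ReadFile", "a.py"),
    ("SearchContent", "foo"),
]
flagged = find_redundant_actions(trajectory)
print(flagged)  # [('ReadFile', 'a.py')]
```

A monitor like this could trigger a corrective prompt when the list is non-empty, whereas a PRM would score the trajectory with a learned model instead of a fixed threshold.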

Prospective directions include RL-driven policy optimization over agent trajectories, dynamic and continual toolset evolution, cross-language retrievers, integration of debate-style multi-agent verification, and continual adaptation in CI/CD workflows (Peng et al., 18 Sep 2025, Xia et al., 17 Nov 2025, He et al., 1 Mar 2026).

6. Scalability, Infrastructure, and Benchmarking in Live Environments

To operate at scale and support continuous evaluation and learning, Live-SWE-Agents depend on robust infrastructure:

Automated Environment Preparation: Live, per-instance Docker containerization with time-machine dependency pinning ensures reproducibility for each benchmark task (Zhang et al., 29 May 2025, Chen et al., 3 Aug 2025).
Distributed Evaluation Harness: Ray-based and similar frameworks orchestrate parallel patch test, validation, and scoring, supporting 7K+ tasks with storage and execution speed optimizations (Chen et al., 3 Aug 2025).
Continuous Benchmarking: SWE-bench-Live and related benchmarks provide regularly updated pools of tasks (e.g., 1,319 tasks from 93 repos created since 2024) to evaluate agent generalization in real time, revealing performance drops compared to static settings (e.g., from 43.2% to 19.25%, a drop of 23.95 percentage points) (Zhang et al., 29 May 2025).
Process Monitoring and Adaptive Scaling: Agents monitor their own success/failure statistics and, in production, can feed trajectory data back into continual RL/finetuning pipelines for online improvement (Chen et al., 3 Aug 2025).
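
The distributed evaluation idea above can be illustrated with a small parallel harness. The cited infrastructure uses Ray; this sketch substitutes Python's standard-library thread pool so that it runs without extra dependencies. The evaluate_patch function is a hypothetical placeholder for running a task's tests inside its container, and the task records are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_patch(task: dict[str, str]) -> tuple[str, bool]:
    """Placeholder for applying a patch and running the task's tests (assumption)."""
    return task["id"], task["patch"] == task["expected"]


def run_harness(tasks: list[dict[str, str]], workers: int = 4) -> dict[str, bool]:
    """Evaluate all tasks in parallel and collect per-task pass/fail results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(evaluate_patch, tasks))


tasks = [
    {"id": "t1", "patch": "fix-a", "expected": "fix-a"},
    {"id": "t2", "patch": "fix-b", "expected": "fix-x"},
    {"id": "t3", "patch": "fix-c", "expected": "fix-c"},
    {"id": "t4", "patch": "fix-d", "expected": "fix-y"},
]
results = run_harness(tasks)
resolved = sum(results.values()) / len(results)
print(results, f"resolved rate = {resolved:.2f}")
```

Because pool.map returns results in input order, the per-task mapping is deterministic even though the evaluations run concurrently.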

Live-SWE-Agents thus represent the confluence of LLM-based reasoning, agentic action, environment orchestration, and adaptive learning, setting the state of the art for autonomous, scalable, and domain-general software engineering automation.