Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly? (2511.13646v1)
Abstract: LLMs are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Gödel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that Live-SWE-agent can achieve an impressive solve rate of 75.4% without test-time scaling, outperforming all existing open-source software agents and approaching the performance of the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.
Explain it Like I'm 14
What is this paper about?
This paper introduces Live-SWE-agent, a smart computer helper that fixes real-world coding problems. The special thing about it is that it can improve itself while it’s working. Instead of only using a fixed set of tools, it can create its own tools on the spot to solve the problem faster and better.
What questions does the paper try to answer?
To make the ideas clear, here are the main questions the researchers explore:
- Can a coding agent learn and improve itself during a task, not just before it?
- How do we let the agent decide when to build a new tool versus when to use an existing one?
- Does this “learn while working” approach beat other agents on well-known coding tests?
- Can this method stay simple, cheap, and work with different AI models?
How does Live-SWE-agent work?
Think of a software agent as a “digital mechanic” working on a codebase. Most mechanics have a fixed toolbox. Live-SWE-agent starts with a very basic toolbox (just command-line tools, known as “bash commands”) and then, as it works, it can invent and add new tools to its toolbox.
Here’s the approach in everyday terms:
- Starting small: It begins with the simplest setup (only basic commands) so it’s lightweight and general.
- Two choices at each step: Every time it makes a move, it can either: 1) run a command to do something, or 2) create a new tool (a small script) that makes future steps easier.
- What is a “tool”? A tool is just a little program (a script) the agent writes and saves. For example, it might create:
- An editor tool that safely replaces text in a file and confirms it worked.
- A search tool that finds code patterns and shows helpful context without flooding the screen (a rough sketch of such a tool appears right after this list).
- A special analyzer for a unique file type (like MARC or Go source files) when the problem demands it.
- Reflection loop: After seeing the result of each step, the agent pauses and asks itself, “Would making or improving a tool help me here?” This quick self-check is like a coach whispering, “Is there a faster way?” (A minimal sketch of this loop appears below, after the analogy.)
- No heavy training: Unlike some systems that spend lots of time and money training offline, Live-SWE-agent learns and adapts during the task itself. That keeps costs low and makes it work with different AI models.
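To make “create a new tool” concrete, here is an illustrative sketch (not a tool taken from the paper) of the kind of small script the agent might write for itself: a repository search that prints only the first 20 matches, each with a couple of lines of context, so the output never floods the model’s context window. The file name and details are hypothetical.

```python
#!/usr/bin/env python3
"""Illustrative custom tool (hypothetical): search a repo for a regex,
but cap the output so results stay small enough for an LLM's context."""
import re
import sys
from pathlib import Path

MAX_MATCHES = 20    # deliberately small: keep output readable for the model
CONTEXT_LINES = 2   # lines of surrounding context to print around each match


def search(root: str, pattern: str) -> None:
    regex = re.compile(pattern)
    shown = 0
    for path in sorted(Path(root).rglob("*.py")):
        try:
            lines = path.read_text(errors="ignore").splitlines()
        except OSError:
            continue
        for i, line in enumerate(lines):
            if not regex.search(line):
                continue
            lo = max(0, i - CONTEXT_LINES)
            hi = min(len(lines), i + CONTEXT_LINES + 1)
            print(f"--- {path}:{i + 1}")
            for j in range(lo, hi):
                print(f"{j + 1:6}: {lines[j]}")
            shown += 1
            if shown >= MAX_MATCHES:
                print(f"[output truncated after {MAX_MATCHES} matches]")
                return
    print(f"[{shown} matches total]")


if __name__ == "__main__":
    # Usage: python tools/search_tool.py <repo_root> <regex>
    search(sys.argv[1], sys.argv[2])
```

Once saved, the agent would run this script like any other bash command, for example `python tools/search_tool.py . "def resolve"`.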
Analogy: It’s like fixing a bike with a small starter kit, but if you discover a tricky bolt, you build a custom wrench right then and there—and keep using it for the rest of the repair.
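And here is a minimal sketch of the overall loop from the list above: at every step the agent either runs a bash command or writes a new tool script, and a short reflection prompt nudges it to consider tool creation. This is an assumption-laden illustration, not the authors’ implementation; `ask_llm` is a placeholder you would wire to an LLM backend, and the action format is invented for the sketch.

```python
# Minimal sketch of the "two choices at each step" loop described above.
# Assumed action format (not from the paper):
#   {"type": "bash", "command": "pytest -x", "done": False}            or
#   {"type": "create_tool", "name": "search.py", "code": "...", "done": False}
import pathlib
import subprocess

REFLECTION_PROMPT = (
    "Reflect on your recent steps: would writing or improving a small script "
    "(a custom tool) make the remaining work faster or more reliable? "
    "If so, create it; otherwise continue with a bash command."
)


def ask_llm(history: list) -> dict:
    """Placeholder: wire this to whichever LLM backend you use."""
    raise NotImplementedError("connect an LLM client here")


def solve(issue_text: str, max_steps: int = 75) -> None:
    history = [{"role": "user", "content": issue_text}]
    for _ in range(max_steps):
        history.append({"role": "user", "content": REFLECTION_PROMPT})
        action = ask_llm(history)
        history.append({"role": "assistant", "content": str(action)})

        if action["type"] == "create_tool":
            # Choice 2: grow the toolbox by saving an executable script.
            tool = pathlib.Path("tools") / action["name"]
            tool.parent.mkdir(exist_ok=True)
            tool.write_text(action["code"])
            tool.chmod(0o755)
            observation = f"created tool {tool}"
        else:
            # Choice 1: run a bash command and capture its output as feedback.
            # (Error handling and timeouts are simplified in this sketch.)
            result = subprocess.run(action["command"], shell=True,
                                    capture_output=True, text=True, timeout=300)
            observation = (result.stdout + result.stderr)[-4000:]

        # Feed the environment's response back for the next decision.
        history.append({"role": "user", "content": observation})
        if action.get("done"):
            break
```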
What did the researchers find?
The team tested Live-SWE-agent on well-known benchmarks, which are collections of real coding issues used to measure performance:
- SWE-bench Verified (500 human-validated Python issues): Live-SWE-agent solved about 75.4% of the tasks. That’s better than previous open-source agents and close to the best commercial systems.
- SWE-Bench Pro (731 harder, more “enterprise-level” issues across multiple languages): Live-SWE-agent solved about 45.8%, the best-known solve rate reported at the time.
Why this matters:
- Better problem solving: Creating tools on the fly helps the agent solve more issues and sometimes reduces the number of steps needed.
- Low extra cost: It improves results without expensive offline training. Some older “self-improving” agents needed hundreds of hours or even tens of thousands of dollars’ worth of training runs.
- Adapts to the problem: The agent builds general tools (like edit/search) and also very specific tools tailored to unusual file formats or languages. That flexibility is key for real-world coding.
- Works across AI models: Stronger AI models benefit the most from this approach; weaker models can struggle to invent useful tools reliably. As AI models improve, Live-SWE-agent’s benefits grow.
Why is this important, and what could it lead to?
- Less manual design: Instead of engineers handcrafting giant toolboxes ahead of time, the agent builds exactly what it needs at the moment—saving effort and making systems more general.
- Better AI evaluation: This method tests both an AI’s ability to fix code and its ability to invent helpful tools. That gives a richer picture of how capable an AI really is.
- Broader uses: The same idea could help with writing tests, finding security bugs, analyzing binaries, or optimizing performance—domains where you often need very specialized tools.
- Future growth: The agent could learn to evolve not only its tools but also its overall strategy (its “scaffold”) and carry useful tools and skills from one task to the next. It could even be trained to self-evolve during learning, making future AI more adaptable and robust.
In short, Live-SWE-agent shows that letting an AI “craft its own tools while working” is a powerful, simple, and low-cost way to boost performance on real software problems.
Knowledge Gaps
Below is a concise, action-oriented list of knowledge gaps, limitations, and open questions the paper leaves unresolved.
- Quantify runtime overhead: report wall-clock latency, token usage, and step counts per issue, and isolate the marginal cost/benefit of tool synthesis versus baseline actions.
- Safety and sandboxing: design and evaluate permissioning, allowlists/denylists, resource limits, and syscall/network sandboxes for executing agent-generated scripts.
- Tool correctness validation: introduce pre-execution checks (unit tests, static analysis, schema/IO contracts) and measure how validation affects solve rates and failures.
- Failure-mode analysis: systematically characterize and mitigate loops, excessive tool creation, redundant or conflicting tools, and harmful edits.
- Reflection policy design: study sensitivity to reflection prompt wording, frequency, and triggers; compare static prompting to learned controllers or bandit-style gating.
- Lifecycle management of tools: specify and test policies for naming, versioning, modification, deprecation, and pruning within and across sessions.
- Cross-task retention: implement a persistent skill/tool library with retrieval and evaluate transfer, catastrophic forgetting, and overfitting across repositories.
- Generality across scaffolds: measure gains when starting from richer initial toolsets (e.g., editors/searchers) and quantify the marginal benefit of on-the-fly tools.
- Interplay with test-time scaling: evaluate whether multi-sampling, multi-patch search, and diversified trajectories complement on-the-fly tool synthesis.
- Controlled baselines: re-run competing agents under the same LLMs, budgets, and environments to remove confounds from different backends and costs.
- LLM robustness boundaries: extend LLM ablations beyond a 50-issue subset to full benchmarks and strong open-source models; identify capability thresholds where tool creation helps vs. harms.
- Broader evaluation: test on larger/heterogeneous codebases (monorepos), additional languages (Java/C/C++), OS variations, and SWE-bench-Live/PolyBench/Multi-SWE-bench.
- Cost accounting completeness: report compute/memory profiles and token breakdowns (prompt, reflection, tool code) in addition to dollar cost per issue.
- Repository isolation: assess side effects of generated tools on repo state and test infrastructure; adopt ephemeral worktrees, shadow copies, or container snapshots.
- Dependency management: define how generated tools declare and install dependencies (virtualenvs, lockfiles, offline constraints) and measure the reliability impact.
- Security and compliance: analyze risks of data exfiltration, secrets handling, license compliance of generated code, and introduce audit trails or policy checks.
- Tool discovery and selection: add retrieval/ranking over created tools (within-session and cross-session) to prevent “tool sprawl” and measure its effect on efficiency.
- Error handling and observability: standardize exit codes, structured outputs, telemetry, and logs for generated tools; study how this affects agent planning and debugging.
- Causal attribution: quantify which tool categories drive performance gains via ablations (disable/replace specific tool types) and report effect sizes.
- Cost–benefit gating: learn or estimate a decision policy that predicts when to create a tool versus issuing direct commands, optimizing for expected downstream gains.
- Beyond tools self-evolution: implement and evaluate online changes to prompts, planner/loop parameters, memory mechanisms, and environment interfaces.
- Persistence and sharing: define serialization, versioning, and governance for sharing tools across repos/teams while preventing over-specialization or degradation.
- Human oversight: test approval/review workflows for tool creation and modification; measure trade-offs between autonomy, safety, and performance.
- Reproducibility: release exact prompts, seeds, environment images, and evaluation scripts; report variance across runs and sensitivity to hyperparameters.
- Metrics beyond resolve rate: measure patch quality (maintainability, style), regression risk, and partial credit/functional coverage to capture nuanced improvements.
- Small-model pathways: devise lightweight policies (caps on tool creation, stricter validation, distilled controllers) to mitigate degradation on weaker LLMs.
- Reflection scheduling: study adaptive reflection frequency (event-triggered or uncertainty-based) to balance context/cost with benefits.
- Tool clustering rigor: replace qualitative t-SNE with quantitative clustering and correlate cluster membership with task characteristics and success rates.
- Build/toolchain complexity: evaluate performance on projects with heavy compilers, flakiness, and long builds; study caching and incremental builds for practicality.
- Real-world CI/code review integration: test end-to-end flows with PR creation, CI gates, reviewer feedback loops, and policy checks.
- Data contamination risk: assess proprietary LLM training contamination on SWE-bench-style tasks using controlled splits or contamination-aware models.
- Step-limit dynamics: analyze how tool creation affects step counts and hitting limits; adaptively tune limits or planning granularity.
- Fair comparisons to workflow systems: benchmark against Agentless/Moatless under identical LLMs and budgets to isolate the contribution of live tool synthesis.
- Guardrails for destructive commands: add formal policies or verifiers to prevent unsafe actions (e.g., rm -rf, chmod changes) and quantify prevented incidents (a minimal sketch of such a guardrail follows this list).
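As a concrete starting point for the guardrail item above, here is a minimal, hypothetical sketch of a pre-execution checker: it rejects bash commands matching a small denylist of destructive patterns and confirms that a generated Python tool at least parses before it is run. None of this comes from the paper; real deployments would add allowlists, resource limits, and sandboxing.

```python
# Hypothetical pre-execution guardrail for an agent that runs shell commands
# and self-generated tools. Denylist patterns are illustrative, not exhaustive.
import ast
import re

DESTRUCTIVE_PATTERNS = [
    r"\brm\s+(-\w+\s+)*-\w*r\w*\b",  # recursive deletes: rm -rf, rm -fr, ...
    r"\bchmod\b",                    # permission changes
    r"\bmkfs\b",                     # formatting a filesystem
    r"\bdd\s+if=",                   # raw disk writes
    r">\s*/dev/sd",                  # redirecting output onto a block device
    r"\bcurl\b.*\|\s*(ba)?sh\b",     # piping a download straight into a shell
]


def command_allowed(command: str) -> bool:
    """Return False if the bash command matches a destructive pattern."""
    return not any(re.search(p, command) for p in DESTRUCTIVE_PATTERNS)


def python_tool_parses(code: str) -> bool:
    """Cheap pre-execution validation: does the generated tool even parse?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


if __name__ == "__main__":
    print(command_allowed("grep -rn 'TODO' src/"))   # True
    print(command_allowed("rm -rf /"))               # False
    print(python_tool_parses("print('hello')"))      # True
```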
Practical Applications
Overview
The paper introduces Live-SWE-agent, a minimal, runtime self-evolving software engineering agent that synthesizes and refines its own tools (as executable scripts) while solving real-world software issues. It achieves state-of-the-art open-source performance on SWE-bench Verified (75.4% solve rate without test-time scaling) and SWE-Bench Pro (45.8%). Its core innovation—on-the-fly tool creation with lightweight reflection—removes costly offline “self-improvement” and generalizes across LLMs, repositories, and languages.
Below are practical, real-world applications derived from the paper’s findings, categorized into immediate and long-term opportunities, with sector links, potential products/workflows, and key assumptions or dependencies.
Immediate Applications
These applications can be piloted or deployed now with strong LLMs, sandboxed environments, and integration into existing developer workflows.
- Software/DevOps: CI/CD auto-fixer for failing builds and issues
- Product/workflow: A GitHub/GitLab Action that triggers Live-SWE-agent on test failures, synthesizes repo-specific tools (search/edit/analyzers), proposes patches, and opens PRs with diffs and test results for human review (a rough glue sketch follows this list).
- Dependencies: Access to a strong LLM (e.g., Claude 4.5, GPT-5), reproducible tests, containerized sandbox with script execution, cost budget per task, VCS integration, audit logs.
- Enterprise Engineering: Internal “self-evolving” developer assistant
- Product/workflow: A bot that tailors tools to codebase idiosyncrasies (e.g., MARC or domain-specific file analyzers), supports root-cause analysis, rapid patching, and knowledge capture.
- Dependencies: Secure runtime with least privilege; change-management guardrails; access controls; observability of agent actions; alignment with enterprise coding standards.
- Open Source Maintenance: Issue triage and patch suggestions
- Product/workflow: Maintainer bot that reproduces issues, searches code, proposes fixes, and surfaces tailored tools for hard-to-inspect artifacts; can tag issues with confidence and open PRs.
- Dependencies: Clear issue descriptions, deterministic test harnesses, contributor license compliance, maintainer approval workflows.
- Developer Tools: IDE extensions for on-the-fly tool synthesis
- Product/workflow: VS Code/JetBrains plugin that lets the agent generate project-specific scripts (e.g., context-aware search, safe edit tools) to replace brittle multi-flag shell commands.
- Dependencies: Local sandboxed execution, policy for script persistence, telemetry to avoid privacy leakage, opt-in guardrails.
- AI Evaluation (Academia/Industry): Unified LLM benchmarking scaffold
- Product/workflow: Standardized evaluation harness that tests both issue-solving and tool-creation capabilities across models; reproducible tracks on SWE-bench Verified/Pro and internal corpora.
- Dependencies: Access to benchmarks and environments; standardized prompts; consistent cost budgets; seed control for reproducibility.
- Cost Reduction: Replace offline self-improving pipelines
- Product/workflow: Swap costly offline evolution (e.g., DGM/HGM pipelines) with Live-SWE-agent to deliver immediate performance gains at a fraction of cost, especially on verified issue sets.
- Dependencies: Reliable LLM backend; organizational acceptance of agent-in-the-loop instead of pretrained scaffolds; cost observability.
- Multilingual Development Support: Cross-language code analyzers
- Product/workflow: Agents synthesize lightweight analyzers for Go/TS/JS/Python to find identifiers, imports, and struct/function definitions for fast navigation and refactoring.
- Dependencies: Tolerance for heuristic parsers; not a substitute for full compilers; tests confirming analyzer accuracy on target repos.
- DevSecOps (low-hanging vulnerabilities): Rapid config/code hardening
- Product/workflow: Agent triages simple misconfigurations and code-level vulnerabilities (e.g., hardcoded secrets, unsafe defaults) and proposes patches; generates scanning tools tailored to the repo layout.
- Dependencies: Clear vulnerability reproduction; strict guardrails; security review gates; no binary-level or kernel-space actions.
- Data Engineering: Schema and ETL fix assistant
- Product/workflow: Agent synthesizes checks and validators (e.g., context-aware search over pipeline repos, schema diff tools) and proposes fixes for broken transformations.
- Dependencies: Access to sanitized data samples; privacy and compliance guardrails; staging environment; lineage/tracking.
- Education/Daily Life: Learning and local automation
- Product/workflow: Course labs where students explore agent tool creation; personal desktop tasks where the agent synthesizes scripts for batch file ops, log parsing, and small utilities.
- Dependencies: Safe local sandbox; curated tasks; limited permissions; model access.
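To ground the CI/CD auto-fixer item above, here is a rough, hypothetical glue sketch: run the test suite, hand a failing log to a Live-SWE-agent-style fixer, and open a pull request for human review. The `run_agent.py` entry point is a placeholder for however the agent is invoked; the git and gh commands are ordinary CLI calls, and nothing here is a shipped integration.

```python
# Hypothetical CI glue for the "CI/CD auto-fixer" workflow described above.
import subprocess
import sys


def sh(*cmd: str) -> subprocess.CompletedProcess:
    """Run a command and capture its output (no error handling in this sketch)."""
    return subprocess.run(cmd, capture_output=True, text=True)


def main() -> int:
    tests = sh("python", "-m", "pytest", "-x", "-q")
    if tests.returncode == 0:
        print("Tests pass; nothing to fix.")
        return 0

    failure_log = (tests.stdout + tests.stderr)[-8000:]  # keep the prompt small

    # Placeholder invocation: hand the failure log to the agent.
    agent = sh(sys.executable, "run_agent.py", "--task", failure_log)
    if agent.returncode != 0:
        print("Agent did not produce a fix.")
        return 1

    # Open a pull request with whatever the agent changed, for human review.
    branch = "live-swe-agent/auto-fix"
    sh("git", "checkout", "-b", branch)
    sh("git", "add", "-A")
    sh("git", "commit", "-m", "Proposed fix from Live-SWE-agent (needs review)")
    sh("git", "push", "-u", "origin", branch)
    sh("gh", "pr", "create", "--title", "Agent-proposed fix for failing tests",
       "--body", "Automated patch; please review the diff and CI results.")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```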
Long-Term Applications
These require further research, scaling, or productization—often beyond tool creation to full scaffold self-evolution, persistent learning, and training-time integration.
- Skills/Tool Marketplace: Persistent, reusable “skill” library
- Product/workflow: A “Skills Store” that serializes effective tools discovered across tasks; agents load relevant skills on demand (e.g., repo-specific analyzers) and refine them.
- Dependencies: Versioning, provenance tracking, safety vetting, skill selection policies, license compliance.
- Full Scaffold Self-Evolution: Agent modifies prompts/workflows
- Product/workflow: Meta-agent that tunes its own loop (system prompt, action selection, feedback strategies) based on live outcomes; automatic schedule for tool creation vs. direct actions.
- Dependencies: Robust self-modification guardrails, automated A/B validation, rollback mechanisms, interpretability/audit trails.
- Training-Time Self-Evolution (AI Labs): Integrate into post-training
- Product/workflow: Models learn to synthesize and use tools during training (RL/CoT+tools), capturing richer signals and generalizing across scaffolds.
- Dependencies: Compute budgets; curated curricula; safe training environments; evaluation suites that test tool creation reliably.
- Enterprise Modernization: Large-scale codebase migration and optimization
- Product/workflow: Agents orchestrate profilers, tracers, and transformation tools (which they partly synthesize) to migrate frameworks, optimize performance, and reduce tech debt.
- Dependencies: Deep test coverage; performance baselines; production staging; SRE collaboration; extensive guardrails.
- Advanced Security: Binary analysis and patch pipelines
- Product/workflow: Agents compose decompilers and dynamic instrumentation—creating glue tools—for vulnerability discovery and remediation in complex binaries.
- Dependencies: Specialized toolchains; strict isolation; legal/regulatory compliance; expert review; high-performance compute.
- Robotics/Embedded: Domain-adaptive controllers and data tooling
- Product/workflow: Embodied agents generate task-specific data collection scripts, calibration tools, and simple controllers on the fly during experiments/simulation.
- Dependencies: Real-time constraints; sim-to-real safety; hardware APIs; rigorous testing; fail-safes.
- Governance/Policy: Standards for self-modifying AI in production
- Product/workflow: Policies for auditability of auto-created tools, change control, risk classification, and minimal-permission execution; procurement checklists leveraging the evaluation harness.
- Dependencies: Multi-stakeholder consensus; legal frameworks; secure logging; incident response plans; third-party certification.
- SaaS Platforms: “Self-evolving CI” offerings
- Product/workflow: Managed CI that automatically synthesizes repo-specific tools and patches; enterprise features for compliance, privacy, and observability.
- Dependencies: Enterprise trust; SOC2/ISO compliance; role-based access control; strong SLAs.
- Education/Research: Curricula and benchmarks on tool creation
- Product/workflow: Courses and lab suites focused on runtime tool synthesis, agent reflection, and evaluation; new benchmarks measuring tool utility, efficiency, and generalization.
- Dependencies: Open-source environments; standardized tasks; access to strong models; reproducible setups.
- Consumer OS Automation: Auto-Shortcuts/Auto-CLI
- Product/workflow: On-device agents that create bespoke utilities for repetitive tasks (file organization, reports, backups) with a secure approval flow.
- Dependencies: On-device models or private inference; permissions/sandboxing; UX for review and trust; energy/compute constraints.
Cross-cutting assumptions and dependencies
- Strong LLMs significantly improve outcomes; weaker models may fail to synthesize useful tools or get stuck (per ablations).
- Safe, sandboxed execution with strict permissions is mandatory to mitigate security risks of runtime tool creation.
- Deterministic test harnesses and reproducible environments increase success rates and confidence.
- Observability, audit logs, and human-in-the-loop review are critical for production adoption.
- Licensing, provenance, and compliance must cover synthesized tools and patches, especially in regulated domains.
Glossary
- Ablation: An experimental method that removes or alters components to assess their impact on performance. Example: "Our ablations and tooling analyses reveal that (1) custom tool creation materially improves effectiveness (higher solve rates) with minimal overhead"
- Action space: The set of operations an agent can take at any point; can be fixed or expanded. Example: "most existing agents have fixed designs and are limited in their static action space"
- Agent scaffold: The structural framework (prompts, tools, interfaces, workflows) that defines how an agent operates. Example: "it can be extremely challenging and costly to exhaust the entire agent scaffold design space."
- Agentic loop: The iterative decision–execution–feedback cycle an agent follows during problem-solving. Example: "Live-SWE-agent does not make any changes to the agentic loop, impose a particular workflow, or require any costly offline training."
- Context window: The maximum amount of text an LLM can consider at once. Example: "only showing the first 20 matches can be critically important to not drastically inflate the context window"
- Custom tool: A script the agent synthesizes and executes to extend its capabilities for the current task. Example: "we define a custom tool as a script that can be executed in the environment."
- Custom Tool Synthesis: The process by which the agent generates bespoke tools during runtime to aid problem-solving. Example: "Custom Tool Synthesis"
- Darwin-Gödel Machine (DGM): A prior self-improving software agent that evolves via costly offline procedures. Example: "including the Darwin-Gödel Machine (DGM)."
- Decompiler: A tool that translates binaries back into higher-level representations to aid analysis. Example: "we need binary analysis tools and decompilers."
- Embedding: A vector representation of text or code used for similarity or visualization. Example: "We compute the embedding for each tool based on the tool body"
- Environmental feedback: The observable outputs (e.g., test results, command outputs) returned by the execution environment. Example: "directly provide the environmental feedback output to the agent"
- Huxley-Gödel Machine (HGM): A self-improving agent framework that performs offline evolution of scaffolds and strategies. Example: "Huxley-Gödel Machine (HGM)"
- LLM backend: The specific LLM used to power the agent’s reasoning and generation. Example: "benefits persist across different state-of-the-art LLM backends"
- MARC (Machine-Readable Cataloging) file: A standardized bibliographic data format used by libraries. Example: "MARC files, a file format for publication or text records"
- mini-SWE-agent: A minimal baseline agent that primarily uses bash tools. Example: "we implement Live-SWE-agent on top of the popular mini-SWE-agent framework."
- Offline evolution: Improving an agent by modifying and validating its scaffold using precompiled benchmarks rather than during live runs. Example: "they rely heavily on offline evolution"
- Offline training: Costly pre-execution learning phases used to adapt an agent or model to benchmarks. Example: "such self-improving agents require costly offline training on known benchmarks"
- On-the-fly self-evolution: The agent’s ability to modify itself (e.g., by creating tools) during runtime while solving a task. Example: "performing practical self-evolution on-the-fly during runtime"
- Patch: A code change submitted to fix an issue. Example: "and sample one patch per issue."
- Profiler: A performance analysis tool to identify bottlenecks in software systems. Example: "To optimize a large complex system, we need to apply profilers and tracing tools."
- Reflection prompt: A targeted prompt that asks the agent to analyze its recent steps and decide whether to create or revise tools. Example: "A lightweight step-reflection prompt repeatedly asks the agent whether creating or revising a tool would accelerate progress"
- Resolve rate: The percentage of issues successfully solved by the agent. Example: "achieves 75.4% resolve rate on Verified"
- Self-Improving Coding Agent (SICA): A prior framework for agents that evolve their strategies via offline improvement. Example: "Self-Improving Coding Agent (SICA)"
- Static analyzer: A tool that examines source code without executing it to extract structural or semantic information. Example: "a simple static analyzer for the Go programming language"
- SWE-agent: A software engineering agent framework equipped with specialized tools for repository tasks. Example: "SWE-agent"
- SWE-bench Multilingual: A benchmark of software engineering tasks across multiple programming languages. Example: "SWE-bench Multilingual"
- SWE-bench Verified: A curated benchmark of 500 real-world issues verified by human annotators. Example: "SWE-bench Verified benchmark"
- SWE-Bench Pro: A more challenging benchmark capturing enterprise-level software issues across multiple repos and languages. Example: "SWE-Bench Pro"
- Test-time scaling: Boosting performance by increasing inference-time resources (e.g., more samples or longer chains) rather than changing training. Example: "without test-time scaling"
- t-SNE: A dimensionality reduction method used to visualize high-dimensional embeddings. Example: "t-SNE visualization of tools generated by Claude 4.5 Sonnet"
- Trajectory: The sequence of states, actions, and observations an agent experiences while solving a task. Example: "reflect on its past trajectory to determine if it should create any tools"