Continuous Self-Evolving Agent Training

Updated 4 July 2026

Continuous self-evolving agent training is a paradigm where agents update mutable substrates, such as memories, skills, and prompts, to adapt autonomously without fixed task boundaries.
Approaches in this field include modular reinforcement learning, memory-centric evolution, and dynamic world-model adaptations that balance rapid learning with long-term retention.
Empirical evidence shows that incorporating explicit preservation mechanisms mitigates catastrophic forgetting and enhances performance across both simple and complex task domains.

Continuous self-evolving agent training denotes a family of methods in which an agent improves across ongoing interaction streams by updating a mutable capability substrate (such as modules, memories, workflows, skills, prompts, world models, or parameters) rather than remaining a static test-time executor. Across continual reinforcement learning, embodied control, web navigation, GUI automation, productivity agents, service dialogue, education simulation, and counseling, the common objective is to preserve useful prior competence while acquiring new behaviors under changing task distributions, often without human-supplied task boundaries or large-scale annotation (Powers et al., 2022, Zhang et al., 2 Feb 2026, Feng et al., 9 Feb 2025, Dong et al., 20 Apr 2026).

1. Conceptual scope and historical trajectory

The field spans at least two distinct lineages. One lineage originates in continual and evolutionary reinforcement learning. An early example is the self-training autonomous driving agent STAD, which combined reinforcement learning, evolutionary strategies, and a World Models-style stacked architecture for OpenAI CarRacing-v0, and introduced difference images in the autoencoder to bias the latent space toward motion-relevant information (Kotyan et al., 2019). A later continual-RL example is Self-Activating Neural Ensembles (SANE), which replaced a single monolithic policy with a dynamic ensemble of actor-critic modules and used selective updating to avoid catastrophic forgetting without assuming task IDs or explicit task boundaries (Powers et al., 2022).

A second lineage centers on LLM agents and treats deployment itself as a continual learning regime. In this view, the agent does not merely retrieve past context; it accumulates experience, distills reusable procedures, adapts memory selection, and may even revise the optimizer that edits prompts. Experience-driven Lifelong Learning (ELL) formalizes this by defining a lifelong sequence of tasks, a knowledge state $\mathcal{K}=(\mathcal{M},\mathcal{F})$ composed of memory and skills, and a learning operator $\Phi_{\text{learn}}$ that can add, update, delete, or combine knowledge over time (Cai et al., 26 Aug 2025).
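The ELL formulation can be made concrete with a minimal sketch. It assumes a knowledge state made of two plain dictionaries and a single function standing in for $\Phi_{\text{learn}}$. The names `KnowledgeState` and `learn`, the string-valued entries, and the merge rule used for "combine" are illustrative assumptions, not the representation used by Cai et al.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class KnowledgeState:
    """K = (M, F): a memory store M and a skill set F (layout is illustrative)."""
    memory: Dict[str, str] = field(default_factory=dict)  # M: key -> stored experience
    skills: Dict[str, str] = field(default_factory=dict)  # F: name -> procedure text


def learn(state: KnowledgeState, op: str, store: str, key: str,
          value: Optional[str] = None,
          sources: Tuple[str, ...] = ()) -> KnowledgeState:
    """Hypothetical stand-in for Phi_learn: apply one edit, return a new state."""
    memory, skills = dict(state.memory), dict(state.skills)
    target = memory if store == "memory" else skills
    if op in ("add", "update"):
        target[key] = value or ""
    elif op == "delete":
        target.pop(key, None)
    elif op == "combine":
        # Assumed merge rule: join the source entries and drop the originals.
        target[key] = " / ".join(target.pop(s) for s in sources if s in target)
    else:
        raise ValueError(f"unknown operation: {op}")
    return KnowledgeState(memory=memory, skills=skills)
```

Returning a new state rather than mutating in place keeps each $\mathcal{K}_t$ available for later comparison, which is convenient when measuring forgetting across the task sequence.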

This broadening of scope has shifted the meaning of “training.” In some systems, training still refers to explicit policy or world-model updates, as in EvoAgent’s continual World Model updates or SEAgent’s GRPO-based policy refinement (Feng et al., 9 Feb 2025, Sun et al., 6 Aug 2025). In others, training is implemented through structured memory evolution, reflective context engineering, or prompt evolution without changing backbone weights, as in MetaAgent, MUSE, MOBIMEM, AutoAgent, and SePO (Qian et al., 1 Aug 2025, Yang et al., 9 Oct 2025, Liu et al., 15 Dec 2025, Wang et al., 10 Mar 2026, Tao et al., 3 Jun 2026). The concept is therefore best understood as an umbrella over multiple continual adaptation mechanisms rather than a single algorithmic template.

2. Mutable substrates and recurrent training loops

A unifying feature of the literature is that each system specifies what is allowed to evolve. Capability-Preserving Evolution (CPE) makes this explicit by modeling self-evolving agents as maintaining a mutable capability repository $R_t$, which may be an executable workflow, a bounded skill bank, model parameters or LoRA adapters, or an external memory store (Yu et al., 10 May 2026).

Mutable substrate	Representative systems	Core update pattern
Modular policy components	SANE	activate one module, update only that module
External memory and experience	Live-Evo, MUSE, MOBIMEM, AutoAgent	retrieve, act, reflect, rewrite or reweight memory
Skills and workflows	PsychAgent, CPE, MetaAgent	extract, merge, preserve, or refine procedural knowledge
World models and environments	EvoAgent, WebEvolver, Agent-World	update dynamics knowledge and synthesize targeted tasks
Prompts and optimizer prompts	SePO	evolve task prompts and the prompt agent’s own prompt

Although concrete implementations differ, the operational loops show strong structural parallels. Live-Evo uses a four-stage loop (Retrieve, Compile, Act, and Update) in which the Experience Bank stores structured historical prediction experiences, the Meta-Guideline Bank stores higher-level synthesis rules, and experience retrieval is scored by $\text{Score}=\text{Weight}\times \text{Sim}(\text{exp},\text{query})$ before weights are updated from the “memory-on vs. memory-off” outcome gap (Zhang et al., 2 Feb 2026). MUSE expresses the same general pattern as $\text{Plan} \rightarrow \text{Execute} \rightarrow \text{Reflect} \rightarrow \text{Memorize} \rightarrow \text{Replan}$, but organizes memory hierarchically into strategic, procedural, and tool memory (Yang et al., 9 Oct 2025). AutoAgent compresses the runtime control cycle further into Select, Execute, and Update, while its Elastic Memory Orchestrator preserves raw traces, compressed summaries, and episodic abstractions (Wang et al., 10 Mar 2026).
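The Live-Evo retrieval score and its weight update can be sketched as follows. Only the product $\text{Weight}\times\text{Sim}$ and the use of the memory-on versus memory-off gap come from the description above. The cosine similarity over embedding vectors, the additive update with learning rate `lr`, the weight floor, and the convention that higher task scores are better are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Experience:
    text: str
    embedding: List[float]
    weight: float = 1.0


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(bank, query_embedding, k=2):
    """Rank experiences by Score = Weight * Sim(exp, query); return the top k."""
    ranked = sorted(
        bank,
        key=lambda e: e.weight * cosine(e.embedding, query_embedding),
        reverse=True,
    )
    return ranked[:k]


def update_weights(used, score_with_memory, score_without_memory,
                   lr=0.5, floor=0.05):
    """Assumed rule: shift each used weight by lr * (memory-on - memory-off)."""
    gap = score_with_memory - score_without_memory
    for e in used:
        e.weight = max(floor, e.weight + lr * gap)
```

Under these assumptions, an experience that was retrieved but made the outcome worse loses weight and is ranked lower on later queries, even when its similarity to the query stays the same.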

The role of the evolving substrate determines the style of adaptation. Memory-centric systems make future behavior depend on retrieved traces, compiled guidelines, or cached actions; modular continual-RL systems make future behavior depend on expert selection and expert-local parameter updates; world-model systems make future behavior depend on progressively improved predictions of environment dynamics; prompt-evolution systems make future behavior depend on increasingly effective natural-language instructions (Powers et al., 2022, Feng et al., 9 Feb 2025, Tao et al., 3 Jun 2026).

3. Principal methodological families

One family emphasizes update isolation and modular specialization. SANE defines each module $\mathcal{M}_i$ as a policy $\pi_i(a\mid s)$, a critic $V_i(v,u\mid s)$, and a replay buffer $\mathcal{B}_i$, computes upper and lower confidence bounds from the critic’s value and uncertainty estimates, and greedily activates the module with the largest upper confidence bound at the start of each trajectory (Powers et al., 2022). Drift detection against a frozen anchor critic determines whether to clone the active module, while merge operations keep the ensemble within a fixed budget. The central anti-forgetting mechanism is simple: only the activated module is updated.
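A toy sketch of the activate-then-update-one-module pattern is given below. It is not SANE's implementation: the critic is reduced to two scalars per module, the bound is assumed to be $v + c\,u$, the update is a simple running average, and drift detection, cloning, and merging are omitted. The names `Module`, `activate`, and `train_on_trajectory` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    name: str
    value: float        # critic's value estimate v for the initial state
    uncertainty: float  # critic's uncertainty estimate u
    updates: int = 0
    replay: list = field(default_factory=list)

    def ucb(self, c=1.0):
        """Assumed upper confidence bound: v + c * u."""
        return self.value + c * self.uncertainty


def activate(ensemble, c=1.0):
    """Greedily pick the module with the largest upper confidence bound."""
    return max(ensemble, key=lambda m: m.ucb(c))


def train_on_trajectory(ensemble, trajectory, episode_return, lr=0.1):
    """Update only the activated module; every other module is left untouched."""
    active = activate(ensemble)
    active.replay.append(trajectory)
    active.value += lr * (episode_return - active.value)
    active.uncertainty *= 0.9  # toy assumption: uncertainty shrinks with use
    active.updates += 1
    return active
```

Because inactive modules receive no updates, whatever they encoded for earlier tasks is preserved by construction. The cost is that knowledge does not transfer between modules in this sketch.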

A second family emphasizes memory as the primary locus of self-evolution. Live-Evo learns online from continuous feedback by decoupling “what happened” from “how to use it,” reinforcing experiences that improve forecasting and down-weighting stale or misleading ones; it also stores new experiences only under a “verify before write” rule (Zhang et al., 2 Feb 2026). MetaAgent begins from a minimal workflow consisting of autonomous reasoning and adaptive help-seeking, then accumulates verified reflection, self-reflection, and a persistent local knowledge base built from tool-use history (Qian et al., 1 Aug 2025). MUSE stores strategic memory as $\langle \text{Dilemma}, \text{Strategy} \rangle$ pairs, procedural memory as hierarchical SOPs, and tool memory as static descriptions plus dynamic instructions (Yang et al., 9 Oct 2025). MOBIMEM decomposes post-deployment evolution into Profile Memory, Experience Memory, and Action Memory, coupled with OS-inspired services such as a scheduler, AgentRR, and context-aware exception handling (Liu et al., 15 Dec 2025). AutoAgent extends this line by treating cognition itself as an explicit prompt-level state over tools, self-capabilities, peer expertise, and task knowledge, and by continuously rewriting that cognition from intention–outcome alignment analysis (Wang et al., 10 Mar 2026).
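The “verify before write” rule can be illustrated with a short gate function. This is a sketch under assumptions: `evaluate` is a hypothetical callback that scores the agent with a given memory (higher is better), and a candidate is accepted only if adding it strictly improves on the score obtained with the existing memory.

```python
def verify_before_write(bank, candidate, evaluate):
    """Append `candidate` to `bank` only if it improves the evaluated score.

    `evaluate(memory_list) -> float` is a hypothetical scoring callback;
    higher is assumed to be better.
    """
    baseline = evaluate(bank)            # score with the current memory
    trial = evaluate(bank + [candidate])  # score if the candidate were stored
    if trial > baseline:
        bank.append(candidate)
        return True
    return False
```

A gate of this kind trades extra evaluation cost on every write for protection against memory pollution, since entries that do not help are never stored.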

A third family emphasizes world-model-based or environment-coevolutionary training. EvoAgent combines a memory-driven planner, a WM-guided action controller, and an experience-inspired reflector, then updates a continual World Model with curriculum-selected experiences and a Fisher-regularized loss (Feng et al., 9 Feb 2025). WebEvolver trains a co-evolving World Model LLM to predict next web observations, uses it both as a virtual web server for synthetic trajectory generation and as an imagination engine for World Model Look-Ahead during inference, and updates both policy and world model on self-collected trajectories (Fang et al., 23 Apr 2025). SEAgent extends self-evolution to GUI agents through a World State Model for step-wise trajectory assessment, a Curriculum Generator that expands a guidebook memory, and a training objective that combines adversarial imitation on failure actions with GRPO on successful actions (Sun et al., 6 Aug 2025). Agent-World scales the environment side itself: dynamic evaluation task synthesis, diagnosis of weak environments, and targeted retraining form a self-evolving arena in which agent policies and executable environments co-evolve (Dong et al., 20 Apr 2026).

A fourth family emphasizes self-generated curricula, simulation loops, or self-referential optimizers. SEAD decouples user simulation into a Profile Controller and a User Role-play Model, so that service-dialogue training evolves by shifting the distribution of initial user states toward “golden” scenarios defined by a target completion rate, instead of training an adversarial user (Dai et al., 3 Feb 2026). AgentEvolver replaces handcrafted task sets with self-questioning, self-navigating, and self-attributing, so that the model itself generates proxy tasks, reuses summarized experiences, and assigns denser credit within trajectories (Zhai et al., 13 Nov 2025). SePO closes a different loop by treating the prompt agent’s own system prompt as an optimization target and evolving it via archive-based search alongside downstream task prompts (Tao et al., 3 Jun 2026). In simulation-first settings, AI-Agent School (AAS) uses the Zero-Exp strategy and a continuous “experience-reflection-optimization” cycle over experience and knowledge bases with short-term and long-term memory components, while Parental Guidance treats policies as lineages that reproduce, inherit via behavior cloning, and then surpass parents through PPO-style improvement (Jin et al., 13 Oct 2025, Zhang et al., 24 Mar 2025).

4. Stability, forgetting, and the preservation problem

A central result of the literature is that self-evolution is not inherently cumulative. The capability-erosion analysis shows that adapting to a new task distribution can degrade previously acquired capabilities across workflow, skill/tool, model, and memory evolution. CPE formalizes this as a stability–plasticity problem and replaces naïve stagewise optimization with a regularized objective, written here schematically as

$$R_{t+1} = \arg\max_{R}\; \big[\, J_{t+1}(R) - \lambda\, \Omega(R, R_t) \,\big]$$

where $J_{t+1}$ denotes performance on the new task distribution, $\lambda$ sets the preservation strength, and the regularizer $\Omega$ penalizes destructive drift away from previously useful structure (Yu et al., 10 May 2026).
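The stability–plasticity trade-off described above can be illustrated with a toy selection rule: among candidate repositories, pick the one that maximizes new-task score minus a weighted drift penalty. Everything here is an assumption made for illustration, including the set-based repository, the drift measure (fraction of previously held entries that a candidate drops), and the weight `lam`. CPE's actual objective and regularizer are not reproduced.

```python
def drift(old_repo, new_repo):
    """Fraction of previously held entries that the candidate repository drops."""
    if not old_repo:
        return 0.0
    return len(set(old_repo) - set(new_repo)) / len(old_repo)


def select_repository(old_repo, candidates, new_task_score, lam=1.0):
    """Pick the candidate maximizing new_task_score(r) - lam * drift(old, r).

    `new_task_score` is a hypothetical callback returning a float
    (higher is better) for a candidate repository.
    """
    return max(
        candidates,
        key=lambda r: new_task_score(r) - lam * drift(old_repo, r),
    )
```

With `lam=0` the rule reduces to naïve stagewise optimization and will discard old entries whenever that helps the new task. A positive `lam` makes a candidate pay for each previously held entry it drops.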

The empirical findings are consistent with this framing. In workflow evolution under GPT-5.1 optimization, CPE improves retained simple-task performance while also increasing complex-task performance. In memory evolution, average old-domain performance drops after vanilla memory updates relative to its level before new-task evolution, whereas CPE raises retained performance (Yu et al., 10 May 2026). These results directly contradict the common assumption that autonomous refinement is necessarily monotonic.

Selective-update architectures offer one route to preservation. SANE avoids catastrophic forgetting because later training does not overwrite inactive experts: only the currently activated module is updated, while unused modules remain unchanged (Powers et al., 2022). EvoAgent addresses the same issue at the world-model level by selecting informative experiences and regularizing parameter movement with a diagonal Fisher penalty during World Model updates (Feng et al., 9 Feb 2025). Memory-centric systems address interference differently. Live-Evo changes retrieval frequency by reweighting experiences from outcome gaps and refuses to store new summaries unless they improve performance over the original memory-on score, thereby guarding against memory pollution (Zhang et al., 2 Feb 2026). PsychAgent divides the preservation problem into memory continuity, skill evolution, and rejection fine-tuning, so that longitudinal counseling quality depends on both explicit historical state and selective internalization of successful trajectories (Yang et al., 1 Apr 2026).
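A diagonal Fisher penalty of the kind mentioned for EvoAgent's World Model updates is commonly written in the elastic-weight-consolidation form $\tfrac{\lambda}{2}\sum_i F_i(\theta_i-\theta_i^{*})^2$. The sketch below implements that standard form on plain Python lists. Whether EvoAgent uses exactly this expression is an assumption, and the function names are hypothetical.

```python
def fisher_penalty(params, anchor, fisher_diag, lam=1.0):
    """EWC-style penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i) ** 2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor, fisher_diag)
    )


def regularized_loss(task_loss, params, anchor, fisher_diag, lam=1.0):
    """New-task loss plus the penalty for moving parameters the old tasks rely on."""
    return task_loss + fisher_penalty(params, anchor, fisher_diag, lam)
```

Parameters with a large Fisher value are expensive to move, so updates are steered toward directions that earlier experience was less sensitive to.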

The preservation problem also reveals a trade-off. Resource-bounded merging, overly aggressive skill replacement, or unconstrained memory insertion can reintroduce destructive interference. SANE explicitly reports stronger forgetting when too few modules are available, and CPE shows that bounded skill banks and mutable memory stores can suppress earlier useful capabilities if preservation is not enforced (Powers et al., 2022, Yu et al., 10 May 2026).

5. Benchmarks, metrics, and empirical regularities

The evaluation landscape is unusually heterogeneous, reflecting the breadth of the field. Benchmarks include Procgen continual-RL sequences, OpenAI CarRacing-v0, Minecraft, OS-World, AndroidWorld, Mind2Web-Live, WebVoyager, GAIA-web, Prophet Arena, Xbench-DeepResearch, TAC, AppWorld, BFCL v3, PsychEval, StuLife, and Agent-World’s synthesized ecosystem of 1,978 retained environments and 19,822 distinct tools (Kotyan et al., 2019, Feng et al., 9 Feb 2025, Sun et al., 6 Aug 2025, Zhang et al., 2 Feb 2026, Yang et al., 9 Oct 2025, Zhai et al., 13 Nov 2025, Yang et al., 1 Apr 2026, Cai et al., 26 Aug 2025, Dong et al., 20 Apr 2026).

Domain	Representative finding	Paper
Continual RL in Procgen	SANE showed strongest retention on Climber, outperformed baselines on the first three Miner environments, and had mixed but strong forgetting scores on Fruitbot	(Powers et al., 2022)
Live forecasting	Brier score improved from 0.19 to 0.14 and market return from 1.24 to 1.46	(Zhang et al., 2 Feb 2026)
Novel-software computer use	Success rate improved from 11.3% to 34.5% over UI-TARS	(Sun et al., 6 Aug 2025)
Long-horizon productivity	TAC results reported across three metrics, including Avg. and PCR	(Yang et al., 9 Oct 2025)
Long-horizon embodied tasks	EvoAgent reported a 105% average success-rate improvement and more than 6x reduction in ineffective actions	(Feng et al., 9 Feb 2025)
Multi-round environment co-evolution	Two self-evolution rounds improved $\tau^2$-Bench, BFCL-V4, and MCP-Mark for both Agent-World-14B and EnvScaler-8B	(Dong et al., 20 Apr 2026)

The metrics used are correspondingly diverse. Continual-RL work emphasizes forgetting statistics; forecasting uses multiclass Brier score and market return; streaming memory benchmarks evaluate answer accuracy, success rate, step efficiency, and sequence robustness; lifelong-learning benchmarks such as StuLife introduce Average Performance, Average Incremental Performance, Forgetting Measure, Backward Transfer, and Forward Transfer (Powers et al., 2022, Zhang et al., 2 Feb 2026, Wei et al., 25 Nov 2025, Cai et al., 26 Aug 2025).
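Three of these metrics have widely used textbook definitions, sketched below. The multiclass Brier score is the mean squared distance between the predicted distribution and the one-hot outcome. Backward transfer and forgetting are computed from a matrix `R` in which `R[i][j]` is performance on task `j` after training through task `i`. These are common variants from the continual-learning literature; the exact definitions used by StuLife or the other cited benchmarks may differ.

```python
def multiclass_brier(probs, outcomes):
    """Mean over samples of sum_k (p_k - o_k)^2, with o the one-hot outcome."""
    total = 0.0
    for p, k in zip(probs, outcomes):
        total += sum((pj - (1.0 if j == k else 0.0)) ** 2 for j, pj in enumerate(p))
    return total / len(probs)


def backward_transfer(R):
    """Average of R[T-1][i] - R[i][i] over the first T-1 tasks (negative = forgetting)."""
    T = len(R)
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)


def forgetting(R):
    """Average gap between the best earlier score on a task and its final score."""
    T = len(R)
    return sum(
        max(R[l][j] for l in range(j, T - 1)) - R[T - 1][j]
        for j in range(T - 1)
    ) / (T - 1)
```

Lower is better for the Brier score and for forgetting, while backward transfer is better when higher. Both continual-learning functions assume at least two tasks.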

Despite the diversity, several regularities recur. First, repeated interaction often yields genuine improvement over time. MUSE reports monotonically increasing scores across three sequential iterations on a continual-learning TAC subset, and Live-Evo outperforms its base forecasting agent across a 10-week horizon (Yang et al., 9 Oct 2025, Zhang et al., 2 Feb 2026). Second, self-improvement often plateaus unless exploration or environment diversity is enriched. WebEvolver shows that ordinary self-improvement iterations stagnate, whereas a co-evolving world model restores progress; Agent-World likewise reports monotonic but diminishing gains across self-evolution rounds (Fang et al., 23 Apr 2025, Dong et al., 20 Apr 2026). Third, stronger performance usually requires more than raw memory accumulation. Evo-Memory finds that simple experience reuse already helps, but ReMem’s action–think–memory refine pipeline becomes strongest when memory is actively reorganized and pruned (Wei et al., 25 Nov 2025).

6. Limitations, misconceptions, and emerging directions

A recurring misconception is that “self-evolving” implies guaranteed monotonic improvement. The capability-erosion results directly reject that view, showing degradation across all major evolution channels unless explicit preservation mechanisms are added (Yu et al., 10 May 2026). A second misconception is that self-evolving training necessarily means parameter updates. A large subliterature instead keeps weights fixed and evolves memory, cognition, or prompts: MetaAgent relies on reflection and persistent tool-use history, MOBIMEM updates structured memories and runtime artifacts, MUSE grows hierarchical memory, AutoAgent rewrites cognition, and SePO evolves system prompts (Qian et al., 1 Aug 2025, Liu et al., 15 Dec 2025, Yang et al., 9 Oct 2025, Wang et al., 10 Mar 2026, Tao et al., 3 Jun 2026).

A third misconception is that memory accumulation alone solves continual learning. Several papers document the opposite. Live-Evo treats stale or misleading experiences as an active failure mode and combats them with weight decay and “verify before write”; CPE shows retrieval competition and destructive memory updates in Dynamic Cheatsheet-style memory evolution; Evo-Memory shows that naïvely storing failures can pollute memory unless refinement is explicit (Zhang et al., 2 Feb 2026, Yu et al., 10 May 2026, Wei et al., 25 Nov 2025).

The literature is also explicit about technical limits. SANE works best when tasks are distinguishable from the initial observation and acknowledges limited transfer across isolated modules (Powers et al., 2022). WebEvolver reports rollout drift as world-model depth increases, with intrinsic quality degrading beyond shallow look-ahead depths (Fang et al., 23 Apr 2025). Agent-World shows positive but diminishing returns from more environments and more self-evolution rounds, while also implying substantial computational cost for environment synthesis, sandbox execution, diagnosis, and continued RL (Dong et al., 20 Apr 2026). SEAD demonstrates that training the user simulator adversarially can produce reward hacking and unrealistic behavior, which is why it fixes the User Role-play Model and adapts only the profile distribution (Dai et al., 3 Feb 2026).

Taken together, these results suggest a more constrained interpretation of the field. Continuous self-evolving agent training is not a single recipe for autonomous improvement, but a design space defined by four persistent requirements: targeted exploration or task synthesis, selective retention of useful prior structure, explicit control of interference or forgetting, and a mechanism for converting validated interaction history into reusable competence. A plausible implication is that robust long-horizon systems will increasingly combine external memory evolution, preservation-aware updates, dynamically generated curricula, and some form of internalization—whether into skills, prompts, world models, or parameters—rather than relying on any one mechanism alone (Cai et al., 26 Aug 2025, Yang et al., 1 Apr 2026).