
Memory-Driven Self-Evolution

Updated 20 February 2026
  • Memory-Driven Self-Evolution is a framework that leverages non-parametric memory modules to enable agents to learn and adapt during inference without altering core model parameters.
  • It employs methodologies like episodic utility-guided updates and Darwinian selection to mitigate catastrophic forgetting and optimize performance on diverse tasks.
  • Empirical results across benchmarks demonstrate improved accuracy, cumulative rewards, and stability, highlighting its potential in continual and reinforcement learning.

Memory-Driven Self-Evolution refers to a class of agent architectures and learning frameworks in which episodic or structural memory modules, external to core model parameters, serve as both the substrate and mechanism for self-improvement during deployment. These systems allow agents to accumulate, organize, retrieve, and revise experiences in a manner that progressively enhances agent competence and adaptability, often entirely at inference or runtime, without altering the underlying parametric backbone (e.g., a frozen LLM). By bridging reinforcement learning, continual learning, and meta-cognitive retrieval frameworks, memory-driven self-evolution addresses the stability–plasticity dilemma and the catastrophic forgetting endemic to traditional fine-tuning, enabling ongoing task adaptation and performance growth across a wide range of environments.

1. Core Concepts and Formal Abstraction

In memory-driven self-evolution systems, an agent typically comprises:

  • A stable reasoning “core” (often a frozen LLM or policy network)
  • An episodic or structured external memory $M$, storing experiences, strategies, or extracted skills
  • A memory retrieval and ranking policy $p(m \mid s, M)$, implemented non-parametrically
  • Update rules that evolve $M$'s contents (utilities, structure, policies) in response to trial-and-error interaction

A canonical formalism involves a tuple $(\mathcal{F}, \mathcal{U}, \mathcal{R}, \mathcal{C})$:

  • $\mathcal{F}$: forward model (inference/generation)
  • $\mathcal{R}$: retrieval function, e.g., selecting the top-$K$ memory entries relevant to the current context
  • $\mathcal{C}$: contextualization/synthesis, amalgamating retrieved memory with the current input for reasoning or action
  • $\mathcal{U}$: update function, which evolves $M$ in response to the outcome and feedback of current trials (Wei et al., 25 Nov 2025)

Typical memory elements are triplets or tuples $(z, e, Q)$, where $z$ is an embedding of the past state/intention, $e$ is the raw experience (e.g., a trace or action plan), and $Q$ is a scalar utility reflecting past task performance when $e$ was applied to contexts like $z$ (Zhang et al., 6 Jan 2026).
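The $(z, e, Q)$ triplet and the retrieval function $\mathcal{R}$ can be sketched in a few lines of Python; the class and function names here are illustrative, not taken from any cited implementation:

```python
# Minimal sketch of the (z, e, Q) memory element and top-K retrieval (R).
# MemoryEntry, cosine, and retrieve are hypothetical names for illustration.
import math
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    z: list[float]   # embedding of the past state/intention
    e: str           # raw experience (trace or action plan)
    Q: float         # scalar utility from past task performance

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(memory, z_query, k=2):
    """R: return the top-K entries ranked by similarity to the query embedding."""
    return sorted(memory, key=lambda m: cosine(m.z, z_query), reverse=True)[:k]

memory = [
    MemoryEntry([1.0, 0.0], "plan A", 0.8),
    MemoryEntry([0.0, 1.0], "plan B", 0.3),
    MemoryEntry([0.9, 0.1], "plan C", 0.5),
]
top = retrieve(memory, [1.0, 0.0], k=2)
print([m.e for m in top])  # → ['plan A', 'plan C']
```

Utility-aware variants (Section 2) additionally fold $Q$ into the ranking score rather than ranking by similarity alone.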

2. Principal Architectures and Methodologies

Episodic Utility-Guided Memory (MemRL)

MemRL (Zhang et al., 6 Jan 2026) epitomizes memory-driven self-evolution, decoupling a frozen LLM “reasoning module” from a plastic, non-parametric episodic memory. Each episode proceeds as follows:

  1. Intent Encoding: The incoming query/task $s_t$ is embedded into $z_t$.
  2. Two-Phase Retrieval: Memory $M$ is first filtered by cosine similarity to $s_t$ (semantic phase), then scored by a composite of similarity and learned $Q$-value (utility phase). The highest-scoring experiences are retrieved.
  3. Contextualized Inference: The LLM generates its output conditioned on $M_{\text{ctx}}(s_t)$.
  4. Environmental Feedback: The post-execution reward $r_t$ is used to update the $Q$ values of all consumed entries using an EMA or Bellman-style rule:

$$Q_j \leftarrow Q_j + \alpha\,(r_t - Q_j)$$

  5. Memory Expansion: New experience summaries are appended as fresh triplets (Zhang et al., 6 Jan 2026).

The retrieval-and-update policy is theoretically justified (convergence of $Q$ under the EMA rule, monotonic improvement of a variational lower bound under policy improvement, and Bellman contraction), and empirically yields marked accuracy and cumulative-reward gains across diverse benchmarks, outperforming non-evolving memory and RAG alternatives.
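The episode loop above, two-phase retrieval followed by the EMA utility update, can be sketched as follows; the composite scoring weight `lam`, the similarity threshold `tau`, and the dictionary layout are assumptions for illustration, not the paper's exact code:

```python
# Hedged sketch of MemRL-style two-phase retrieval and EMA Q-value update.
def two_phase_retrieve(memory, sims, k=2, lam=0.5, tau=0.0):
    """Phase 1 (semantic): keep entries whose similarity exceeds tau.
    Phase 2 (utility): rank survivors by lam*sim + (1-lam)*Q."""
    candidates = [(i, s) for i, s in enumerate(sims) if s > tau]
    scored = sorted(candidates,
                    key=lambda t: lam * t[1] + (1 - lam) * memory[t[0]]["Q"],
                    reverse=True)
    return [i for i, _ in scored[:k]]

def ema_update(memory, used, r_t, alpha=0.2):
    """Q_j <- Q_j + alpha * (r_t - Q_j) for every consumed entry."""
    for j in used:
        memory[j]["Q"] += alpha * (r_t - memory[j]["Q"])

memory = [{"e": "trace-0", "Q": 0.5},
          {"e": "trace-1", "Q": 0.9},
          {"e": "trace-2", "Q": 0.1}]
used = two_phase_retrieve(memory, sims=[0.9, 0.4, 0.7], k=2)
ema_update(memory, used, r_t=1.0)   # a successful episode reinforces consumed entries
print(used)  # → [0, 1]
```

Note how a high-similarity but low-utility entry (index 2) is out-ranked by entries whose past utility is higher, which is the point of the utility phase.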

Utility-Driven Natural Selection (Darwinian Memory)

DMS (Mi et al., 30 Jan 2026) implements memory evolution as an ecosystem, where compositional macro-action units compete for retention based on a survival value:

$$S(m_i) = U(n_i) \cdot D(\Delta t, n_i) \cdot P(K_i)$$

with

  • Usage utility: $U(n_i) = \ln(1 + n_i) + v_{\text{new}}$
  • Temporal decay: $D(\Delta t, n_i) = \dfrac{1}{1 + \exp[\beta(\Delta t - T_{\text{half}}(n_i))]}$
  • Reliability: $P(K_i) = 1 / (1 + \gamma K_i)$

High-fitness entries survive; stagnant or risky plans are pruned using an “elbow method.” This model supports $\varepsilon$-mutation and Bayesian risk feedback, yielding strong improvements in task success and stability for GUI agents, all in a training-free pipeline.
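A minimal sketch of the survival value $S(m_i)$, with illustrative choices for $\beta$, $\gamma$, $v_{\text{new}}$, and the half-life schedule $T_{\text{half}}$ (none of these constants come from the paper):

```python
# Sketch of the DMS survival value S(m_i) = U(n_i) * D(dt, n_i) * P(K_i).
# beta, gamma, v_new, and t_half are assumed values for illustration only.
import math

def survival(n_i, dt, K_i, beta=0.1, gamma=0.5, v_new=0.2,
             t_half=lambda n: 10.0 * (1 + n)):
    U = math.log(1 + n_i) + v_new                          # usage utility
    D = 1.0 / (1.0 + math.exp(beta * (dt - t_half(n_i))))  # sigmoidal temporal decay
    P = 1.0 / (1.0 + gamma * K_i)                          # reliability penalty on K_i failures
    return U * D * P

# A frequently used, recently touched, reliable macro-action out-scores
# a rarely used, stale, failure-prone one.
s_good = survival(n_i=20, dt=5.0, K_i=0)
s_bad = survival(n_i=1, dt=50.0, K_i=3)
print(s_good > s_bad)  # → True
```

Pruning then amounts to sorting entries by `survival(...)` and cutting at the elbow of the sorted curve.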

Continuous Feedback and Guideline Synthesis (Live-Evo)

Live-Evo (Zhang et al., 2 Feb 2026) decouples raw experience storage from meta-guideline formation. Experiences are assigned scalar weights, incremented or decremented after contrastive evaluation (with/without guideline). The pipeline:

  • Retrieval of top-k experiences (weighted by historical success) plus a meta-guideline
  • Guideline compilation via an LLM
  • Evaluation, weight adjustment, and meta-guideline expansion as needed

Empirically, the guideline gain $\Delta r$ controls reinforcement and decay:

$$w_e^{\text{new}} = w_e^{\text{old}} + \Delta r$$

This approach achieves substantial Brier score and return gains in live forecasting streams.
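The contrastive weight update can be sketched as below; the experience layout and the reward values are illustrative assumptions:

```python
# Sketch of Live-Evo's contrastive weight update: the guideline gain
# delta_r = reward_with - reward_without drives reinforcement or decay.
def update_weight(experience, reward_with, reward_without):
    delta_r = reward_with - reward_without   # contrastive guideline gain
    experience["w"] += delta_r               # w_e <- w_e + delta_r
    return delta_r

# Hypothetical experience entry; a positive gain reinforces it,
# a negative gain (guideline hurt performance) decays its weight.
exp_entry = {"text": "prefer base rates over anecdotes", "w": 1.0}
gain = update_weight(exp_entry, reward_with=0.7, reward_without=0.4)
print(gain > 0, exp_entry["w"] > 1.0)  # → True True
```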

Skill Evolution and Controller-Designer Loops (MemSkill)

MemSkill (Zhang et al., 2 Feb 2026) replaces static memory operations (insert, update, delete, skip) with a library of skills, each parameterized and learned end-to-end. The controller selects a top-$K$ subset of skills per context span using Gumbel-Top-$K$ policies, trains them via PPO, and evolves the skill set through an LLM-driven “Designer” module that reads hard-case buffers. Skill evolution involves:

  • Scoring failed queries by $d(q) = [1 - r(q)] \cdot c(q)$
  • Clustering hard cases and using an LLM to generate new or refined skills

Across benchmarks, MemSkill improves F1 and step-efficiency, with skills specializing over time toward domain-specific memory extraction.
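The hard-case scoring step can be sketched as follows; the field names `r` (reward) and `c` (a criticality/confidence weight) are assumptions for illustration:

```python
# Sketch of MemSkill's hard-case scoring d(q) = (1 - r(q)) * c(q):
# queries that failed (low r) and matter (high c) surface first.
def hardness_ranking(queries):
    """Rank failed queries so the Designer sees the hardest cases first."""
    scored = [(q["id"], (1 - q["r"]) * q["c"]) for q in queries]
    return sorted(scored, key=lambda t: t[1], reverse=True)

queries = [
    {"id": "q1", "r": 0.9, "c": 0.8},   # mostly solved -> low hardness
    {"id": "q2", "r": 0.1, "c": 0.9},   # failed and important -> high hardness
    {"id": "q3", "r": 0.5, "c": 0.2},   # failed but low-stakes
]
ranked = hardness_ranking(queries)
print(ranked[0][0])  # → q2 tops the hard-case buffer
```

In the full pipeline, the top-ranked cases would then be clustered and handed to the Designer LLM for skill synthesis.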

Generative/Latent Memory (MemGen)

MemGen (Zhang et al., 29 Sep 2025) incorporates memory as a latent token sequence generated dynamically by a memory weaver module, each time a memory trigger signals need. This framework supports emergent planning, procedural, and working memory faculties, with no fixed external database.

Meta-Evolution of Memory Architectures (MemEvolve)

MemEvolve (Zhang et al., 21 Dec 2025) generalizes self-evolution by evolving not just memory contents, but the modular memory architecture itself (encode/store/retrieve/manage), using a dual bilayer meta-evolutionary algorithm. This population-based search optimizes architecture pools by fitness over episodic experience, yielding architectures that generalize across tasks and backbones.

3. Theoretical Properties and Stability–Plasticity Tradeoffs

A central challenge in memory-driven self-evolution is reconciling stable core reasoning (“stability”) with the ability to incorporate new knowledge and strategies (“plasticity”). Theoretical results in MemRL (Zhang et al., 6 Jan 2026) give:

  • Local convergence:

$$\mathbb{E}[Q_t - B] = (1 - \alpha)^t\,(Q_0 - B)$$

for average reward $B(s, m)$

  • Variance control:

$$\limsup_{t \to \infty} \operatorname{Var}(Q_t) \leq \frac{\alpha \sigma^2}{2 - \alpha}$$

  • Global improvement via variational EM-like maximization of a lower bound $J(p, Q)$, supported by Bellman contraction.

Empirical results consistently show that self-evolving memory with value-aware filtering grows cumulative accuracy without catastrophic forgetting, and measured forgetting rates are lower than in passive or hand-engineered memory systems (Zhang et al., 6 Jan 2026, Mi et al., 30 Jan 2026).
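The local-convergence identity can be checked numerically: with a deterministic reward $r_t = B$, the EMA recursion reproduces $(1 - \alpha)^t (Q_0 - B)$ exactly, up to floating-point error. The constants below are arbitrary illustration values:

```python
# Numerical check of E[Q_t - B] = (1 - alpha)^t (Q_0 - B) for the EMA update,
# in the deterministic case r_t = B (so expectation equals the trajectory).
def ema_trajectory(Q0, B, alpha, steps):
    Q = Q0
    traj = [Q]
    for _ in range(steps):
        Q += alpha * (B - Q)     # the EMA update with r_t = B
        traj.append(Q)
    return traj

alpha, Q0, B = 0.3, 0.0, 1.0
traj = ema_trajectory(Q0, B, alpha, steps=10)
for t, Q in enumerate(traj):
    predicted = B + (1 - alpha) ** t * (Q0 - B)
    assert abs(Q - predicted) < 1e-12   # geometric contraction toward B
print(round(traj[-1], 4))  # close to B = 1.0 after 10 steps
```

With noisy rewards the trajectory fluctuates around $B$, with asymptotic variance bounded as in the $\limsup$ result above.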

4. Benchmarks, Metrics, and Empirical Performance

Memory-driven self-evolution frameworks have been evaluated on a range of code generation, embodied task, knowledge reasoning, and GUI automation benchmarks, including BigCodeBench, ALFWorld, Lifelong Agent Bench, and Humanity’s Last Exam (HLE). Metrics include:

  • Last-epoch accuracy
  • Cumulative success rate (CSR)
  • Pass@k (codegen)
  • Success retention and reuse rate (GUI)
  • Task completion rates under composition and generalization
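Two of these metrics admit compact definitions; the sketch below uses the standard unbiased pass@k estimator and a simple running CSR, which are assumptions about the exact variants each benchmark reports:

```python
# Sketch of two metrics above: cumulative success rate (CSR) and the
# standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k).
from math import comb

def csr(successes):
    """Cumulative success rate over a stream of 0/1 episode outcomes."""
    return sum(successes) / len(successes)

def pass_at_k(n, c, k):
    """n samples drawn, c correct; P(at least one of k draws is correct)."""
    if n - c < k:
        return 1.0   # too few failures to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)

print(csr([1, 0, 1, 1]))              # → 0.75
print(round(pass_at_k(10, 3, 2), 4))  # → 0.5333
```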

Representative performance table (adapted from MemRL (Zhang et al., 6 Jan 2026)):

| Method    | CodeGen       | OS Task       | DB Task       | Exploration   | HLE Knowledge |
|-----------|---------------|---------------|---------------|---------------|---------------|
| No Memory | 0.485 / –     | – / 0.577     | – / 0.756     | 0.278 / 0.462 | 0.357 / 0.524 |
| RAG       | 0.475 / 0.483 | 0.690 / 0.700 | 0.914 / 0.916 | 0.370 / 0.415 | 0.430 / 0.475 |
| MemP      | 0.578 / 0.602 | 0.736 / 0.742 | 0.960 / 0.966 | 0.324 / 0.456 | 0.528 / 0.582 |
| MemRL     | 0.595 / 0.627 | 0.794 / 0.816 | 0.960 / 0.972 | 0.507 / 0.697 | 0.573 / 0.613 |

Each cell reports last-epoch accuracy / CSR.

In ALFWorld exploration, MemRL improves last-epoch accuracy from 0.324 to 0.507 (+56% relative) and CSR from 0.456 to 0.697. Ablations demonstrate the key contributions of utility-driven retrieval, normalization, and balanced top-$K$ filtering; $Q$-values correlate strongly with observed success.

5. Limitations, Open Challenges, and Future Directions

Self-evolving memory systems exhibit several limitations:

  • Generalization is constrained in low-similarity or adversarial domains; strict Q-based policies may surface “high-value” but irrelevant or out-of-distribution entries (Zhang et al., 6 Jan 2026).
  • Fixed schedule parameters (e.g., $\lambda$, $K$, the learning rate) can yield suboptimal trade-offs between semantic matching and utility.
  • Current implementations use single-step value propagation; long-horizon temporal credit propagation may be less efficient than full TD-learning or experience replay.
  • Memory growth is unbounded in some regimes; condensation and distillation are active areas for research.
  • Learnable retrieval policies, hierarchical clustering, and meta-learned relevance/utility trade-offs have been proposed as improvements.

These frameworks can be extended by (i) parameterizing retrieval with small networks, (ii) deploying multi-resolution or structured memory banks, (iii) distilling or pruning memory to limit resource usage, and (iv) incorporating meta-evolution of memory architectures themselves (Zhang et al., 21 Dec 2025, Zhang et al., 2 Feb 2026).

6. Distinctions from Retrieval-Augmented Methods

A persistent theme is the shift from pure retrieval-augmented methods—prone to semantic drift and retrieval noise—to value-aware, task-adaptive, and contextually compositional memory, with demonstrable gains on both synthetic and complex real-world tasks (Zhang et al., 6 Jan 2026, Zhang et al., 2 Feb 2026, Zhang et al., 2 Feb 2026, Zhang et al., 29 Sep 2025).

7. Impact and Significance

Memory-driven self-evolution constitutes a key advance in the quest for continually adaptive, robust, and scalable AI agents. By externalizing exploration, exploitation, and utility-weighted learning from the model parameters to plastic memory substrates, these systems reconcile stability and plasticity, enable lifelong learning in open-domain settings, and approach the behavioral flexibility seen in human constructive simulation. Extensive empirical results confirm that agents equipped with evolving memory not only learn faster, but sustain and transfer higher-level knowledge and strategies, moving the field beyond the paradigm of static, non-adaptive inference-time AI (Zhang et al., 6 Jan 2026, Zhang et al., 21 Dec 2025, Mi et al., 30 Jan 2026, Zhang et al., 29 Sep 2025, Zhang et al., 2 Feb 2026).
