Episodic Utility-Guided Memory (MemRL)

Updated 25 March 2026
  • Episodic Utility-Guided Memory (MemRL) is a reinforcement learning paradigm that integrates non-parametric episodic memory with learned utility signals to drive adaptive recall.
  • It employs a two-phase retrieval process—first filtering by semantic similarity, then ranking by Q-value—to select high-value experiences effectively without backpropagation through time.
  • Empirical benchmarks demonstrate MemRL's rapid convergence and improved task success across multi-step challenges compared to baseline methods.

Episodic Utility-Guided Memory (MemRL) refers to a class of reinforcement learning frameworks that couple non-parametric episodic memory with learned utility signals, allowing agents to selectively recall and leverage high-value past experiences to drive continual, online adaptation. The paradigm has been instantiated both in modular language-agent architectures and in low-level state-based reinforcement learning systems, and is characterized by explicit credit assignment to stored memories and runtime utility optimization without backpropagation through time in a recurrent network. Key references include recent advances in LLM-centered memory-augmented agents (Zhang et al., 6 Jan 2026) and utility-weighted reservoir episodic memory for deep RL (Young et al., 2018).

1. Formal Definition and Objective

MemRL maintains a non-parametric episodic memory $\mathcal{M}$, structured as a set of experience tuples augmented with learned utility values. In the LLM-agent context, each memory instance is $(z_i, e_i, Q_i)$, where $z_i \in \mathbb{R}^d$ represents an intent or semantic embedding, $e_i$ is a raw experience (such as a solution trace), and $Q_i$ is the learned state-action value associated with retrieving $e_i$ under queries similar to $z_i$. At each episode, the agent receives a query or state $s_t$ and selects a memory context $m \in \mathcal{M}_t$ following a retrieval policy $p(m \mid s_t, \mathcal{M}_t)$, with the objective of maximizing the expected cumulative discounted reward:

$$\max_{p} \; \mathbb{E}_{s_0, m_0, y_0, r_0, \ldots} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]$$

where the generation policy is composed as:

$$\pi(y_t \mid s_t, \mathcal{M}_t) = \sum_{m \in \mathcal{M}_t} p(m \mid s_t, \mathcal{M}_t) \; \pi_{\rm LLM}(y_t \mid s_t, m)$$

The goal is to learn $p(m \mid s, \mathcal{M})$ using reinforcement signals so as to optimize long-term task success without updating the core reasoning model weights (Zhang et al., 6 Jan 2026).
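
The composed policy above can be sampled in two stages: draw a memory context from the retrieval distribution, then condition the frozen backbone on it. The following Python sketch illustrates this factorization; llm_generate and the retrieval probabilities are assumed to be supplied externally and are not part of the cited papers' code.

import random

def compose_policy(s, memory, retrieval_probs, llm_generate):
    # Sample from the composed policy pi(y | s, M).
    # retrieval_probs: dict mapping memory index -> p(m | s, M) (assumed given).
    # llm_generate:    frozen backbone; callable (s, m) -> response y (hypothetical).
    indices = list(retrieval_probs.keys())
    weights = [retrieval_probs[i] for i in indices]
    # Stage 1: sample a memory context m ~ p(m | s, M)
    m_idx = random.choices(indices, weights=weights, k=1)[0]
    m = memory[m_idx]
    # Stage 2: condition the frozen LLM policy on the retrieved context
    y = llm_generate(s, m)
    return y, m_idx  # m_idx is returned so reward can later be credited to this memory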

2. Memory Architecture and Component Decomposition

MemRL frameworks are modular, comprising two distinct components:

  • Frozen Backbone Policy: In LLM-based agents, this is a fixed LLM providing the generative policy $\pi_{\rm LLM}(y \mid s, m)$. Weights are never updated, ensuring stability and eliminating catastrophic forgetting.
  • Non-Parametric Episodic Memory: A memory bank containing tuples (embedding, experience, utility), implementing the retrieval policy via learned Q-values.

A typical runtime cycle consists of:

  1. Observing the query/state $s$.
  2. Two-phase retrieval to filter and score relevant episodic memories.
  3. Passing the selected memory context to the backbone for action or response generation.
  4. Receiving scalar reward feedback.
  5. Updating memory utilities and optionally appending new summarized experience.

No gradient updates are performed on the backbone; all learning occurs in the non-parametric memory (Zhang et al., 6 Jan 2026, Young et al., 2018).
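
For concreteness, the cycle above operates over a bank of (embedding, experience, utility) tuples. The Python sketch below shows one minimal way such a container could look; the class and field names are illustrative and not taken from the cited papers.

from dataclasses import dataclass, field
from typing import Any, List
import numpy as np

@dataclass
class MemoryEntry:
    z: np.ndarray   # intent/semantic embedding z_i
    e: Any          # raw experience e_i (e.g., a solution trace)
    q: float        # learned utility Q_i for retrieving e_i

@dataclass
class MemoryBank:
    entries: List[MemoryEntry] = field(default_factory=list)

    def add(self, z: np.ndarray, e: Any, q_init: float) -> None:
        # Append a newly summarized experience with its initial utility estimate.
        self.entries.append(MemoryEntry(z=z, e=e, q=q_init))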

3. Utility-Guided Retrieval and Memory Update

The hallmark of MemRL is its utility-driven retrieval process.

Two-Phase Retrieval (LLM-agent context; Zhang et al., 6 Jan 2026):

  • Phase A: Semantic relevance filtering via cosine similarity between the query $s$ and memory embeddings $z_i$. The top-$k_1$ semantically relevant candidates $C(s)$ are selected (with threshold $\theta$ for sparsity).
  • Phase B: Candidates are further ranked by a composite score combining normalized similarity and the learned utility value $Q(z_i, e_i)$:

$$\mathrm{score}(s, z_i, e_i) = (1-\beta)\,\mathrm{norm}(\mathrm{sim}(s, z_i)) + \beta\, Q(z_i, e_i)$$

with $\beta$ controlling the trade-off. The top-$k_2$ memories by score form the retrieval set $\mathcal{M}_{\rm ctx}(s)$.
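
A minimal NumPy sketch of this two-phase procedure is given below. The min-max normalization, the handling of the similarity threshold, and the default hyperparameter values are assumptions made for illustration; the source specifies only the composite score itself.

import numpy as np

def two_phase_retrieve(s_emb, memory, k1=20, k2=4, theta=0.3, beta=0.5):
    # Utility-guided two-phase retrieval (illustrative sketch, not the reference code).
    # s_emb:  query embedding for state s
    # memory: list of (z_i, e_i, Q_i) tuples
    sims = np.array([
        float(np.dot(s_emb, z) / (np.linalg.norm(s_emb) * np.linalg.norm(z) + 1e-8))
        for z, _, _ in memory
    ])
    # Phase A: keep the k1 most similar candidates that clear the sparsity threshold.
    candidates = [i for i in np.argsort(-sims)[:k1] if sims[i] >= theta]
    if not candidates:
        return []
    # Phase B: rank by a composite of normalized similarity and learned utility Q.
    cand_sims = sims[candidates]
    norm_sims = (cand_sims - cand_sims.min()) / (cand_sims.max() - cand_sims.min() + 1e-8)
    qs = np.array([memory[i][2] for i in candidates])
    scores = (1.0 - beta) * norm_sims + beta * qs
    ranked = [candidates[j] for j in np.argsort(-scores)[:k2]]
    return ranked  # indices of the retrieval set M_ctx(s)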

Q-value Bellman Update:

In one-step RL settings, Q-values are updated via the Monte Carlo target:

$$Q_{\rm new} = Q_{\rm old} + \alpha\,(r - Q_{\rm old})$$

where $\alpha$ denotes the learning rate.
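
As a quick worked example with illustrative numbers (not taken from the paper): with $Q_{\rm old} = 0.40$, a received reward $r = 1$, and $\alpha = 0.1$, the update yields $Q_{\rm new} = 0.40 + 0.1\,(1 - 0.40) = 0.46$; a subsequent failure ($r = 0$) would pull the estimate back to $0.46 + 0.1\,(0 - 0.46) = 0.414$.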

Reservoir Sampling Variant (Young et al., 2018):

Here, the memory buffer has a fixed size and each state is written with a utility-driven weight $w_t \in (0,1)$ issued by a learnable write network, ensuring the stored subset forms a weighted reservoir sample over all observed states. Write utility is directly boosted when recalled states yield positive TD error, via gradient estimates of the form $\delta_t / w_i$ (for the queried slot), enabling precise, single-step credit assignment for what to remember.
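
One standard way to maintain a fixed-size weighted reservoir is the exponential-key method of Efraimidis and Spirakis, sketched below to illustrate how a write weight $w_t$ can control retention probability. This is an illustrative stand-in, not necessarily the ReservoirUpdate procedure used by Young et al. (2018).

import heapq
import random

class WeightedReservoir:
    # Fixed-size weighted reservoir via exponential keys: each item gets key
    # u**(1/w) with u ~ Uniform(0, 1), and the k largest keys are kept, so the
    # probability of retention grows with the write weight w.

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._count = 0
        self.heap = []  # min-heap of (key, insertion_index, state)

    def write(self, state, w: float) -> None:
        key = random.random() ** (1.0 / max(w, 1e-8))
        self._count += 1
        item = (key, self._count, state)
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
        elif key > self.heap[0][0]:
            heapq.heapreplace(self.heap, item)  # evict the smallest-key entry

    def contents(self):
        return [state for _, _, state in self.heap]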

4. Learning Process and Algorithmic Workflow

MemRL algorithms share a common trait: online, trial-and-error credit propagation to the memory utility store rather than to the parametric base model.

LLM-agent Learning Loop (Zhang et al., 6 Jan 2026):

Initialize M = ∅
for each episode:
    observe s
    candidates = Phase_A(s)
    M_ctx = Phase_B(candidates)
    y = LLM(s, M_ctx)
    r = observe_reward()
    for m in M_ctx:
        Q[m] += α * (r - Q[m])
    e_new = summarize(s, y, r)
    z_new = embed(s)
    add (z_new, e_new, Q_init=r) to M

The memory grows over time, with utility-guided pruning an open challenge.
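
Because the cited work leaves pruning open, the following is only a naive illustration of what a cap-based, utility-guided eviction rule might look like; the scoring mix and parameters are entirely hypothetical.

def prune_memory(memory, max_size, recency_bonus=0.01):
    # Naive utility-guided eviction: keep the max_size entries with the highest
    # score, where the score mixes learned utility Q with a small recency bonus.
    # Purely illustrative; the cited work does not prescribe a pruning rule.
    # memory: list of (z, e, q) tuples, ordered oldest-first.
    if len(memory) <= max_size:
        return memory
    scored = [(q + recency_bonus * idx, idx) for idx, (_, _, q) in enumerate(memory)]
    keep = sorted(idx for _, idx in sorted(scored, reverse=True)[:max_size])
    return [memory[i] for i in keep]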

Reservoir Sampling Variant (Young et al., 2018):

A write network emits a weight $w_t$ for each observed state. When a memory is queried and proves useful (high TD error), only the corresponding $w_i$ is updated, which is accomplished without backpropagation through time. ReservoirUpdate maintains the correct weighted-sampling distribution with $O(n)$ cost.
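
The single-slot credit assignment can be sketched as follows. The learning rate, clipping, and the direct update of the stored weight are assumptions; in the actual method the gradient-style signal $\delta_t / w_i$ would flow into the write network's parameters rather than the stored weight itself.

def update_write_weight(w, queried_index, td_error, lr=0.05,
                        w_min=1e-3, w_max=1.0 - 1e-3):
    # Schematic single-slot credit assignment following the delta_t / w_i signal:
    # a recalled slot that produced a useful (positive-TD-error) lookup has its
    # write weight increased, keeping weights inside (0, 1).
    grad = td_error / max(w[queried_index], 1e-8)
    w[queried_index] = min(max(w[queried_index] + lr * grad, w_min), w_max)
    return w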

5. Stability–Plasticity Tradeoff and Convergence

MemRL explicitly disentangles stability from plasticity by freezing the core model and continuously adapting only the non-parametric memory. Theoretical analysis in (Zhang et al., 6 Jan 2026) demonstrates:

  • Q-value estimates for each $(s, m)$ pair converge exponentially fast to the stationary expected reward when using a constant $\alpha$.
  • The variance of the Q-estimate is bounded by $\frac{\alpha}{2-\alpha}\,\mathrm{Var}(r)$.
  • Global convergence is guaranteed via a Generalized EM framework, interpreting Phase-B retrieval as the E-step and the Q-update as the M-step; monotonic improvement holds and the retrieval policy reaches a stationary point. This addresses the stability–plasticity dilemma found in neural network agents, achieving strong retention (no catastrophic forgetting) in the backbone and high plasticity in the episodic store.
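
The variance bound follows from a standard steady-state argument for a constant-step-size exponential average, assuming i.i.d. rewards with variance $\mathrm{Var}(r)$ (a sketch, not reproduced from the paper). Writing the update as $Q_{\rm new} = (1-\alpha)\,Q_{\rm old} + \alpha r$ and equating variances at the fixed point $V$:

$$V = (1-\alpha)^2 V + \alpha^2\,\mathrm{Var}(r) \quad\Longrightarrow\quad V = \frac{\alpha^2}{1-(1-\alpha)^2}\,\mathrm{Var}(r) = \frac{\alpha}{2-\alpha}\,\mathrm{Var}(r)$$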

6. Empirical Performance and Benchmarks

MemRL has been evaluated in a diverse suite of benchmarks:

Benchmark       No Memory    RAG            MemP           MemRL
BigCodeBench    – / 0.577    0.475/0.483    0.578/0.602    0.595/0.627
Lifelong-OS     – / 0.756    0.690/0.700    0.736/0.742    0.794/0.816
Lifelong-DB     – / 0.928    0.914/0.916    0.960/0.966    0.960/0.972
ALFWorld        – / 0.462    0.370/0.415    0.324/0.456    0.507/0.697
HLE             – / 0.524    0.430/0.475    0.528/0.582    0.573/0.613

MemRL excels especially on multi-step tasks (e.g., +24.1 pp CSR in ALFWorld vs utility-matching baseline MemP), but shows consistent improvements across both single-turn and multi-step environments. Freezing the learned memory for transfer robustly outperforms baseline agents on held-out tasks (Zhang et al., 6 Jan 2026). In the secret informant POMDP, utility-guided episodic agents with reservoir sampling efficiently learn to selectively retain only genuinely informative states, delivering rapid convergence compared to recurrent actor-critic agents (Young et al., 2018).

7. Limitations, Variants, and Outlook

Limitations

  • Reliance on the availability and informativeness of runtime reward signals.
  • Unchecked memory growth in non-reservoir variants introduces scaling and indexing overhead.
  • Cold-start conditions: absent learned Q-values, initial retrieval may rely purely on semantic relevance.

Extensions and Active Directions

  • Alternative utility measures, e.g., n-step returns or contextual bandit utility.
  • Memory pruning, hierarchical/structured memory, or compressed representation to cap growth.
  • Application to continuous control with learned state embeddings.
  • Meta-learning of retrieval trade-offs, dynamic $\beta$, or Bayesian uncertainty in Q-estimates for guided exploration.
  • Empirical study of memory saturation and trade-offs in high-throughput, real-world deployments.

Research in Episodic Utility-Guided Memory has established its efficacy in bridging stable core inference with adaptive, continuous learning through non-parametric memory, substantiated by both theoretical guarantees and broad empirical validation (Zhang et al., 6 Jan 2026, Young et al., 2018).
