ReMem in LLM Agent Test-Time Learning
- ReMem is a dynamic memory architecture that enables LLM agents to adapt at test time using continuous, multi-stage experience distillation.
- It integrates retrieval and context-adaptive rewriting to transform agent experiences into actionable procedural knowledge, leveraging both success patterns and failure triggers.
- The approach achieves state-of-the-art performance in tool-use and sequential reasoning tasks, demonstrating significant improvements over traditional retrieval-only methods.
ReMem is a dynamic procedural memory architecture designed to enable genuine test-time learning in LLM agents via a continually evolving, experience-driven memory. Unlike static, append-only memory systems, ReMem tightly integrates multi-stage distillation of agent experience, context-adaptive retrieval with rewriting, and empirical utility-driven refinement, yielding a feedback loop that allows LLM agents to internalize, adapt, and reorganize “how-to” knowledge from ongoing interactions without parameter updates. This approach systematically addresses the limitations of passive memory accumulation and establishes new state-of-the-art results in agent lifelong learning, as demonstrated on tool-use and sequential reasoning tasks (Cao et al., 11 Dec 2025, Wei et al., 25 Nov 2025).
1. Architectural Foundations and Problem Scope
ReMem operates within the growing paradigm of memory-augmented LLM agents deployed in continuous, evolving task streams. At each task time step $t$, the agent maintains a stateful memory $\mathcal{M}_t$ comprising distilled “mini-experiences” and a task-based state $s_t = (q_t, h_t)$, where $h_t$ denotes intermediate reasoning traces. The agent selects from three discrete, policy-driven actions: Think, Act, and Refine.
The core goal is to enable progressive optimization of the agent’s behavioral policy across sequential, non-stationary settings, overcoming the bottleneck of shallow retrieval-only memory seen in baseline methods such as ExpRAG. In the ReMem pipeline, memory is neither static nor monolithic, but is actively curated and contextually adapted at each test-time step, forming the basis of continual agent improvement (Cao et al., 11 Dec 2025, Wei et al., 25 Nov 2025).
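This action space can be pictured as a small policy-driven dispatch loop. The sketch below is illustrative only: the Think/Act/Refine names come from the paper, while the `Agent` interface, method names, and step budget are assumptions.

```python
from enum import Enum

class Action(Enum):
    THINK = "think"    # extend intermediate reasoning traces h_t
    ACT = "act"        # take a tool call / environment step
    REFINE = "refine"  # reorganize or prune procedural memory

def run_episode(agent, task, max_steps=20):
    """Alternate policy-chosen Think/Act/Refine steps until the task resolves."""
    state = agent.init_state(task)                  # task-based state s_t
    for _ in range(max_steps):
        action = agent.policy(state, agent.memory)  # pick the next discrete action
        if action is Action.THINK:
            state = agent.think(state)              # update reasoning traces
        elif action is Action.ACT:
            state, done = agent.act(state)          # environment interaction
            if done:
                break
        else:  # Action.REFINE
            agent.memory = agent.refine(agent.memory, state)  # memory maintenance
    return state
```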
2. Multi-Faceted Experience Distillation
ReMem’s memory entries are built from explicit multi-faceted experience distillation, going beyond simple storage of raw trajectories. For each episode, sampled trajectories $\{\tau_i\}$ are ranked by reward $r(\tau_i)$, from which the top-$k$ successful ($\mathcal{T}^{+}$) and bottom-$k$ failed ($\mathcal{T}^{-}$) interactions are selected, as in the sketch below.
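A minimal sketch of this selection step (the `reward` attribute and function name are assumptions, not the paper's API):

```python
def select_contrastive_sets(trajectories, k):
    """Rank sampled trajectories by reward; return (top-k successes, bottom-k failures)."""
    ranked = sorted(trajectories, key=lambda t: t.reward, reverse=True)
    return ranked[:k], ranked[-k:]  # T_plus, T_minus (assumes len(trajectories) >= 2k)
```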
The summarizer LLM then extracts three complementary types of reusable procedural knowledge:
- Success Patterns: steps that maximize agreement with the successful episodes in $\mathcal{T}^{+}$.
- Failure Triggers: steps strongly correlated with unsuccessful outcomes; the summarizer is penalized when it fails to capture the earliest failure point of a trajectory.
- Comparative Insights: for each success–failure pair $(\tau^{+}, \tau^{-})$, comparative experiences are derived under a contrastive objective (see the hedged formalization after this list).
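The exact loss terms are not reproduced in this summary; the block below is an illustrative formalization only, in which the scoring function $\mathrm{sim}(\cdot,\cdot)$, the failure-step functions $t_{\mathrm{pred}}$ and $t_{\mathrm{fail}}$, and the margin $m$ are assumed notation rather than the paper's:

```latex
% Illustrative formalization only; sim, t_pred, t_fail, and m are assumed symbols.
\begin{align}
  \mathcal{L}_{\mathrm{succ}} &= -\,\mathbb{E}_{\tau \in \mathcal{T}^{+}}\big[\mathrm{sim}(e,\tau)\big]
    && \text{reward agreement with successes} \\
  \mathcal{L}_{\mathrm{fail}} &= \mathbb{E}_{\tau \in \mathcal{T}^{-}}\big[\max\big(0,\, t_{\mathrm{pred}}(\tau) - t_{\mathrm{fail}}(\tau)\big)\big]
    && \text{penalize missing the earliest failure step} \\
  \mathcal{L}_{\mathrm{comp}} &= \mathbb{E}_{(\tau^{+},\tau^{-})}\big[\max\big(0,\, m - \mathrm{sim}(e,\tau^{+}) + \mathrm{sim}(e,\tau^{-})\big)\big]
    && \text{contrastive margin between paired episodes}
\end{align}
```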
The joint distillation objective yields high-quality, keypoint-level procedural experiences, each parametrized as a tuple $e = (s, d, w, c, g)$, where $s$ is the usage scenario, $d$ the distilled experience, $w$ the keywords, $c$ a confidence score, and $g$ the tool tags (Cao et al., 11 Dec 2025).
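In code, each entry is naturally a small record. The sketch below mirrors that tuple, with the usage counters of Section 4 included; all field names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Experience:
    """One distilled procedural-memory entry e = (s, d, w, c, g); names assumed."""
    scenario: str            # s: usage scenario the experience applies to
    distilled: str           # d: keypoint-level procedural knowledge
    keywords: list[str]      # w: retrieval keywords
    confidence: float        # c: summarizer confidence in [0, 1]
    tool_tags: list[str]     # g: associated tool identifiers
    retrievals: int = 0      # n_i: times this entry was retrieved (Section 4)
    successes: int = 0       # u_i: times it contributed to a success (Section 4)
```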
3. Context-Adaptive Memory Retrieval and Reuse
At test time, ReMem emphasizes not only memory retrieval but also scenario-conditioned adaptation. For a new input $q$, memory entries are retrieved by cosine similarity between the query embedding $\phi(q)$ and each entry's scenario embedding $\phi(s_i)$:

$$\mathrm{sim}(q, e_i) = \frac{\phi(q) \cdot \phi(s_i)}{\lVert \phi(q) \rVert \, \lVert \phi(s_i) \rVert}.$$
The agent retrieves the top-$k$ entries, optionally reranks them with a secondary LLM, and computes a soft attention-weighted context:

$$c \;=\; \sum_{i=1}^{k} \alpha_i \, d_i, \qquad \alpha_i \;=\; \frac{\exp\big(\mathrm{sim}(q, e_i)\big)}{\sum_{j=1}^{k} \exp\big(\mathrm{sim}(q, e_j)\big)}.$$
A context-rewriting module then adapts each memory trace for compatibility with the current scenario; the composed input is forwarded to the execution LLM for action or answer generation. This mechanism ensures that experiences are dynamically aligned to evolving task constraints and context at each test step (Cao et al., 11 Dec 2025).
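A minimal sketch of the retrieval-and-weighting step, assuming an embedding function `embed` and the `Experience` record above (reranking and rewriting are left as placeholders):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_context(query, pool, embed, k=5):
    """Top-k scenario-similarity retrieval with softmax attention weights."""
    q = embed(query)
    sims = np.array([cosine(q, embed(e.scenario)) for e in pool])
    top = np.argsort(sims)[::-1][:k]       # indices of the k most similar entries
    weights = np.exp(sims[top])
    weights /= weights.sum()               # softmax over top-k similarities
    return [(w, pool[i]) for w, i in zip(weights, top)]
```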
4. Utility-Based Memory Refinement and Pruning
ReMem maintains a self-regulating memory pool through continual empirical utility estimation. For each memory entry $e_i$, two counters are tracked: $n_i$ (retrievals) and $u_i$ (success contributions). Addition is selective: only distilled trajectories whose confidence $c$ exceeds a threshold are incorporated, and at most $R_{\max}$ self-reflections are executed after a failure.
Pruning proceeds as:

$$\text{prune}(e_i) \iff n_i \geq \alpha \;\wedge\; \frac{u_i}{n_i} < \beta,$$

where $\alpha$ is a threshold for minimum retrievals and $\beta$ a utility cutoff. Low-utility or obsolete experiences are periodically deleted to maintain a compact, high-quality memory pool. This active curation distinguishes ReMem from passive log-aggregation approaches (Cao et al., 11 Dec 2025).
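Under these definitions the rule reduces to a single filter over the pool; a sketch, with `alpha` and `beta` matching the thresholds used in the pseudocode of Section 5:

```python
def prune(pool, alpha, beta):
    """Keep entries that are either too young to judge (n_i < alpha)
    or whose empirical utility u_i / n_i is still above the cutoff beta."""
    return [e for e in pool
            if e.retrievals < alpha
            or e.successes / e.retrievals >= beta]
```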
5. Integrated Test-Time Learning Loop
The full ReMem workflow constitutes an iterative experiential feedback system at test time. The agent alternates between:
- Retrieval and context adaptation of relevant experiences
- Execution and, if necessary, multi-step self-reflection via memory adaptation
- Success-driven extraction of new distilled experiences
- Per-instance utility updates and pruning
The corresponding pseudocode is as follows:
```python
for q in task_stream:
    candidates = Retrieve(MemoryPool, q)           # scenario-similarity retrieval
    adapted_ctx = Rewrite(candidates, q)           # context-adaptive rewriting
    output, success = LLM_execute(adapted_ctx)
    reflections = 0
    while not success and reflections < R_max:     # bounded self-reflection loop
        fail_insights = SummarizeFailure(output)
        adapted_ctx = AdaptWith(fail_insights, q)
        output, success = LLM_execute(adapted_ctx) # re-execute with failure insights
        reflections += 1
    if success:
        new_exps = SummarizeSuccess(output)        # distill fresh experiences
        MemoryPool.add(new_exps)
    MemoryPool.update_stats(candidates, success)   # update n_i, u_i counters
    MemoryPool.prune(alpha, beta)                  # utility-based pruning
```
6. Comparative Performance and Empirical Results
ReMem has been evaluated on multi-tool and coding benchmarks such as BFCL-V3 and AppWorld, as well as a broad suite of single-turn and multi-turn reasoning datasets (AIME, GPQA-Diamond, MMLU-Pro, ToolBench, AlfWorld, BabyAI, ScienceWorld, Jericho, PDDL tasks) (Cao et al., 11 Dec 2025, Wei et al., 25 Nov 2025).
Empirical findings include:
| Model/Memory | Avg@4 (%) | Pass@4 (%) |
|---|---|---|
| Qwen3-8B (no mem) | 27.65 | 46.20 |
| Qwen3-8B + ReMe | 34.94 | 55.03 |
| Qwen3-14B (no mem) | 35.62 | 54.65 |
Notably, Qwen3-8B with ReMe matches or surpasses the larger Qwen3-14B baseline on Pass@4, demonstrating a pronounced “memory-scaling effect.” A similar effect is observed in benchmarks with Gemini-2.5 and Claude backbones: ReMem attains higher accuracy, step efficiency, and sequence robustness than ExpRAG and retrieval-only baselines. Under noisy or heterogeneous feedback, ReMem maintains superior performance owing to its principled memory pruning and reasoning–memory integration. Correlation analyses indicate that gains scale with task similarity (Cao et al., 11 Dec 2025, Wei et al., 25 Nov 2025).
7. Systemic Comparison and Design Implications
Compared to retrieval-augmented approaches like ExpRAG, which append episodic memories and perform single-pass retrieval, ReMem's Think–Act–Refine paradigm involves:
- Explicit multi-stage reasoning step (“Think”)
- Dedicated memory organization/refinement phase (“Refine”)
- Policy-driven alternation between reasoning, action, and memory maintenance
- Continual memory curation integrating feedback, summarization, and utility pruning
This design supports robust, computation-efficient test-time learning without explicit retraining or parameter updates. End-to-end, ReMem’s architecture offers an operational template for constructing LLM agents capable of genuine lifelong procedural memory formation, adaptation, and exploitation, as required in real-world continuous learning environments (Cao et al., 11 Dec 2025, Wei et al., 25 Nov 2025).