ReMem in LLM Agent Test-Time Learning
- ReMem is a dynamic memory architecture that enables LLM agents to adapt at test time using continuous, multi-stage experience distillation.
- It integrates retrieval and context-adaptive rewriting to transform agent experiences into actionable procedural knowledge, leveraging both success patterns and failure triggers.
- The approach achieves state-of-the-art performance in tool-use and sequential reasoning tasks, demonstrating significant improvements over traditional retrieval-only methods.
ReMem is a dynamic procedural memory architecture designed to enable genuine test-time learning in LLM agents via a continually evolving, experience-driven memory. Unlike static, append-only memory systems, ReMem tightly integrates multi-stage distillation of agent experience, context-adaptive retrieval with rewriting, and empirical utility-driven refinement, yielding a feedback loop that allows LLM agents to internalize, adapt, and reorganize “how-to” knowledge from ongoing interactions without parameter updates. This approach systematically addresses the limitations of passive memory accumulation and establishes new state-of-the-art results in agent lifelong learning, as demonstrated on tool-use and sequential reasoning tasks (Cao et al., 11 Dec 2025, Wei et al., 25 Nov 2025).
1. Architectural Foundations and Problem Scope
ReMem operates within the growing paradigm of memory-augmented LLM agents deployed in continuous, evolving task streams. At each task time step $t$, the agent maintains a stateful memory $\mathcal{M}_t$ comprising distilled “mini-experiences” and a task-based state $s_t = (q_t, h_t)$, where $h_t$ denotes intermediate reasoning traces. The agent selects from three discrete, policy-driven actions: Think, Act, and Refine.
The core goal is to enable progressive optimization of the agent’s behavioral policy across sequential, non-stationary settings, overcoming the bottleneck of shallow retrieval-only memory seen in baseline methods such as ExpRAG. In the ReMem pipeline, memory is neither static nor monolithic, but is actively curated and contextually adapted at each test-time step, forming the basis of continual agent improvement (Cao et al., 11 Dec 2025, Wei et al., 25 Nov 2025).
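This action space can be pictured as a small policy-driven dispatch loop. The sketch below is illustrative only: the Think/Act/Refine names come from the paper, while the `Agent` interface, method names, and step budget are assumptions.

```python
from enum import Enum

class Action(Enum):
    THINK = "think"    # extend intermediate reasoning traces h_t
    ACT = "act"        # take a tool call / environment step
    REFINE = "refine"  # reorganize or prune procedural memory

def run_episode(agent, task, max_steps=20):
    """Alternate policy-chosen Think/Act/Refine steps until the task resolves."""
    state = agent.init_state(task)                  # task-based state s_t
    for _ in range(max_steps):
        action = agent.policy(state, agent.memory)  # pick the next discrete action
        if action is Action.THINK:
            state = agent.think(state)              # update reasoning traces
        elif action is Action.ACT:
            state, done = agent.act(state)          # environment interaction
            if done:
                break
        else:  # Action.REFINE
            agent.memory = agent.refine(agent.memory, state)  # memory maintenance
    return state
```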
2. Multi-Faceted Experience Distillation
ReMem’s memory entries are built from explicit multi-faceted experience distillation, going beyond simple storage of raw trajectories. For each episode, sampled trajectories $\{\tau_i\}$ are ranked by reward $r(\tau_i)$, from which the top-$k$ successful ($\mathcal{T}^{+}$) and bottom-$k$ failed ($\mathcal{T}^{-}$) interactions are selected, as in the sketch below.
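A minimal sketch of this selection step (the `reward` attribute and function name are assumptions, not the paper's API):

```python
def select_contrastive_sets(trajectories, k):
    """Rank sampled trajectories by reward; return (top-k successes, bottom-k failures)."""
    ranked = sorted(trajectories, key=lambda t: t.reward, reverse=True)
    return ranked[:k], ranked[-k:]  # T_plus, T_minus (assumes len(trajectories) >= 2k)
```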
The summarizer LLM then extracts three complementary types of reusable procedural knowledge:
- Success Patterns: steps that maximize agreement with the successful episodes in $\mathcal{T}^{+}$.
- Failure Triggers: steps strongly correlated with unsuccessful outcomes; the summarizer is penalized when it fails to capture the earliest failure point of a trajectory.
- Comparative Insights: for each success–failure pair $(\tau^{+}, \tau^{-})$, comparative experiences are derived under a contrastive objective (see the hedged formalization after this list).
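The exact loss terms are not reproduced in this summary; the block below is an illustrative formalization only, in which the scoring function $\mathrm{sim}(\cdot,\cdot)$, the failure-step functions $t_{\mathrm{pred}}$ and $t_{\mathrm{fail}}$, and the margin $m$ are assumed notation rather than the paper's:

```latex
% Illustrative formalization only; sim, t_pred, t_fail, and m are assumed symbols.
\begin{align}
  \mathcal{L}_{\mathrm{succ}} &= -\,\mathbb{E}_{\tau \in \mathcal{T}^{+}}\big[\mathrm{sim}(e,\tau)\big]
    && \text{reward agreement with successes} \\
  \mathcal{L}_{\mathrm{fail}} &= \mathbb{E}_{\tau \in \mathcal{T}^{-}}\big[\max\big(0,\, t_{\mathrm{pred}}(\tau) - t_{\mathrm{fail}}(\tau)\big)\big]
    && \text{penalize missing the earliest failure step} \\
  \mathcal{L}_{\mathrm{comp}} &= \mathbb{E}_{(\tau^{+},\tau^{-})}\big[\max\big(0,\, m - \mathrm{sim}(e,\tau^{+}) + \mathrm{sim}(e,\tau^{-})\big)\big]
    && \text{contrastive margin between paired episodes}
\end{align}
```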
The joint distillation objective yields high-quality, keypoint-level procedural experiences, each parametrized as a tuple $e = (s, d, w, c, g)$, where $s$ is the usage scenario, $d$ the distilled experience, $w$ the keywords, $c$ a confidence score, and $g$ the tool tags (Cao et al., 11 Dec 2025).
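In code, each entry is naturally a small record. The sketch below mirrors that tuple, with the usage counters of Section 4 included; all field names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Experience:
    """One distilled procedural-memory entry e = (s, d, w, c, g); names assumed."""
    scenario: str            # s: usage scenario the experience applies to
    distilled: str           # d: keypoint-level procedural knowledge
    keywords: list[str]      # w: retrieval keywords
    confidence: float        # c: summarizer confidence in [0, 1]
    tool_tags: list[str]     # g: associated tool identifiers
    retrievals: int = 0      # n_i: times this entry was retrieved (Section 4)
    successes: int = 0       # u_i: times it contributed to a success (Section 4)
```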
3. Context-Adaptive Memory Retrieval and Reuse
At test time, ReMem emphasizes not only memory retrieval but also scenario-conditioned adaptation. For a new input $q$, memory entries are retrieved by cosine similarity between the query embedding $\phi(q)$ and each entry's scenario embedding $\phi(s_i)$:

$$\mathrm{sim}(q, e_i) = \frac{\phi(q) \cdot \phi(s_i)}{\lVert \phi(q) \rVert \, \lVert \phi(s_i) \rVert}.$$
The agent retrieves the top-$k$ entries, optionally reranks them with a secondary LLM, and computes a soft attention-weighted context:

$$c \;=\; \sum_{i=1}^{k} \alpha_i \, d_i, \qquad \alpha_i \;=\; \frac{\exp\big(\mathrm{sim}(q, e_i)\big)}{\sum_{j=1}^{k} \exp\big(\mathrm{sim}(q, e_j)\big)}.$$
A context-rewriting module then adapts each memory trace for compatibility with the current scenario; the composed input is forwarded to the execution LLM for action or answer generation. This mechanism ensures that experiences are dynamically aligned to evolving task constraints and context at each test step (Cao et al., 11 Dec 2025).
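A minimal sketch of the retrieval-and-weighting step, assuming an embedding function `embed` and the `Experience` record above (reranking and rewriting are left as placeholders):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_context(query, pool, embed, k=5):
    """Top-k scenario-similarity retrieval with softmax attention weights."""
    q = embed(query)
    sims = np.array([cosine(q, embed(e.scenario)) for e in pool])
    top = np.argsort(sims)[::-1][:k]       # indices of the k most similar entries
    weights = np.exp(sims[top])
    weights /= weights.sum()               # softmax over top-k similarities
    return [(w, pool[i]) for w, i in zip(weights, top)]
```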
4. Utility-Based Memory Refinement and Pruning
ReMem maintains a self-regulating memory pool through continual empirical utility estimation. For each memory entry $e_i$, two counters are tracked: $n_i$ (retrievals) and $u_i$ (success contributions). Addition is selective: only distilled trajectories whose confidence $c$ exceeds a threshold are incorporated, and at most $R_{\max}$ self-reflections are executed after a failure.
Pruning proceeds as:

$$\text{prune}(e_i) \iff n_i \geq \alpha \;\wedge\; \frac{u_i}{n_i} < \beta,$$

where $\alpha$ is a threshold for minimum retrievals and $\beta$ a utility cutoff. Low-utility or obsolete experiences are periodically deleted to maintain a compact, high-quality memory pool. This active curation distinguishes ReMem from passive log-aggregation approaches (Cao et al., 11 Dec 2025).
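Under these definitions the rule reduces to a single filter over the pool; a sketch, with `alpha` and `beta` matching the thresholds used in the pseudocode of Section 5:

```python
def prune(pool, alpha, beta):
    """Keep entries that are either too young to judge (n_i < alpha)
    or whose empirical utility u_i / n_i is still above the cutoff beta."""
    return [e for e in pool
            if e.retrievals < alpha
            or e.successes / e.retrievals >= beta]
```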
5. Integrated Test-Time Learning Loop
The full ReMem workflow constitutes an iterative experiential feedback system at test time. The agent alternates between:
- Retrieval and context adaptation of relevant experiences
- Execution and, if necessary, multi-step self-reflection via memory adaptation
- Success-driven extraction of new distilled experiences
- Per-instance utility updates and pruning
The corresponding pseudocode is as follows:
```python
for q in task_stream:
    candidates = Retrieve(MemoryPool, q)           # scenario-similarity retrieval
    adapted_ctx = Rewrite(candidates, q)           # context-adaptive rewriting
    output, success = LLM_execute(adapted_ctx)
    reflections = 0
    while not success and reflections < R_max:     # bounded self-reflection loop
        fail_insights = SummarizeFailure(output)
        adapted_ctx = AdaptWith(fail_insights, q)
        output, success = LLM_execute(adapted_ctx) # re-execute with failure insights
        reflections += 1
    if success:
        new_exps = SummarizeSuccess(output)        # distill fresh experiences
        MemoryPool.add(new_exps)
    MemoryPool.update_stats(candidates, success)   # update n_i, u_i counters
    MemoryPool.prune(alpha, beta)                  # utility-based pruning
```
6. Comparative Performance and Empirical Results
ReMem has been evaluated on multi-tool and coding benchmarks such as BFCL-V3 and AppWorld, as well as a broad suite of single-turn and multi-turn reasoning datasets (AIME, GPQA-Diamond, MMLU-Pro, ToolBench, AlfWorld, BabyAI, ScienceWorld, Jericho, PDDL tasks) (Cao et al., 11 Dec 2025, Wei et al., 25 Nov 2025).
Empirical findings include:
| Model/Memory | Avg@4 (%) | Pass@4 (%) |
|---|---|---|
| Qwen3-8B (no mem) | 27.65 | 46.20 |
| Qwen3-8B + ReMe | 34.94 | 55.03 |
| Qwen3-14B (no mem) | 35.62 | 54.65 |
Notably, Qwen3-8B with ReMe matches or surpasses the larger Qwen3-14B baseline on Pass@4, demonstrating a pronounced “memory-scaling effect.” A similar effect is observed in benchmarks with Gemini-2.5 and Claude backbones: ReMem attains higher accuracy, step efficiency, and sequence robustness than ExpRAG and retrieval-only baselines. Under noisy or heterogeneous feedback, ReMem maintains superior performance owing to its principled memory pruning and reasoning–memory integration. Correlation analyses indicate that gains scale with task similarity (Cao et al., 11 Dec 2025, Wei et al., 25 Nov 2025).
7. Systemic Comparison and Design Implications
Compared to retrieval-augmented approaches like ExpRAG, which append episodic memories and perform single-pass retrieval, ReMem's Think–Act–Refine paradigm involves:
- Explicit multi-stage reasoning step (“Think”)
- Dedicated memory organization/refinement phase (“Refine”)
- Policy-driven alternation between reasoning, action, and memory maintenance
- Continual memory curation integrating feedback, summarization, and utility pruning
This design supports robust, computation-efficient test-time learning without explicit retraining or parameter updates. End-to-end, ReMem’s architecture offers an operational template for constructing LLM agents capable of genuine lifelong procedural memory formation, adaptation, and exploitation, as required in real-world continuous learning environments (Cao et al., 11 Dec 2025, Wei et al., 25 Nov 2025).