- The paper introduces an RL framework in which a Memory Manager and an Answer Agent are jointly optimized so that memory operations and memory use directly improve answer accuracy.
- It employs PPO and GRPO to fine-tune both agents for CRUD-style memory updates and for selective distillation of relevant dialogue context.
- The model outperforms baselines on LOCOMO with substantial gains in F1, BLEU-1, and semantic correctness, even with minimal training data.
Memory-R1: Reinforcement Learning for Memory Management in LLM Agents
Introduction
Memory-R1 presents a reinforcement learning (RL) framework for augmenting LLM agents with adaptive, structured memory management and utilization capabilities. The stateless nature of LLMs, constrained by finite context windows, limits their ability to perform long-horizon reasoning and maintain persistent knowledge across multi-session dialogues. Existing approaches typically rely on static, heuristic-driven memory pipelines, which are suboptimal for dynamic, evolving conversational contexts. Memory-R1 addresses these limitations by introducing two RL-fine-tuned agents: a Memory Manager for CRUD-style memory operations and an Answer Agent for selective memory distillation and reasoning.
Methodology
Memory-R1 Architecture
Memory-R1 consists of two specialized components:
- Memory Manager: Trained via RL (PPO or GRPO), this agent decides whether to ADD, UPDATE, DELETE, or NOOP for each new piece of information extracted from dialogue turns. The manager operates over a temporal memory bank, incrementally evolving the memory state to maximize downstream QA performance.
- Answer Agent: Also RL-fine-tuned, this agent receives up to 60 candidate memories retrieved via RAG for each question. It applies a Memory Distillation policy to filter and select the most relevant entries, then generates the final answer conditioned on the distilled context. A sketch of this two-agent flow follows the list.
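The sketch below is a minimal illustration of how such a two-agent loop could be wired together, not the authors' implementation: the `MemoryBank` container, the prompt formats, and the helper names (`manage_memory`, `distill_and_answer`, `call_llm`) are assumptions standing in for the RL-fine-tuned policy models.

```python
# Minimal sketch of the Memory-R1 control flow (assumed structure, not the authors' code).
# `call_llm` stands in for a call to the RL-fine-tuned policy model
# (e.g., LLaMA-3.1-8B-Instruct or Qwen-2.5-7B-Instruct); prompt formats are illustrative.
from dataclasses import dataclass, field
import json


@dataclass
class MemoryBank:
    """A simple keyed store of memory entries."""
    entries: dict[str, str] = field(default_factory=dict)
    next_id: int = 0

    def add(self, text: str) -> str:
        key = f"m{self.next_id}"
        self.next_id += 1
        self.entries[key] = text
        return key

    def update(self, key: str, text: str) -> None:
        if key in self.entries:
            self.entries[key] = text

    def delete(self, key: str) -> None:
        self.entries.pop(key, None)


def manage_memory(bank: MemoryBank, new_fact: str, call_llm) -> None:
    """Memory Manager step: pick ADD / UPDATE / DELETE / NOOP for one extracted fact."""
    prompt = (
        "Current memories:\n"
        + "\n".join(f"{k}: {v}" for k, v in bank.entries.items())
        + f"\nNew information: {new_fact}\n"
        + 'Reply as JSON: {"op": "ADD|UPDATE|DELETE|NOOP", "key": "...", "text": "..."}'
    )
    decision = json.loads(call_llm(prompt))
    op = decision.get("op", "NOOP")
    if op == "ADD":
        bank.add(decision["text"])
    elif op == "UPDATE":
        bank.update(decision["key"], decision["text"])
    elif op == "DELETE":
        bank.delete(decision["key"])
    # NOOP: leave the memory bank unchanged


def distill_and_answer(question: str, retrieved: list[str], call_llm) -> str:
    """Answer Agent step: filter the retrieved candidates, then answer on the kept subset."""
    kept = call_llm(
        f"Question: {question}\nCandidate memories:\n"
        + "\n".join(retrieved)
        + "\nKeep only the memories relevant to the question."
    )
    return call_llm(f"Memories:\n{kept}\nQuestion: {question}\nAnswer:")
```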
Both agents are trained with outcome-driven rewards, using exact match between predicted and gold answers as the primary signal. The RL setup enables the agents to learn memory operations and utilization strategies that directly optimize for answer correctness, rather than relying on manually annotated intermediate supervision.
RL Fine-Tuning Procedures
- PPO (Proximal Policy Optimization): Used for both agents, PPO stabilizes policy updates via a clipped surrogate objective. The reward is derived from answer accuracy after the memory operations are applied.
- GRPO (Group Relative Policy Optimization): An alternative to PPO, GRPO samples groups of candidate responses and computes relative advantages within each group, obviating the need for a learned value function and improving sample efficiency (a minimal sketch of this follows the list).
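Under the assumptions below, the group-relative advantage reduces to standardizing the outcome rewards within a group of responses sampled for the same prompt, with no learned critic involved. This is a generic GRPO-style computation, not the paper's exact objective.

```python
import torch


def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for one prompt.

    `rewards` holds the scalar outcome reward (here, exact-match score) of each
    of the G sampled responses; standardizing within the group replaces the
    learned value function used by PPO.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Example: four sampled answers to one question, two of which match the gold answer.
adv = grpo_advantages(torch.tensor([1.0, 0.0, 1.0, 0.0]))
```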
The reward function for both agents is strictly outcome-based, defined as R_answer = EM(y_pred, y_gold), where EM is the exact-match score between the predicted and gold answers.
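In code, this reward reduces to a normalized exact-match check along the lines below. The normalization step (lowercasing, stripping punctuation and articles) is a common convention and an assumption here, since the paper only specifies exact match between predicted and gold answers.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(y_pred: str, y_gold: str) -> float:
    """R_answer = EM(y_pred, y_gold): 1.0 if the normalized strings match, else 0.0."""
    return float(normalize(y_pred) == normalize(y_gold))
```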
Data Construction
Training data is constructed from the LOCOMO benchmark, which features multi-turn, multi-session dialogues and associated QA pairs. For the Memory Manager, each training tuple consists of a dialogue turn, a temporal memory bank built from the preceding 50 turns, and the associated QA pairs. For the Answer Agent, each tuple includes a question, up to 60 retrieved candidate memories, and the gold answer.
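A sketch of what these two kinds of training tuples could look like as data structures is shown below; the field names and `dataclass` containers are illustrative, while the 50-turn memory window and 60-candidate retrieval budget come from the description above.

```python
from dataclasses import dataclass


@dataclass
class ManagerSample:
    """One Memory Manager training tuple (illustrative field names)."""
    dialogue_turn: str                # the new turn whose facts must be incorporated
    memory_bank: list[str]            # temporal memory from the preceding 50 turns
    qa_pairs: list[tuple[str, str]]   # (question, gold answer) pairs used for the reward


@dataclass
class AnswerSample:
    """One Answer Agent training tuple (illustrative field names)."""
    question: str
    candidate_memories: list[str]     # up to 60 RAG-retrieved entries
    gold_answer: str
```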
Experimental Results
Benchmarking and Metrics
Memory-R1 is evaluated on the LOCOMO benchmark using LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct backbones. Metrics include token-level F1, BLEU-1, and LLM-as-a-Judge (semantic correctness). Baselines include LOCOMO, Zep, A-Mem, LangMem, and Mem0, all re-implemented for consistency.
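Token-level F1 is the standard SQuAD-style overlap metric; a minimal, generic sketch is shown below (BLEU-1 and LLM-as-a-Judge scoring are omitted, and this helper is not taken from the paper's evaluation code).

```python
from collections import Counter


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over overlapping tokens."""
    pred_tokens, gold_tokens = pred.split(), gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```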
Main Findings
- Performance: Memory-R1-GRPO achieves an overall F1 of 45.02, BLEU-1 of 37.51, and LLM-as-a-Judge of 62.74 on LLaMA-3.1-8B, outperforming Mem0 by 68.9% (F1), 48.3% (BLEU-1), and 37.1% (Judge). Similar gains are observed on Qwen-2.5-7B.
- Data Efficiency: Strong generalization is achieved with as few as 152 training QA pairs, demonstrating high sample efficiency.
- Component Analysis: RL fine-tuning of both Memory Manager and Answer Agent yields substantial improvements over vanilla LLMs. Memory Distillation further enhances answer accuracy by filtering out irrelevant context.
- Policy Comparison: GRPO converges faster than PPO but both reach comparable final performance.
Ablation and Case Studies
- RL-trained Memory Manager consolidates overlapping or complementary information via UPDATE operations, avoiding fragmentation and loss of context observed in vanilla managers.
- RL-trained Answer Agent with Memory Distillation reliably selects relevant memories, improving factual accuracy and robustness to distractors.
Implementation Considerations
Resource Requirements
- Training is performed on 4×H100 GPUs (80 GB each), with a global batch size of 128 and a micro-batch size of 2 per GPU.
- Maximum prompt and response lengths are set to 4096 and 2048 tokens, respectively.
- PPO requires both actor and critic networks, while GRPO trains only the actor; an illustrative configuration mirroring this setup follows the list.
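In the configuration sketch below, only the numeric values come from the figures above; the key names are assumptions following common RLHF-trainer conventions rather than the paper's actual config.

```python
# Illustrative training configuration; key names are assumptions,
# only the numeric values come from the reported setup.
train_config = {
    "n_gpus": 4,                      # 4 x H100, 80 GB each
    "train_batch_size": 128,          # global batch size
    "micro_batch_size_per_gpu": 2,
    "max_prompt_length": 4096,        # tokens
    "max_response_length": 2048,      # tokens
    "algorithm": "grpo",              # "ppo" would additionally train a critic
}
```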
Deployment Strategies
- RL fine-tuning can be performed with minimal supervision, making Memory-R1 suitable for real-world applications with limited labeled data.
- The modular architecture allows integration with various LLM backbones and memory retrieval systems.
Limitations
- The outcome-based reward may fail to capture nuanced memory relevance when answer correctness is only weakly sensitive to individual memory operations.
- Scaling to extremely large memory banks may require further optimization of retrieval and distillation mechanisms.
Implications and Future Directions
Memory-R1 demonstrates that RL is an effective paradigm for teaching LLM agents adaptive memory management and utilization, enabling persistent, long-horizon reasoning. The framework sets a new state of the art on LOCOMO and generalizes across model architectures. Future research may explore:
- Compositional memory architectures for hierarchical or multi-modal memory.
- Integration with lifelong learning and continual adaptation.
- More sophisticated reward functions incorporating intermediate reasoning steps or human feedback.
- Scaling to open-domain, multi-agent environments.
Conclusion
Memory-R1 establishes RL as a principled approach for equipping LLM agents with agentic, memory-aware behavior. By jointly optimizing memory operations and answer generation, the framework achieves substantial gains in long-term conversational reasoning with minimal supervision. The results highlight the potential of RL for advancing persistent knowledge retention and adaptive reasoning in LLM-based systems.