
Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning (2508.19828v2)

Published 27 Aug 2025 in cs.CL and cs.MA

Abstract: LLMs have demonstrated impressive capabilities across a wide range of NLP tasks, but they remain fundamentally stateless, constrained by limited context windows that hinder long-horizon reasoning. Recent efforts to address this limitation often augment LLMs with an external memory bank, yet most existing pipelines are static and heuristic-driven, lacking any learned mechanism for deciding what to store, update, or retrieve. We present Memory-R1, a reinforcement learning (RL) framework that equips LLMs with the ability to actively manage and utilize external memory through two specialized agents: a Memory Manager that learns to perform structured memory operations, including adding, updating, deleting, or taking no operation on memory entries; and an Answer Agent that selects the most relevant entries and reasons over them to produce an answer. Both agents are fine-tuned with outcome-driven RL (PPO and GRPO), enabling adaptive memory management and utilization with minimal supervision. With as few as 152 question-answer pairs and a corresponding temporal memory bank for training, Memory-R1 outperforms the strongest existing baseline and demonstrates strong generalization across diverse question types and LLM backbones. Beyond presenting an effective approach, this work provides insights into how RL can unlock more agentic, memory-aware behavior in LLMs, pointing toward richer, more persistent reasoning systems.


Summary

  • The paper introduces an RL framework in which a Memory Manager and an Answer Agent collaboratively optimize memory operations to boost answer accuracy.
  • It employs PPO and GRPO to fine-tune agents for CRUD-style memory updates and selective distillation of relevant dialogue context.
  • The model outperforms baselines on LOCOMO with substantial gains in F1, BLEU-1, and semantic correctness, even with minimal training data.

Memory-R1: Reinforcement Learning for Memory Management in LLM Agents

Introduction

Memory-R1 presents a reinforcement learning (RL) framework for augmenting LLM agents with adaptive, structured memory management and utilization capabilities. The stateless nature of LLMs, constrained by finite context windows, limits their ability to perform long-horizon reasoning and maintain persistent knowledge across multi-session dialogues. Existing approaches typically rely on static, heuristic-driven memory pipelines, which are suboptimal for dynamic, evolving conversational contexts. Memory-R1 addresses these limitations by introducing two RL-fine-tuned agents: a Memory Manager for CRUD-style memory operations and an Answer Agent for selective memory distillation and reasoning.

Methodology

Memory-R1 Architecture

Memory-R1 consists of two specialized components:

  • Memory Manager: Trained via RL (PPO or GRPO), this agent decides whether to ADD, UPDATE, DELETE, or NOOP for each new piece of information extracted from dialogue turns. The manager operates over a temporal memory bank, incrementally evolving the memory state to maximize downstream QA performance.
  • Answer Agent: Also RL-fine-tuned, this agent receives up to 60 candidate memories retrieved via RAG for each question. It applies a Memory Distillation policy to filter and select the most relevant entries, then generates the final answer conditioned on the distilled context. A minimal sketch of both components follows this list.
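The sketch below illustrates, under assumptions, how these two components could be framed in code: a memory bank that applies the Memory Manager's ADD/UPDATE/DELETE/NOOP actions, and a distillation filter for the Answer Agent. The names (MemoryBank, apply, distill) are illustrative, not the paper's API.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class MemoryBank:
    """Hypothetical temporal memory bank managed by the Memory Manager."""
    entries: dict[int, str] = field(default_factory=dict)
    next_id: int = 0

    def apply(self, op: str, content: str = "", entry_id: int | None = None) -> None:
        """Execute one Memory Manager action: ADD, UPDATE, DELETE, or NOOP."""
        if op == "ADD":
            self.entries[self.next_id] = content
            self.next_id += 1
        elif op == "UPDATE" and entry_id in self.entries:
            # Consolidate new information into an existing, related entry.
            self.entries[entry_id] = content
        elif op == "DELETE":
            self.entries.pop(entry_id, None)
        # NOOP: leave the memory bank unchanged.


def distill(question: str, candidates: list[str], is_relevant) -> list[str]:
    """Memory Distillation: keep only the retrieved candidates the policy marks relevant."""
    return [m for m in candidates if is_relevant(question, m)]
```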

Both agents are trained with outcome-driven rewards, using exact match between predicted and gold answers as the primary signal. The RL setup enables the agents to learn memory operations and utilization strategies that directly optimize for answer correctness, rather than relying on manually annotated intermediate supervision.

RL Fine-Tuning Procedures

  • PPO (Proximal Policy Optimization): Used for both agents, PPO stabilizes policy updates via a clipped surrogate objective, ensuring robust convergence. The reward is derived from the improvement in answer accuracy after memory operations.
  • GRPO (Group Relative Policy Optimization): An alternative to PPO, GRPO samples groups of candidate actions and computes relative advantages within each group, obviating the need for a learned value function and improving sample efficiency (see the sketch after this list).
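As a hedged illustration of the group-relative advantage described above (not the authors' implementation), the rewards of a group of sampled candidates can be standardized against the group's own mean and standard deviation:

```python
import numpy as np


def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: each sampled candidate is scored against the
    mean and standard deviation of its own group, so no critic is required."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)


# Example: four candidate answers sampled for one question, rewarded by exact match.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # correct answers get positive advantages
```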

The reward function for both agents is strictly outcome-based, defined as $R_{\text{answer}} = \mathrm{EM}(y_{\text{pred}}, y_{\text{gold}})$, where $\mathrm{EM}$ denotes the exact-match score between the predicted and gold answers.
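A minimal sketch of this reward is shown below; the normalization steps (lowercasing, stripping punctuation and articles) are common exact-match conventions and an assumption here, not a detail confirmed by the paper.

```python
import re
import string


def exact_match(pred: str, gold: str) -> float:
    """Outcome reward: 1.0 if the normalized prediction equals the gold answer, else 0.0."""
    def normalize(text: str) -> str:
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)
        text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop articles (assumed convention)
        return " ".join(text.split())
    return float(normalize(pred) == normalize(gold))
```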

Data Construction

Training data is constructed from the LOCOMO benchmark, which features multi-turn, multi-session dialogues and associated QA pairs. For the Memory Manager, each training tuple consists of a dialogue turn, a temporal memory bank built from the preceding 50 turns, and the associated QA pairs. For the Answer Agent, each tuple includes a question, 60 retrieved candidate memories, and the gold answer.
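For concreteness, the two kinds of training examples could be represented as below; the field names are illustrative assumptions, not the released data schema.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class ManagerExample:
    dialogue_turn: str                   # new turn to be written into memory
    memory_bank: list[str]               # temporal memory from the preceding 50 turns
    qa_pairs: list[tuple[str, str]]      # downstream (question, gold answer) pairs for the reward


@dataclass
class AnswerExample:
    question: str
    retrieved_memories: list[str]        # 60 RAG-retrieved candidate entries
    gold_answer: str
```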

Experimental Results

Benchmarking and Metrics

Memory-R1 is evaluated on the LOCOMO benchmark using LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct backbones. Metrics include token-level F1, BLEU-1, and LLM-as-a-Judge (semantic correctness). Baselines include LOCOMO, Zep, A-Mem, LangMem, and Mem0, all re-implemented for consistency.
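As a reference point for these metrics, here is a hedged sketch of token-level F1 as it is commonly computed for QA; the paper's exact tokenization and normalization may differ.

```python
from collections import Counter


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over overlapping tokens."""
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```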

Main Findings

  • Performance: Memory-R1-GRPO achieves an overall F1 of 45.02, BLEU-1 of 37.51, and LLM-as-a-Judge of 62.74 on LLaMA-3.1-8B, outperforming Mem0 by 68.9% (F1), 48.3% (BLEU-1), and 37.1% (Judge). Similar gains are observed on Qwen-2.5-7B.
  • Data Efficiency: Strong generalization is achieved with as few as 152 training QA pairs, demonstrating high sample efficiency.
  • Component Analysis: RL fine-tuning of both Memory Manager and Answer Agent yields substantial improvements over vanilla LLMs. Memory Distillation further enhances answer accuracy by filtering out irrelevant context.
  • Policy Comparison: GRPO converges faster than PPO but both reach comparable final performance.

Ablation and Case Studies

  • The RL-trained Memory Manager consolidates overlapping or complementary information via UPDATE operations, avoiding the fragmentation and loss of context observed with vanilla managers (illustrated in the sketch after this list).
  • The RL-trained Answer Agent with Memory Distillation reliably selects relevant memories, improving factual accuracy and robustness to distractors.
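The trace below reuses the hypothetical MemoryBank sketch from the Methodology section, with invented dialogue content, to illustrate the consolidation behavior: instead of adding a second overlapping entry, the trained manager folds the new detail into the existing one via UPDATE.

```python
# Reuses the hypothetical MemoryBank from the Methodology sketch; dialogue content is invented.
bank = MemoryBank()
bank.apply("ADD", "User adopted a dog named Rex.")

# A later turn adds related detail. An untrained manager might ADD a second,
# overlapping entry and fragment the memory; the RL-trained manager instead
# consolidates the information with a single UPDATE.
bank.apply("UPDATE", "User adopted a dog named Rex and later a cat named Momo.", entry_id=0)
print(bank.entries)  # {0: 'User adopted a dog named Rex and later a cat named Momo.'}
```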

Implementation Considerations

Resource Requirements

  • Training is performed on 4×H100 GPUs (80GB each), with batch size 128 and micro-batch size 2 per GPU.
  • Maximum prompt and response lengths are set to 4096 and 2048 tokens, respectively.
  • PPO requires separate actor and critic networks, while GRPO trains only the actor; a sample configuration sketch follows.
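The settings above can be summarized in a configuration sketch like the one below; the key names are invented for illustration and do not correspond to a specific RL framework's schema.

```python
# Illustrative configuration mirroring the reported settings; key names are
# invented and do not correspond to a specific RL framework's schema.
TRAIN_CONFIG = {
    "gpus": 4,                       # 4 x H100, 80 GB each
    "train_batch_size": 128,
    "micro_batch_size_per_gpu": 2,
    "max_prompt_length": 4096,       # tokens
    "max_response_length": 2048,     # tokens
    "algorithms": {
        "ppo": {"actor": True, "critic": True},    # clipped surrogate objective
        "grpo": {"actor": True, "critic": False},  # group-relative advantages, no critic
    },
}
```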

Deployment Strategies

  • RL fine-tuning can be performed with minimal supervision, making Memory-R1 suitable for real-world applications with limited labeled data.
  • The modular architecture allows integration with various LLM backbones and memory retrieval systems.

Limitations

  • The outcome-based reward design may not capture nuanced memory relevance in cases where answer correctness is insufficiently sensitive to memory operations.
  • Scaling to extremely large memory banks may require further optimization of retrieval and distillation mechanisms.

Implications and Future Directions

Memory-R1 demonstrates that RL is an effective paradigm for teaching LLM agents adaptive memory management and utilization, enabling persistent, long-horizon reasoning. The framework sets a new state of the art on LOCOMO and generalizes across model architectures. Future research may explore:

  • Compositional memory architectures for hierarchical or multi-modal memory.
  • Integration with lifelong learning and continual adaptation.
  • More sophisticated reward functions incorporating intermediate reasoning steps or human feedback.
  • Scaling to open-domain, multi-agent environments.

Conclusion

Memory-R1 establishes RL as a principled approach for equipping LLM agents with agentic, memory-aware behavior. By jointly optimizing memory operations and answer generation, the framework achieves substantial gains in long-term conversational reasoning with minimal supervision. The results highlight the potential of RL for advancing persistent knowledge retention and adaptive reasoning in LLM-based systems.
