MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent (2507.02259v1)
Abstract: Despite improvements by length extrapolation, efficient attention and memory modules, handling infinitely long documents with linear complexity without performance degradation during extrapolation remains the ultimate challenge in long-text processing. We directly optimize for long-text tasks in an end-to-end fashion and introduce a novel agent workflow, MemAgent, which reads text in segments and updates the memory using an overwrite strategy. We extend the DAPO algorithm to facilitate training via independent-context multi-conversation generation. MemAgent has demonstrated superb long-context capabilities, being able to extrapolate from an 8K context trained on 32K text to a 3.5M QA task with performance loss < 5% and achieves 95%+ in 512K RULER test.
Summary
- The paper introduces a novel reinforcement learning approach for optimizing a token-based memory agent that enables LLMs to process arbitrarily long contexts with linear complexity.
- The methodology segments input documents and iteratively updates memory, ensuring efficient processing with a fixed memory size.
- Empirical results show high accuracy and robust generalization on extended contexts, outperforming traditional models on long-context benchmarks.
MemAgent: Reinforcement Learning-Based Memory Agents for Long-Context LLMs
The paper "MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent" (2507.02259) presents a novel approach to overcoming the limitations of LLMs in processing extremely long contexts. The authors introduce MemAgent, a reinforcement learning (RL)-trained memory agent that enables LLMs to handle arbitrarily long input sequences with linear computational complexity and minimal performance degradation, even when extrapolating far beyond their original context window.
Motivation and Context
Despite advances in LLM architectures and context window extension techniques, current models face significant challenges when tasked with reasoning over or extracting information from documents that far exceed their native context length. Existing solutions—such as positional embedding extrapolation, sparse/linear attention, and context compression—either suffer from quadratic computational costs, require retraining from scratch, or introduce architectural complexity and compatibility issues. The need for a scalable, efficient, and robust mechanism for long-context reasoning remains unmet.
MemAgent Architecture and Workflow
MemAgent addresses these challenges by introducing a fixed-length, token-based memory that is iteratively updated as the model processes a long document in segments. The workflow is as follows:
- Segmentation: The input document is divided into manageable chunks.
- Memory Update: For each chunk, the model receives the current memory state and the new chunk, then generates an updated memory summarizing all relevant information seen so far.
- Final Answer Generation: After all chunks are processed, the model produces the final output using only the accumulated memory and the original query.
This design ensures that the memory size—and thus the context window—remains constant, yielding O(N) computational complexity with respect to input length. The memory is represented as a sequence of tokens, making it both human-interpretable and compatible with standard transformer architectures.
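The following Python sketch of this loop is illustrative only: the `llm` callable, the prompt wording, and the default chunk/memory budgets (5,000 and 1,024 tokens, matching the setup reported later in the experiments) are assumptions for exposition, not the authors' implementation.

```python
from typing import Callable, List

def memagent_answer(
    llm: Callable[[str], str],       # any text-in/text-out LLM call (assumed interface)
    document_tokens: List[str],
    query: str,
    chunk_size: int = 5000,          # chunk length used in the paper's main setup
    memory_limit: int = 1024,        # fixed token budget for the overwritable memory
) -> str:
    """Process an arbitrarily long document in fixed-size passes.

    Each pass sees only (query, current memory, one chunk), so the context the
    model attends to never grows with document length: O(N) total cost.
    """
    memory = ""  # token-based memory, overwritten at every step

    for start in range(0, len(document_tokens), chunk_size):
        chunk = " ".join(document_tokens[start:start + chunk_size])
        # Memory update ("write"): the model rewrites the memory so that it
        # keeps only information useful for answering the query later.
        memory = llm(
            f"Question: {query}\n"
            f"Current memory:\n{memory}\n"
            f"New text segment:\n{chunk}\n"
            f"Rewrite the memory (at most {memory_limit} tokens), keeping only "
            f"facts needed to answer the question."
        )

    # Final answer generation: only the query and the accumulated memory are used.
    return llm(f"Question: {query}\nMemory:\n{memory}\nAnswer the question.")
```

Because the memory is overwritten rather than appended, every pass fits in the same fixed window no matter how many chunks precede it.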
RL-Based Memory Optimization
A key innovation is the use of reinforcement learning to train the memory update policy. The overwrite decision at each step is treated as an RL action, with rewards based on the utility of the memory for producing correct final answers. The authors employ a multi-conversation variant of the DAPO algorithm, which allows for efficient optimization across multiple independent context-processing trajectories. This approach directly incentivizes the model to retain only information critical for downstream tasks, discarding distractors and redundant content.
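A simplified sketch of how the outcome reward and group-normalized advantage might be computed; the helper names and the broadcast of a single trajectory-level advantage to every conversation in a rollout are our interpretation of the multi-conversation scheme, not code from the paper.

```python
from statistics import mean, pstdev

def rule_based_reward(predicted: str, gold: str) -> float:
    """Outcome reward from a simple verifier: 1.0 on normalized exact match."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO/DAPO-style group normalization over rollouts of the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# One training sample: several independent rollouts of the whole read/write loop.
# Each rollout consists of multiple independent-context conversations
# (one per memory update, plus the final answer turn).
rollout_answers = ["Paris", "London", "Paris", "paris"]
rewards = [rule_based_reward(a, gold="Paris") for a in rollout_answers]
advantages = group_advantages(rewards)

# The trajectory-level advantage is shared by every conversation in that rollout,
# so memory updates are credited only through the final answer they enable.
per_rollout_advantage = {rollout_id: adv for rollout_id, adv in enumerate(advantages)}
print(per_rollout_advantage)
```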
Theoretical Implications
MemAgent reframes the standard autoregressive modeling paradigm by introducing a latent memory variable. The joint probability of the input sequence is factorized through a sequence of read (chunk processing) and write (memory update) operations. This decomposition effectively transforms the transformer into a recurrent model with a user-controllable state size, while preserving the vanilla decoder's training and inference recipes.
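One way to write this factorization explicitly (the notation below is ours, not taken from the paper): with query $q$, chunks $c_1,\dots,c_K$, latent memories $m_k$, and answer $y$,

```latex
P(y \mid q, c_{1:K}) \;=\; \sum_{m_{1:K}} \Big[\, \prod_{k=1}^{K} P_\theta\!\left(m_k \mid q,\, m_{k-1},\, c_k\right) \Big]\, P_\theta\!\left(y \mid q,\, m_K\right), \qquad m_0 = \varnothing .
```

Each write step $P_\theta(m_k \mid \cdot)$ and the final read step reuse the same fixed-window decoder, which is what makes the model behave like a recurrent network whose state size (the memory token budget) is chosen by the user.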
Unlike feature-space compression in linear or local-global attention models, MemAgent's token-level memory is explicit and interpretable. This property facilitates reward modeling and debugging, and opens avenues for user intervention or editing of intermediate memory states.
Empirical Results
The authors conduct extensive experiments on synthetic and real-world long-context QA tasks, primarily using the RULER-HotpotQA benchmark. Key findings include:
- Length Extrapolation: MemAgent-trained models with an 8K context window (1024-token memory, 5000-token chunk size) maintain high accuracy (over 75%) on documents up to 3.5 million tokens, with negligible performance drop.
- Baseline Comparison: Competing models—whether using extended context windows, sparse attention, or post-training—exhibit rapid performance degradation as input length increases, often failing well before reaching their theoretical maximum context size.
- Ablation Studies: The memory mechanism alone provides some benefit, but RL training is essential for robust long-context generalization. Models without RL show significant accuracy loss as context grows.
- Generalization: MemAgent demonstrates strong performance on out-of-domain tasks, including variable tracking and word frequency extraction, indicating that the learned memory policy is not overfitted to specific data formats.
Numerical Highlights
Accuracy (%) by input context length:

| Model | 7K | 112K | 896K | 1.75M | 3.5M |
|---|---|---|---|---|---|
| RL-MemAgent-14B | 83.6 | 76.6 | 77.3 | 76.6 | 78.1 |
| Qwen2.5-Instruct-14B-1M | 60.2 | 50.0 | 0.0 | N/A | N/A |
| QwenLong-L1-32B | 72.7 | 31.3 | 11.7 | N/A | N/A |
MemAgent's performance remains stable across increasing context lengths, while baselines collapse.
Implementation Considerations
- Compatibility: MemAgent requires no architectural changes to the base transformer; the memory is managed at the prompt level.
- Training: RL training is performed with the multi-conversation extension of DAPO described above (a GRPO-style objective with group-normalized advantages), and outcome rewards are computed via rule-based verifiers or equivalence checks.
- Resource Requirements: The approach is computationally efficient, with linear scaling in both FLOPs and memory usage (a back-of-the-envelope illustration follows this list). Training can be performed on standard hardware used for LLM fine-tuning.
- Deployment: At inference, the model processes documents in a streaming fashion, updating memory and generating outputs in multiple passes. This enables practical deployment for applications such as book-length document QA, multi-step reasoning, and agent systems requiring persistent memory.
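As a rough illustration of the linear scaling, using the chunk and memory sizes quoted in the experiments above (the per-pass cost model here is a simplification that ignores the query and prompt template):

```python
# Number of fixed-size passes needed for a 3.5M-token document with 5,000-token chunks.
doc_tokens    = 3_500_000
chunk_tokens  = 5_000
memory_tokens = 1_024

passes = -(-doc_tokens // chunk_tokens)          # ceiling division -> 700 passes
per_pass_window = memory_tokens + chunk_tokens   # plus query/template, still under 8K

# Total work grows as passes * (bounded per-pass cost), i.e. linearly in doc_tokens,
# whereas full self-attention over the whole document would grow quadratically.
print(passes, per_pass_window)                   # 700 6024
```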
Limitations and Future Directions
While MemAgent achieves strong results, several open questions remain:
- Memory Size Selection: The optimal memory length may depend on task complexity and document structure. Adaptive or hierarchical memory mechanisms could further improve efficiency.
- Reward Modeling: The reliance on rule-based or equivalence rewards may limit applicability to tasks with ambiguous or open-ended answers.
- Integration with External Tools: Combining MemAgent with retrieval-augmented generation or tool-use agents could enhance performance on knowledge-intensive tasks.
Implications and Outlook
MemAgent provides a principled and practical solution to the long-context trilemma: arbitrary input length, lossless extrapolation, and linear computational cost. By leveraging RL to optimize a token-based memory, the approach bridges the gap between explicit supervision and implicit memory management. This work has significant implications for the development of scalable, efficient, and interpretable LLM systems capable of handling real-world, document-scale reasoning tasks.
Future research may explore adaptive memory policies, integration with external knowledge sources, and applications to multi-agent or continual learning scenarios. The explicit, human-readable memory also opens possibilities for interactive and explainable AI systems.