Mem-α: RL-Driven Memory Construction
- The paper introduces a reinforcement learning framework that optimizes memory construction via a multi-component external memory system, improving question answering accuracy.
- It formulates memory construction as a sequential Markov decision process with specialized memory-operation actions, optimized via Group Relative Policy Optimization (GRPO) to balance question-answering, syntactic, semantic, and compression rewards.
- Empirical evaluations show that Mem-α outperforms prior memory-agent baselines, generalizing to long-sequence tasks with a reduced memory footprint and improved efficiency.
Mem-α is a reinforcement learning framework for training LLM agents to construct, manage, and update external memory systems. Its design targets the limitations inherent to LLMs’ context windows, providing agents with mechanisms to determine not only what information to store but also how to structure and when to update it. Mem-α directly optimizes for downstream question answering accuracy using a combination of RL rewards and a multi-component memory schema, resulting in substantial improvements in information retention, compression, and generalization to long-sequence tasks (Wang et al., 30 Sep 2025).
1. Reinforcement Learning Formulation for Memory Construction
Mem-α formalizes memory construction as a sequential Markov Decision Process (MDP) in which, at each step $t$, the agent processes an input chunk $c_t$ together with the current memory state $M_t$, comprising core, semantic, and episodic memories. The agent's action $a_t$ is a bounded sequence of function calls drawn from the memory-operation tool set (insertions, updates, deletions, and core overwrites), each parameterized by memory type, entry id, and content. These operations update the memory via
$$M_{t+1} = \mathrm{apply}(M_t, a_t).$$
Reward signals aggregate four components:
- $r_{\text{acc}}$: QA correctness, computed via a frozen RAG pipeline over the constructed memory,
- $r_{\text{format}}$: syntactic validity of tool use,
- $r_{\text{comp}}$: memory compression,
- $r_{\text{judge}}$: semantic validity judged by an LLM.
The objective is maximized using Group Relative Policy Optimization (GRPO) without a KL penalty, employing advantages normalized within groups of rollouts. This arrangement allows Mem-α to learn not just efficient memory-update policies but also multi-step strategies that anticipate downstream QA coverage and compression tradeoffs (Wang et al., 30 Sep 2025).
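As an illustration of how the aggregated reward and group-relative advantage could be computed (a minimal sketch; the component weights, group size, and function names are assumptions rather than the paper's exact values):

```python
import numpy as np

def total_reward(r_acc, r_format, r_comp, r_judge, w=(1.0, 0.1, 0.1, 0.1)):
    """Weighted sum of the four reward components (weights are illustrative)."""
    return w[0] * r_acc + w[1] * r_format + w[2] * r_comp + w[3] * r_judge

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantage: each rollout's total reward is normalized
    against the mean and std of its rollout group; no KL penalty term is added."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: a group of four rollouts generated for the same input stream.
rollouts = [total_reward(0.9, 1.0, 0.6, 1.0),
            total_reward(0.5, 1.0, 0.8, 0.0),
            total_reward(0.7, 0.0, 0.4, 1.0),
            total_reward(0.2, 1.0, 0.9, 1.0)]
print(grpo_advantages(rollouts))  # higher-reward rollouts receive positive advantages
```

The normalized advantages then weight the policy-gradient update for the corresponding rollout, which is how memory-construction decisions made several chunks upstream get credited for downstream QA success.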
2. Multi-Component Memory Architecture
Mem-α introduces a tripartite external memory system, each component purpose-built for distinct information types and operational constraints:
- Core Memory: A single, continuously updated text summary (≤512 tokens) that resides in the model prompt, ensuring rapid access to holistic context. Only `memory_update` operations are permitted, which overwrite the core entry.
- Semantic Memory: An unbounded, discrete set of short fact entries (e.g., "Alice's favorite genre is science fiction"), supporting `insert`, `update`, and `delete` operations for granular knowledge management.
- Episodic Memory: A chronologically ordered ledger of timestamped events representing sequential actions or responses, also modifiable via `insert`, `update`, and `delete`.
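A minimal data-structure sketch of this tripartite memory and its per-component operation constraints (class and method names are illustrative assumptions; only the component semantics and the 512-token core budget come from the paper):

```python
from dataclasses import dataclass, field

CORE_TOKEN_LIMIT = 512  # core-summary budget described above

@dataclass
class EpisodicEvent:
    timestamp: str
    content: str

@dataclass
class AgentMemory:
    core: str = ""                                            # single prompt-resident summary
    semantic: dict[int, str] = field(default_factory=dict)    # entry_id -> short fact
    episodic: list[EpisodicEvent] = field(default_factory=list)

    def memory_update(self, new_summary: str) -> None:
        """Core memory supports only whole-summary overwrites, capped at ~512 tokens."""
        words = new_summary.split()                           # crude whitespace token proxy
        self.core = " ".join(words[:CORE_TOKEN_LIMIT])

    def insert(self, memory_type: str, entry_id: int, content: str) -> None:
        if memory_type == "semantic":
            self.semantic[entry_id] = content
        elif memory_type == "episodic":                       # content formatted as "timestamp: event"
            ts, _, text = content.partition(": ")
            self.episodic.append(EpisodicEvent(ts, text))

    def update(self, memory_type: str, entry_id: int, content: str) -> None:
        if memory_type == "semantic" and entry_id in self.semantic:
            self.semantic[entry_id] = content
        elif memory_type == "episodic" and 0 <= entry_id < len(self.episodic):
            self.episodic[entry_id].content = content

    def delete(self, memory_type: str, entry_id: int) -> None:
        if memory_type == "semantic":
            self.semantic.pop(entry_id, None)
        elif memory_type == "episodic" and 0 <= entry_id < len(self.episodic):
            self.episodic.pop(entry_id)
```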
During training, agents are required to write and update these memory types as they process input streams, whereas retrieval for QA uses BM25 scoring over semantic and episodic pools, decoupled from RL policy optimization (Wang et al., 30 Sep 2025).
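Continuing the `AgentMemory` sketch above, QA-time retrieval over the semantic and episodic pools might look as follows (the paper specifies BM25 scoring but not a particular implementation; the `rank_bm25` package and the `top_k` value here are illustrative choices):

```python
from rank_bm25 import BM25Okapi

def retrieve(memory: "AgentMemory", question: str, top_k: int = 2) -> list[str]:
    """Score all semantic facts and episodic events with BM25; return the top-k entries."""
    pool = list(memory.semantic.values()) + [
        f"{e.timestamp}: {e.content}" for e in memory.episodic
    ]
    if not pool:
        return []
    bm25 = BM25Okapi([doc.lower().split() for doc in pool])
    return bm25.get_top_n(question.lower().split(), pool, n=top_k)
```

The retrieved entries, together with the always-present core summary, are handed to the frozen QA pipeline; because retrieval is fixed, gradients shape only how the agent writes memory, not how it reads it.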
3. Training Dataset and Procedure
Mem-α leverages a curated RL dataset comprising 4,139 multi-turn episodes from eight annotated sources, balanced down to 562 episodes for RL stability. Tasks are stratified into:
- Accurate Retrieval (AR): Fact storage across multi-chunk contexts with large downstream QA sets (SQuAD, HotpotQA, PerLTQA, LongMemEval-Train).
- Test-Time Learning (TTL): On-the-fly induction of new classification rules from evolving examples (PubMed-RCT, NLU, TREC-C).
- Long-Range Understanding (LRU): Book-length decomposition and summarization (BookSum).
Each episode is partitioned into 8–16 chunks totaling ≤30,000 tokens. Agents (Qwen3-4B backbone) are fine-tuned with GRPO for three days on 32 H100 GPUs, with the final checkpoint selected by validation-set performance after approximately 205 updates (Wang et al., 30 Sep 2025).
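A small sketch of how an episode might be partitioned into chunked input streams (the greedy packing and whitespace tokenization here are assumptions; only the 8–16 chunk count and ≤30,000-token budget come from the paper):

```python
def partition_episode(turns: list[str], num_chunks: int = 8,
                      token_budget: int = 30_000) -> list[list[str]]:
    """Truncate an episode to the token budget, then split it into
    `num_chunks` roughly equal, consecutive chunks of dialogue turns."""
    total, kept = 0, []
    for turn in turns:                 # whitespace count as a crude token proxy
        n = len(turn.split())
        if total + n > token_budget:
            break
        kept.append(turn)
        total += n
    size, rem = divmod(len(kept), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        end = start + size + (1 if i < rem else 0)
        if start < end:
            chunks.append(kept[start:end])
        start = end
    return chunks
```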
4. Empirical Evaluation and Comparative Performance
Evaluation spans both in-distribution and out-of-distribution (MemoryAgentBench) scenarios, including AR, TTL, and LRU benchmarks:
| Method | AR (Val.) | TTL (Val.) | LRU (Val.) | Avg (Val.) | Memory (Val.) | AR (OOD) | TTL (OOD) | LRU (OOD) | Avg (OOD) | Memory (OOD) |
|---|---|---|---|---|---|---|---|---|---|---|
| Long-Context | 0.742 | 0.621 | 0.052 | 0.588 | 10.8K | 0.280 | 0.558 | 0.125 | 0.461 | 33K |
| RAG-Top2 | 0.762 | 0.563 | 0.042 | 0.567 | 11.3K | 0.690 | 0.610 | 0.065 | 0.502 | 207K |
| MemAgent | 0.091 | 0.398 | 0.103 | 0.236 | 0.84K | 0.070 | 0.325 | 0.043 | 0.198 | 0.92K |
| MEM1 | 0.039 | 0.167 | 0.085 | 0.111 | 0.17K | 0.070 | 0.090 | 0.029 | 0.071 | 0.21K |
| Mem-α | 0.786 | 0.623 | 0.187 | 0.642 | 7.9K | 0.740 | 0.574 | 0.129 | 0.592 | 129K |
Mem-α exhibits the strongest performance across all QA paradigms, with robust generalization to unseen domains and to input streams far longer than those seen in training (up to 474K tokens versus the ≤30K-token training limit). At test time, its memory footprint remains substantially smaller than that of the retrieval-only approach (129K vs. 207K out of distribution), indicative of superior compression and information selection (Wang et al., 30 Sep 2025).
5. Reward Design, Ablation, and Agent Behavior
Task success depends critically on carefully chosen reward components:
- Excluding the semantic-validity reward ($r_{\text{judge}}$) degrades memory coherence, indicating that the QA-driven reward alone is insufficient to elicit precise memory entries.
- The compression reward ($r_{\text{comp}}$) tunes the size–performance tradeoff; its weight sets the point of optimal balance between memory footprint and QA accuracy.
- Case studies highlight that Mem-α maintains a concise core summary, precise semantic facts, and appropriate episodic merges across long interaction histories, whereas baseline LLMs either leave memory fields blank or over-duplicate entries, failing to exploit the memory structure.
A plausible implication is that multi-component, RL-driven reward design is essential for meaningful memory construction in LLM agents and that distinct memory types require distinct operational constraints for coherent task performance (Wang et al., 30 Sep 2025).
6. Historical Context and Comparative Methods
Preceding approaches to RL-based memory management include Memory Augmented Self-Play (Sodhani et al., 2018)—which leverages LSTM-based episodic summaries in exploration-intensive environments—and Self Memory Management (SMM) with intrinsic motivation (Demir, 2021)—where agents decide what to memorize based on observation rarity in partially observable domains. These methods demonstrate the generality of RL for memory optimization but are restricted to either episode-level recurrences or low-capacity windowed sequences.
Mem-α advances these paradigms by supporting fine-grained, heterogeneous memory schemas and by coupling reward directly to downstream performance, not merely exploration diversity or intrinsic cues.
7. Limitations and Prospective Extensions
Current Mem-α instantiations do not address consistency enforcement in memory (e.g., resolving contradictory facts), nor do they interface with production-grade databases. Scaling may require advances in latency, safety, durability, and multimodal integration. The framework is extensible: future variants could incorporate hierarchical memory organization, end-to-end training of retrieval/generation components, or support for images and tabular data as memory entries.
This suggests that robust RL-driven memory is foundational for continuous, lifelong learning, cross-document abstraction, and open-domain reasoning in next-generation LM agents (Wang et al., 30 Sep 2025).
Mem-α establishes the first RL-based, multi-part, externally managed memory system where agents autonomously learn what, how, and when to store information, with capabilities validated by generalization beyond training scale and substantial improvements over prior baselines (Wang et al., 30 Sep 2025).