
MemoryRewardBench: Evaluating LLM Memory Supervision

Updated 24 January 2026
  • MemoryRewardBench is a benchmark designed to evaluate reward models' ability to supervise long-term memory updates in large language models.
  • It applies three core tasks—long-context reasoning, multi-turn dialogue, and long-form generation—across varied token lengths and memory patterns.
  • Empirical findings reveal that while proprietary models excel under outcome-based evaluation, all reward models face challenges like process-based bias and degradation at ultra-long contexts.

MemoryRewardBench is a systematic evaluation benchmark designed to assess the capability of reward models (RMs) to judge long-term memory management in LLMs. As LLMs increasingly employ memory-centric mechanisms to process long contexts by segmenting inputs and updating intermediate memory states, reliable and fine-grained supervision of these memory updates has become a central challenge. MemoryRewardBench addresses this by offering a comprehensive suite of tasks, settings, and evaluation metrics that expose both the strengths and limitations of current RMs in long-horizon scenarios spanning tens or hundreds of thousands of tokens (Tang et al., 17 Jan 2026).

1. Motivation and Scope

Modern LLMs process extremely long contexts by dividing input sequences into manageable chunks. Information preserved between segments—the “memory”—is essential for global coherence and accurate downstream reasoning. As the length and complexity of contexts increase, traditional quality control based on final outcomes and manual human annotation becomes insufficient. MemoryRewardBench is motivated by two questions:

  • Can existing RMs accurately distinguish between effective and ineffective memory-management processes in LLMs?
  • Where do current RMs excel or fail, particularly as context length and memory strategies become more sophisticated?

Unlike prior benchmarks that focus on LLM performance, MemoryRewardBench targets the evaluation of RMs as preference judges over candidate memory update trajectories.

2. Benchmark Design and Task Coverage

MemoryRewardBench features a tripartite task structure and ten distinct settings, designed to test the generalizability and robustness of RMs across a diversity of memory patterns and context lengths.

Core Task Types

  1. Long-Context Reasoning: Agents process long documents in segments, updating a fixed-size memory after each chunk, with a global question answered at the end.
  2. Multi-Turn Dialogue Understanding: Agents track and update memory over hundreds of dialogue turns, subsequently retrieving relevant information to address a downstream query.
  3. Long-Form Generation: Agents generate structured text stepwise or in parallel, where each generative step’s output informs subsequent memory.

Task Settings and Memory Patterns

Task                     Setting name       Pattern
Long-Context Reasoning   Sequential-Noise   Sequential
Long-Context Reasoning   Sequential-Drop    Sequential
Long-Context Reasoning   Mixed-Noise        Mixed
Long-Context Reasoning   Mixed-Drop         Mixed
Multi-Turn Dialogue      Mem0-Out           Sequential
Multi-Turn Dialogue      Mem0-Mem           Sequential
Multi-Turn Dialogue      A-Mem-Out          Sequential
Multi-Turn Dialogue      A-Mem-Mem          Sequential
Long-Form Generation     Sequential         Sequential
Long-Form Generation     Parallel           Parallel

Each setting is instantiated at five context lengths: 8K, 16K, 32K, 64K, and 128K tokens, yielding 2400 preference pairs in total. The benchmark formalizes three canonical memory patterns:

  • Sequential
    • m_1 = Φ(c_1)
    • m_t = Φ(m_{t-1}, c_t) for t = 2, …, n
    • Outcome = decode(m_n)
  • Parallel
    • Input partitioned into groups G_1, …, G_k; a memory m^(j) is maintained per group; final output o = g(m^(1), …, m^(k))
  • Mixed
    • Composition of parallel block processing followed by sequential fusion and update
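The three patterns above can be sketched in Python. The operators Φ (memory update), g (fusion), and the concrete memory type are hypothetical stand-ins chosen for illustration; the benchmark does not prescribe implementations, and the final decode(m_n) step is omitted here:

```python
from functools import reduce
from typing import Callable, Sequence

Memory = str  # stand-in for the fixed-size intermediate state

def run_sequential(chunks: Sequence[str],
                   phi: Callable[[Memory, str], Memory]) -> Memory:
    # m_1 = Phi("", c_1); m_t = Phi(m_{t-1}, c_t) for t = 2, ..., n
    return reduce(phi, chunks, "")

def run_parallel(groups: Sequence[Sequence[str]],
                 phi: Callable[[Memory, str], Memory],
                 fuse: Callable[[Sequence[Memory]], Memory]) -> Memory:
    # one memory m^(j) per group G_j, combined by the fusion operator g
    return fuse([run_sequential(g, phi) for g in groups])

def run_mixed(groups: Sequence[Sequence[str]],
              phi: Callable[[Memory, str], Memory]) -> Memory:
    # parallel block processing, then sequential fusion/update:
    # each fused block memory is treated as a "chunk" of a second pass
    block_memories = [run_sequential(g, phi) for g in groups]
    return reduce(phi, block_memories, "")

# Toy operators: concatenate, then truncate to mimic a bounded memory
phi = lambda m, c: (m + " " + c).strip()[-64:]
fuse = lambda ms: " | ".join(ms)
```

The truncation in `phi` mirrors the fixed-size memory constraint: later chunks can evict earlier content, which is exactly the failure surface the benchmark asks RMs to judge.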

3. Evaluation Protocols and Metrics

MemoryRewardBench assesses reward models using two primary evaluation criteria:

  • Type 1 (Outcome-Based): If one trajectory produces a correct answer and the other is incorrect, the RM should select the correct trajectory.
  • Type 2 (Process-Based): When both trajectories yield the correct answer, the RM should prefer the trajectory with cleaner and logically coherent memory updates.
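The two criteria can be expressed as a small decision rule. The `Trajectory` fields below are assumptions made for illustration, not the benchmark's actual data schema:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    answer_correct: bool    # did the final decoded answer match gold?
    process_quality: float  # assumed score for cleanliness/coherence of updates

def eval_type(a: Trajectory, b: Trajectory) -> str:
    """Classify a preference pair under the benchmark's two criteria."""
    if a.answer_correct != b.answer_correct:
        return "type1"  # outcome-based: exactly one trajectory is correct
    return "type2"      # process-based: outcomes tie, judge the process

def gold_choice(a: Trajectory, b: Trajectory) -> int:
    """Index (0 or 1) of the trajectory an ideal RM should prefer."""
    if eval_type(a, b) == "type1":
        return 0 if a.answer_correct else 1
    return 0 if a.process_quality >= b.process_quality else 1
```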

Formal Metrics

  • Judgment Accuracy: Proportion of pairs where the RM’s choice matches the gold-standard “better” trajectory (random guessing = 50%).
  • Consistency Under Position Swap: Each preference pair is presented twice with the order swapped; a robust RM should maintain the same choice, and consistency below 50% exposes positional bias.
  • Constraint-Adherence Curve: For long-form generation tasks, RM accuracy is measured while the proportion of instruction constraints made visible is varied, characterizing sensitivity to constraint density.
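The first two metrics are straightforward to compute. The `rm` callable below (returning 0 to prefer the first-shown trajectory, 1 for the second) is an assumed interface for illustration:

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str, int]  # (trajectory_a, trajectory_b, gold index)

def judgment_accuracy(pairs: List[Pair],
                      rm: Callable[[str, str], int]) -> float:
    """Fraction of pairs where the RM picks the gold 'better' trajectory."""
    return sum(rm(a, b) == gold for a, b, gold in pairs) / len(pairs)

def swap_consistency(pairs: List[Pair],
                     rm: Callable[[str, str], int]) -> float:
    """Present each pair in both orders; a position-invariant RM's
    returned index must flip when the presentation order is swapped."""
    hits = 0
    for a, b, _ in pairs:
        hits += rm(a, b) != rm(b, a)  # consistent iff the choice tracks content
    return hits / len(pairs)
```

A maximally position-biased judge (always preferring the first-shown sample) scores 0% consistency under this definition, which is how the benchmark surfaces the bias reported below.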

4. Models Evaluated

A representative sample of the latest proprietary and open-source reward models was assessed:

  • Proprietary: Claude-Opus-4.5, Gemini 3 Pro, Qwen3-Max
  • Open-Source: GLM4.5-106A12B, Qwen2.5-72B, Qwen2.5-7B, Qwen3-235B, Qwen3-32B, Qwen3-14B, Qwen3-8B, Qwen3-4B, Llama 3.3-70B-Instruct, Llama 3.1-8B-Instruct

These RMs were tasked with serving as preference judges on the full suite of MemoryRewardBench scenarios.

5. Empirical Findings

Overall Performance

  • Proprietary models: Claude-Opus-4.5 leads with 74.8% accuracy, followed by Gemini3-Pro (71.6%) and Qwen3-Max (67.8%).
  • Open-source: GLM4.5-106A12B (68.2%) outperforms Qwen3-Max and closes the gap with proprietary models.
  • The performance gap is narrowest on long-context reasoning and widens on tasks requiring temporal tracking or dense constraint management.

Model Scale and Generation

  • Accuracy does not correlate monotonically with parameter scale; newer-generation models outperform older models regardless of size. For example, Qwen3-4B (52.4%) surpasses Qwen2.5-7B (38.2%).
  • Larger or newer models (e.g., Qwen3-235B at 66.6%) outperform significant prior-generation models, e.g., Llama 3.3-70B-Instruct (52.9%).

Task and Pattern Difficulty

  • Long-Context Reasoning: Easiest setting, with many RMs exceeding 60% accuracy.
  • Long-Form Generation: Intermediate difficulty, demanding persistent adherence to abstract generation constraints.
  • Multi-Turn Dialogue Understanding: Most challenging; RMs often fail to track correct and comprehensive memory states.
  • Memory Pattern Effects: RMs accurately judge sequential processes (70–80% accuracy) but underperform on parallel and mixed patterns (mid-60%), revealing a predisposition for stepwise causal reasoning.

Biases and Failures

  • Under outcome-based evaluation, RMs are robust (~80% accuracy).
  • In process-based evaluation (equal outcomes), RMs are subject to positional bias, preferring the first sample and yielding consistency below 50%.
  • As the number of explicit instructions increases, RM accuracy peaks near 80% at moderate constraint densities (25%), while additional constraints have diminishing or negative effects.
  • Performance remains above random up to 64K context, but both accuracy and positional consistency degrade rapidly at longer contexts (notably 128K tokens), even for large models like Llama 3.3-70B-Instruct.

Memory-Enhancement Signals

Adding semantic tags (A-Mem) to dialogue memory substantially improves RM accuracy, with gains of 10–15% compared to simple untagged summaries.
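As an illustration only (the exact A-Mem tag schema is not given here, so the field names below are assumptions), a tagged memory entry might augment a plain summary with retrieval-friendly metadata, giving the RM explicit anchors to verify against the dialogue history:

```python
# Untagged memory: a bare summary, as in the simple baseline
untagged = "User moved to Berlin and adopted a cat named Miso."

# Hypothetical tagged entry in the spirit of A-Mem's semantic tagging
tagged = {
    "summary": "User moved to Berlin and adopted a cat named Miso.",
    "entities": ["Berlin", "Miso"],   # salient names the RM can cross-check
    "topic": "life_update",           # coarse semantic category
    "turn_range": [42, 47],           # which dialogue turns the entry covers
}
```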

6. Conclusions and Research Directions

MemoryRewardBench identifies substantive limitations in the current generation of reward models for long-term memory supervision:

  • Persistent process and positional biases undermine process-based learning.
  • RMs are ill-equipped to judge parallel and mixed memory management strategies.
  • Model fragility increases abruptly beyond 64K-token contexts.
  • Dense, multifaceted constraints in generative tasks exceed the granularity current RMs can consistently enforce.

Emerging directions indicated by these findings include the development of sample-order-invariant RMs, process-centric evaluation architectures, coverage of parallel and mixed patterns through synthetic data, hierarchical evaluation for ultra-long contexts, and the systematic use of auxiliary metadata (temporal tags, semantic summaries, or control signals) as guiding factors for RM supervision.

By formalizing a broad suite of memory management challenge scenarios and developing rigorous, scalable metrics for their evaluation, MemoryRewardBench provides both a key diagnostic tool for RM quality and a strategic blueprint for the evolution of memory-centric supervision in LLMs (Tang et al., 17 Jan 2026).
