
MemoryRewardBench: Evaluating LLM Memory Supervision

Updated 24 January 2026
  • MemoryRewardBench is a benchmark designed to evaluate reward models' ability to supervise long-term memory updates in large language models.
  • It applies three core tasks—long-context reasoning, multi-turn dialogue, and long-form generation—across varied token lengths and memory patterns.
  • Empirical findings reveal that while proprietary models excel under outcome-based evaluation, all reward models face challenges like process-based bias and degradation at ultra-long contexts.

MemoryRewardBench is a systematic evaluation benchmark designed to assess the capability of reward models (RMs) to judge long-term memory management in LLMs. As LLMs increasingly employ memory-centric mechanisms to process long contexts by segmenting inputs and updating intermediate memory states, reliable and fine-grained supervision of these memory updates has become a central challenge. MemoryRewardBench addresses this by offering a comprehensive suite of tasks, settings, and evaluation metrics that expose both the strengths and limitations of current RMs in long-horizon scenarios spanning tens or hundreds of thousands of tokens (Tang et al., 17 Jan 2026).

1. Motivation and Scope

Modern LLMs process extremely long contexts by dividing input sequences into manageable chunks. Information preserved between segments—the “memory”—is essential for global coherence and accurate downstream reasoning. As the length and complexity of contexts increase, traditional quality control based on final outcomes and manual human annotation becomes insufficient. MemoryRewardBench is motivated by two questions:

  • Can existing RMs accurately distinguish between effective and ineffective memory-management processes in LLMs?
  • Where do current RMs excel or fail, particularly as context length and memory strategies become more sophisticated?

Unlike prior benchmarks that focus on LLM performance, MemoryRewardBench targets the evaluation of RMs as preference judges over candidate memory update trajectories.

2. Benchmark Design and Task Coverage

MemoryRewardBench features a tripartite task structure and ten distinct settings, designed to test the generalizability and robustness of RMs across a diversity of memory patterns and context lengths.

Core Task Types

  1. Long-Context Reasoning: Agents process long documents in segments, updating a fixed-size memory after each chunk, with a global question answered at the end.
  2. Multi-Turn Dialogue Understanding: Agents track and update memory over hundreds of dialogue turns, subsequently retrieving relevant information to address a downstream query.
  3. Long-Form Generation: Agents generate structured text stepwise or in parallel, where each generative step’s output informs subsequent memory.

Task Settings and Memory Patterns

Task                     Setting name       Pattern
Long-Context Reasoning   Sequential-Noise   Sequential
Long-Context Reasoning   Sequential-Drop    Sequential
Long-Context Reasoning   Mixed-Noise        Mixed
Long-Context Reasoning   Mixed-Drop         Mixed
Multi-Turn Dialogue      Mem0-Out           Sequential
Multi-Turn Dialogue      Mem0-Mem           Sequential
Multi-Turn Dialogue      A-Mem-Out          Sequential
Multi-Turn Dialogue      A-Mem-Mem          Sequential
Long-Form Generation     Sequential         Sequential
Long-Form Generation     Parallel           Parallel

Each setting is instantiated at five context lengths: 8K, 16K, 32K, 64K, and 128K tokens, yielding 2400 preference pairs in total. The benchmark formalizes three canonical memory patterns:

  • Sequential
    • m_1 = Φ(c_1)
    • m_t = Φ(m_{t-1}, c_t) for t = 2, …, n
    • Outcome = decode(m_n)
  • Parallel
    • Input partitioned into groups G_1, …, G_k; a memory m^(j) is maintained per group; final output o = g(m^(1), …, m^(k))
  • Mixed
    • Composition of parallel block processing followed by sequential fusion and update
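The three patterns above can be sketched in Python. The operators Φ (memory update), g (fusion), and the concrete memory type are hypothetical stand-ins chosen for illustration; the benchmark does not prescribe implementations, and the final decode(m_n) step is omitted here:

```python
from functools import reduce
from typing import Callable, Sequence

Memory = str  # stand-in for the fixed-size intermediate state

def run_sequential(chunks: Sequence[str],
                   phi: Callable[[Memory, str], Memory]) -> Memory:
    # m_1 = Phi("", c_1); m_t = Phi(m_{t-1}, c_t) for t = 2, ..., n
    return reduce(phi, chunks, "")

def run_parallel(groups: Sequence[Sequence[str]],
                 phi: Callable[[Memory, str], Memory],
                 fuse: Callable[[Sequence[Memory]], Memory]) -> Memory:
    # one memory m^(j) per group G_j, combined by the fusion operator g
    return fuse([run_sequential(g, phi) for g in groups])

def run_mixed(groups: Sequence[Sequence[str]],
              phi: Callable[[Memory, str], Memory]) -> Memory:
    # parallel block processing, then sequential fusion/update:
    # each fused block memory is treated as a "chunk" of a second pass
    block_memories = [run_sequential(g, phi) for g in groups]
    return reduce(phi, block_memories, "")

# Toy operators: concatenate, then truncate to mimic a bounded memory
phi = lambda m, c: (m + " " + c).strip()[-64:]
fuse = lambda ms: " | ".join(ms)
```

The truncation in `phi` mirrors the fixed-size memory constraint: later chunks can evict earlier content, which is exactly the failure surface the benchmark asks RMs to judge.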

3. Evaluation Protocols and Metrics

MemoryRewardBench assesses reward models using two primary evaluation criteria:

  • Type 1 (Outcome-Based): If one trajectory produces a correct answer and the other is incorrect, the RM should select the correct trajectory.
  • Type 2 (Process-Based): When both trajectories yield the correct answer, the RM should prefer the trajectory with cleaner and logically coherent memory updates.
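The two criteria can be expressed as a small decision rule. The `Trajectory` fields below are assumptions made for illustration, not the benchmark's actual data schema:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    answer_correct: bool    # did the final decoded answer match gold?
    process_quality: float  # assumed score for cleanliness/coherence of updates

def eval_type(a: Trajectory, b: Trajectory) -> str:
    """Classify a preference pair under the benchmark's two criteria."""
    if a.answer_correct != b.answer_correct:
        return "type1"  # outcome-based: exactly one trajectory is correct
    return "type2"      # process-based: outcomes tie, judge the process

def gold_choice(a: Trajectory, b: Trajectory) -> int:
    """Index (0 or 1) of the trajectory an ideal RM should prefer."""
    if eval_type(a, b) == "type1":
        return 0 if a.answer_correct else 1
    return 0 if a.process_quality >= b.process_quality else 1
```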

Formal Metrics

  • Judgment Accuracy: Proportion of pairs where the RM’s choice matches the gold-standard “better” trajectory (random guessing = 50%).
  • Consistency Under Position Swap: Each preference pair is presented twice with the order swapped; a robust RM should maintain the same choice, and consistency below 50% exposes positional bias.
  • Constraint-Adherence Curve: For long-form generation tasks, RM accuracy is measured while the proportion of instruction constraints made visible is varied, characterizing sensitivity to constraint density.
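The first two metrics are straightforward to compute. The `rm` callable below (returning 0 to prefer the first-shown trajectory, 1 for the second) is an assumed interface for illustration:

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str, int]  # (trajectory_a, trajectory_b, gold index)

def judgment_accuracy(pairs: List[Pair],
                      rm: Callable[[str, str], int]) -> float:
    """Fraction of pairs where the RM picks the gold 'better' trajectory."""
    return sum(rm(a, b) == gold for a, b, gold in pairs) / len(pairs)

def swap_consistency(pairs: List[Pair],
                     rm: Callable[[str, str], int]) -> float:
    """Present each pair in both orders; a position-invariant RM's
    returned index must flip when the presentation order is swapped."""
    hits = 0
    for a, b, _ in pairs:
        hits += rm(a, b) != rm(b, a)  # consistent iff the choice tracks content
    return hits / len(pairs)
```

A maximally position-biased judge (always preferring the first-shown sample) scores 0% consistency under this definition, which is how the benchmark surfaces the bias reported below.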

4. Models Evaluated

A representative sample of the latest proprietary and open-source reward models was assessed:

  • Proprietary: Claude-Opus-4.5, Gemini 3 Pro, Qwen3-Max
  • Open-Source: GLM4.5-106A12B, Qwen2.5-72B, Qwen2.5-7B, Qwen3-235B, Qwen3-32B, Qwen3-14B, Qwen3-8B, Qwen3-4B, Llama 3.3-70B-Instruct, Llama 3.1-8B-Instruct

These RMs were tasked with serving as preference judges on the full suite of MemoryRewardBench scenarios.

5. Empirical Findings

Overall Performance

  • Proprietary models: Claude-Opus-4.5 leads with 74.8% accuracy, followed by Gemini3-Pro (71.6%) and Qwen3-Max (67.8%).
  • Open-source: GLM4.5-106A12B (68.2%) outperforms Qwen3-Max and closes the gap with proprietary models.
  • The performance gap is narrowest on long-context reasoning and widens on tasks requiring temporal tracking or dense constraint management.

Model Scale and Generation

  • Accuracy does not correlate monotonically with parameter scale; newer-generation models outperform older models regardless of size. For example, Qwen3-4B (52.4%) surpasses Qwen2.5-7B (38.2%).
  • Larger or newer models (e.g., Qwen3-235B at 66.6%) outperform significant prior-generation models, e.g., Llama 3.3-70B-Instruct (52.9%).

Task and Pattern Difficulty

  • Long-Context Reasoning: Easiest setting, with many RMs exceeding 60% accuracy.
  • Long-Form Generation: Intermediate difficulty, demanding persistent adherence to abstract generation constraints.
  • Multi-Turn Dialogue Understanding: Most challenging; RMs often fail to track correct and comprehensive memory states.
  • Memory Pattern Effects: RMs accurately judge sequential processes (70–80% accuracy) but underperform on parallel and mixed patterns (mid-60%), revealing a predisposition for stepwise causal reasoning.

Biases and Failures

  • Under outcome-based evaluation, RMs are robust (~80% accuracy).
  • In process-based evaluation (equal outcomes), RMs are subject to positional bias, preferring the first sample and yielding consistency below 50%.
  • As the number of explicit instructions increases, RM accuracy peaks near 80% at moderate constraint densities (25%), while additional constraints have diminishing or negative effects.
  • Performance remains above random up to 64K context, but both accuracy and positional consistency degrade rapidly at longer contexts (notably 128K tokens), even for large models like Llama 3.3-70B-Instruct.

Memory-Enhancement Signals

Adding semantic tags (A-Mem) to dialogue memory substantially improves RM accuracy, with gains of 10–15% compared to simple untagged summaries.
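As an illustration only (the exact A-Mem tag schema is not given here, so the field names below are assumptions), a tagged memory entry might augment a plain summary with retrieval-friendly metadata, giving the RM explicit anchors to verify against the dialogue history:

```python
# Untagged memory: a bare summary, as in the simple baseline
untagged = "User moved to Berlin and adopted a cat named Miso."

# Hypothetical tagged entry in the spirit of A-Mem's semantic tagging
tagged = {
    "summary": "User moved to Berlin and adopted a cat named Miso.",
    "entities": ["Berlin", "Miso"],   # salient names the RM can cross-check
    "topic": "life_update",           # coarse semantic category
    "turn_range": [42, 47],           # which dialogue turns the entry covers
}
```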

6. Conclusions and Research Directions

MemoryRewardBench identifies substantive limitations in the current generation of reward models for long-term memory supervision:

  • Persistent process and positional biases undermine process-based learning.
  • RMs are ill-equipped to judge parallel and mixed memory management strategies.
  • Model fragility increases abruptly beyond 64K-token contexts.
  • Dense, multifaceted constraints in generative tasks exceed the granularity current RMs can consistently enforce.

Emerging directions indicated by these findings include the development of sample-order-invariant RMs, process-centric evaluation architectures, coverage of parallel and mixed patterns through synthetic data, hierarchical evaluation for ultra-long contexts, and the systematic use of auxiliary metadata (temporal tags, semantic summaries, or control signals) as guiding factors for RM supervision.

By formalizing a broad suite of memory management challenge scenarios and developing rigorous, scalable metrics for their evaluation, MemoryRewardBench provides both a key diagnostic tool for RM quality and a strategic blueprint for the evolution of memory-centric supervision in LLMs (Tang et al., 17 Jan 2026).
