
MemoryBench: LLM Memory & Continual Learning

Updated 5 November 2025
  • MemoryBench is a benchmark that evaluates LLM memory and continual learning by simulating dynamic user feedback across diverse domains and languages.
  • It introduces robust protocols to measure both declarative and procedural memory, incorporating realistic, interactive feedback cycles.
  • The framework offers open-source code and datasets, revealing challenges in efficiency, generalizability, and the integration of user feedback in LLM systems.

MemoryBench is a comprehensive benchmark for evaluating memory and continual learning capabilities in LLM systems (LLMsys), with a particular focus on assessing the ability to accumulate and exploit user feedback over time. It addresses critical shortcomings in previous LLM memory benchmarks by introducing a user feedback simulation framework, expanding evaluation across diverse domains and languages, and defining robust methodologies for measuring both declarative and procedural memory. MemoryBench is designed to reflect real-world application settings where continual adaptation from user interactions is essential, with all code, datasets, and protocols released open source for reproducibility and extension (Ai et al., 20 Oct 2025).

1. Motivation and Rationale

LLMsys have historically relied on scaling data, model parameters, and test-time computation for performance improvements; however, such gains are saturating due to high-quality data depletion and diminishing computational returns (scaling law limits). MemoryBench is motivated by the need to support continual learning: the process by which LLMsys improve after deployment by integrating accumulated user feedback—echoing both human learning and classical information retrieval systems. Existing benchmarks inadequately evaluate this direction because they focus on static, homogeneous tasks (e.g., reading comprehension, long-form QA), neglect procedural memory (learning from feedback during service), and fail to capture the full diversity of LLM use cases.

2. Overcoming Prior Benchmark Limitations

MemoryBench directly addresses gaps in former LLM memory benchmarks:

  • Prior tools largely operate in static settings, measuring only retrieval from pre-supplied data (semantic/episodic memory), not dynamic learning.
  • They emphasize homogeneous task distributions (e.g., reading comprehension) and do not engage with procedural memory—adaptive change based on user feedback after deployment.
  • Existing evaluations cover a narrow spectrum of task types and languages, rarely simulating diverse, realistic LLM applications or actual user feedback cycles.

MemoryBench introduces:

  • Dynamic, feedback-rich continual learning tasks.
  • Realistic user feedback simulation using a dedicated User Simulator (explicit, action-based, and implicit feedback).
  • Multi-domain, multilingual, and varied task coverage.
  • Evaluation protocols for both declarative and procedural memory.

3. System Architecture and Feedback Simulation

MemoryBench’s architecture comprises three core modules:

  1. Task Provider: Supplies query $q$, context $c$, and evaluation metadata $v$ for training (feedback simulation) and test (evaluation) splits.
  2. User Simulator:
    • Explicit feedback: Natural language critiques, satisfaction scores, follow-up responses via LLM-as-user.
    • Action feedback: Simulated user actions (like/dislike/copy/no-action), with probabilities mapped from satisfaction scores using empirically driven logistic models calibrated against real-world deployments (e.g., Blenderbot 3).
    • Implicit feedback: Behavioral simulations such as copying or continued session.
    • Feedback generation is split by task type:
      • Fact-based tasks: Deterministic mapping from metrics (F1, accuracy) to feedback.
      • Open-ended tasks: LLM-generated user critique and behavioral modeling.
  3. Performance Monitor: Measures LLMsys output on test sets with dataset-native metrics, optionally aggregating results via LLM-as-Judge for composite multi-metric tasks.
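
A minimal Python sketch of how these modules could fit together: a Task Provider case, a Feedback record from the User Simulator, one feedback-simulation step, and a simple Performance Monitor. All class, field, and method names (`respond`, `update_memory`) are illustrative assumptions, not the released MemoryBench API:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TaskCase:
    """One case from the Task Provider: query q, context c, and evaluation metadata v."""
    query: str
    context: Optional[str]
    eval_metadata: dict


@dataclass
class Feedback:
    """Simulated user feedback for one interaction (explicit, action-based, or implicit)."""
    critique: Optional[str] = None        # explicit natural-language critique
    satisfaction: Optional[float] = None  # explicit satisfaction score
    action: Optional[str] = None          # "like" / "dislike" / "copy" / "no-action"
    copied: bool = False                  # implicit behavioural signal


def run_feedback_step(system, case: TaskCase,
                      simulate_user: Callable[[TaskCase, str], Feedback]) -> Feedback:
    """One training-split step: the LLMsys responds, the User Simulator reacts,
    and the resulting feedback is appended to the system's procedural memory."""
    response = system.respond(case)            # LLMsys under test
    feedback = simulate_user(case, response)   # User Simulator
    system.update_memory(case, response, feedback)
    return feedback


def monitor(system, test_cases: list, metric: Callable[[TaskCase, str], float]) -> float:
    """Performance Monitor: score test-split responses with a dataset-native metric."""
    scores = [metric(case, system.respond(case)) for case in test_cases]
    return sum(scores) / len(scores)
```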

Procedural memory is defined operationally as a sequence of feedback logs ($S_t$) accumulated per training interaction, including explicit and implicit signals. At test time, the agent's performance depends on both the initial task context and the accumulated procedural memory.

Formalization: at each time step $t$,

$$\text{Find } \theta_t, M_t \ \text{ based on } \ S_{t-1} = \{\, s(q_i, f(\theta_i, M_i, q_i)) \,\}_{i=1}^{t-1}, \qquad \min_{\theta_t, M_t} \; l\big(q_t, f(\theta_t, M_t, q_t)\big),$$

where $f(\theta_t, M_t, q_t)$ is the system's response to query $q_t$ given model parameters $\theta_t$ and memory state $M_t$, $s(\cdot)$ is the simulated user feedback signal, and $l(\cdot)$ is the task loss.

The user action simulation uses

$$P(L \mid S) = c_L \cdot \sigma\big(k_L (S - S_{0L})\big), \qquad P(D \mid S) = c_D \cdot \sigma\big(-k_D (S - S_{0D})\big),$$

where $\sigma(x) = (1 + e^{-x})^{-1}$.
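
This logistic model transcribes directly into Python. The constants $c_L, k_L, S_{0L}, c_D, k_D, S_{0D}$ below are placeholder values, since the empirically calibrated constants are not reproduced here; copy/no-action handling is also simplified:

```python
import math
import random


def sigmoid(x: float) -> float:
    """Logistic function sigma(x) = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))


def action_probabilities(s: float,
                         c_l: float = 0.5, k_l: float = 1.0, s0_l: float = 7.0,
                         c_d: float = 0.5, k_d: float = 1.0, s0_d: float = 4.0) -> dict:
    """Map a satisfaction score s to like/dislike probabilities via the logistic model.
    All constants are illustrative placeholders, not the calibrated values."""
    p_like = c_l * sigmoid(k_l * (s - s0_l))
    p_dislike = c_d * sigmoid(-k_d * (s - s0_d))
    return {"like": p_like, "dislike": p_dislike}


def sample_action(s: float) -> str:
    """Sample one simulated user action; 'copy' is treated as implicit feedback and omitted here."""
    probs = action_probabilities(s)
    r = random.random()
    if r < probs["like"]:
        return "like"
    if r < probs["like"] + probs["dislike"]:
        return "dislike"
    return "no-action"
```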

4. Coverage: Domains, Languages, and Task Types

MemoryBench is explicitly heterogeneous. Its corpus encompasses:

  • Domains: Open-domain, Academic, Legal.
  • Languages: English (en), Chinese (zh).
  • Task Types: Based on input/output length (600-token threshold):
    • LiSo (Long-Input/Short-Output)
    • SiLo (Short-Input/Long-Output)
    • LiLo (Long-Input/Long-Output)
    • SiSo (Short-Input/Short-Output)
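
This four-way split can be expressed as a small classifier; whitespace splitting stands in for the benchmark's actual tokenizer, so the function below is only an approximation of the 600-token rule:

```python
def classify_task_type(input_text: str, output_text: str, threshold: int = 600) -> str:
    """Classify a case as LiSo / SiLo / LiLo / SiSo by the 600-token length threshold.
    Whitespace splitting is a stand-in for the benchmark's real tokenizer."""
    long_input = len(input_text.split()) > threshold
    long_output = len(output_text.split()) > threshold
    if long_input and not long_output:
        return "LiSo"
    if not long_input and long_output:
        return "SiLo"
    if long_input and long_output:
        return "LiLo"
    return "SiSo"
```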

Datasets: 11 distinct public benchmarks (20,000 cases), including Locomo, DialSim, LexEval, IdeaBench, LimitGen-Syn, WritingPrompts, HelloBench, WritingBench, JuDGE, NF-Cats, SciTechNews. Tasks span reading comprehension, writing/code generation, creativity, legal judgment, summarization, and more. Datasets are reformatted for maximal domain/task/language diversity.

5. Evaluation Methodology and Scoring

MemoryBench implements two continual learning protocols:

  • Off-policy: Exposure to historical (training) data and simulated feedback logs; evaluated on a holdout test set.
  • On-policy: Sequential feedback collection and agent memory/parameter updating during interactive simulation.
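
A schematic sketch of the two protocols, operationalizing the formalization from Section 3; `system`, `simulate_user`, `evaluate`, and the `ingest_feedback_logs`/`update_memory` methods are hypothetical stand-ins for the benchmark's actual interfaces:

```python
def off_policy_protocol(system, train_cases, test_cases, simulate_user, evaluate):
    """Off-policy: expose the system to historical cases plus simulated feedback logs,
    then evaluate once on a holdout test split."""
    logs = []
    for case in train_cases:
        response = system.respond(case)
        feedback = simulate_user(case, response)
        logs.append((case, response, feedback))
    system.ingest_feedback_logs(logs)  # single memory/parameter update from the logs
    return [evaluate(case, system.respond(case)) for case in test_cases]


def on_policy_protocol(system, cases, simulate_user, evaluate):
    """On-policy: feedback is collected and memory/parameters are updated sequentially,
    so each interaction can draw on all earlier ones."""
    scores = []
    for case in cases:
        response = system.respond(case)
        scores.append(evaluate(case, response))
        feedback = simulate_user(case, response)
        system.update_memory(case, response, feedback)  # update before the next case
    return scores
```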

Metrics:

  • Native per-dataset metrics: F1, accuracy, ROUGE-L, METEOR, BERTScore, etc.
  • Multi-metric aggregation: For complex tasks, LLM-as-Judge produces a 1–10 integer score, enabling relative comparison across tasks and domains.
  • Normalization: Results are min-max normalized or standardized to z-scores for aggregation and unbiased comparison.
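
The two normalization options can be written in a few lines of NumPy; the small epsilon guarding against constant score lists is an added safeguard, not part of the benchmark's specification:

```python
import numpy as np


def min_max_normalize(scores: list[float], eps: float = 1e-12) -> np.ndarray:
    """Rescale raw metric scores to [0, 1] for cross-task aggregation."""
    x = np.asarray(scores, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + eps)


def z_score_normalize(scores: list[float], eps: float = 1e-12) -> np.ndarray:
    """Standardize raw metric scores to zero mean and unit variance (z-scores)."""
    x = np.asarray(scores, dtype=float)
    return (x - x.mean()) / (x.std() + eps)
```

Z-scoring is the more robust choice when per-dataset score distributions differ sharply, since min-max scaling is sensitive to a single outlier score.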

Open-source: Full release of code, protocols, and simulation datasets on GitHub/HuggingFace for transparency and community extension.

6. Experimental Findings and Implications

Baseline evaluations include:

  • Naive RAG (BM25, Qwen3-Embedding)
  • Memory-augmented LLMsys: A-Mem, Mem0, MemoryOS
  • Supervised fine-tuning (SFT) for action feedback
  • Vanilla (no explicit memory or feedback integration)
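
To make the retrieval baseline concrete, here is a minimal sketch of BM25 retrieval over accumulated interaction/feedback records using the `rank_bm25` package; the prompt format and retrieval depth are illustrative choices, not the paper's configuration, and the Qwen3-Embedding dense variant is omitted:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25


def build_bm25_index(records: list[str]) -> BM25Okapi:
    """Index past interaction/feedback records with BM25 (whitespace tokenization)."""
    return BM25Okapi([r.lower().split() for r in records])


def retrieve(index: BM25Okapi, records: list[str], query: str, k: int = 5) -> list[str]:
    """Return the k records most relevant to the current query."""
    return index.get_top_n(query.lower().split(), records, n=k)


def build_prompt(query: str, retrieved: list[str]) -> str:
    """Assemble a simple retrieval-augmented prompt for the underlying LLM."""
    context = "\n".join(f"- {r}" for r in retrieved)
    return f"Relevant past interactions:\n{context}\n\nQuestion: {query}\nAnswer:"
```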

Key observations:

  • Procedural memory is generally valuable—LLMsys that utilize simulated user feedback (explicit + implicit) consistently outperform vanilla counterparts.
  • Memory system generalization is fragile—specialized architectures (A-Mem, Mem0, MemoryOS) are only robustly superior on reading comprehension/long-context retrieval; in other domains, naïve retrieval-based approaches (well-tuned RAG) are competitive or superior.
  • Efficiency bottlenecks are severe—several memory-augmented methods exhibit extremely high time-per-case for memory operations and inference; scaling procedural memory can become prohibitive as feedback logs grow.
  • Domain/task sensitivity—no memory approach achieves across-the-board dominance. Writing generation, legal, creative, and open-ended tasks are often served as well as or better by optimized retrieval/fine-tuning baselines. On reading comprehension, advanced memory integration offers advantages.
  • Generalizability is unproven—most prior evaluations use homogeneous, single-domain tasks. In MemoryBench, state-of-the-art approaches often fail to generalize across diverse task, domain, and feedback settings.
  • Action feedback effectiveness—in action feedback experiments (“like”/“copy”), RAG and SFT baselines are competitive, indicating that mere access to feedback logs is insufficient for robust continual learning.

7. Research Implications and Future Directions

Results from MemoryBench delineate urgent priorities in memory-centric and continual learning research for LLMsys:

  • Realism and breadth in evaluation—Synthetic or homogeneous benchmarks are inadequate for revealing the weaknesses of state-of-the-art continual learning or memory systems.
  • Algorithmic innovation—There is a critical need for memory and continual learning algorithms (both parametric and non-parametric) that can leverage feedback without succumbing to noise amplification, efficiency breakdown, or overfitting.
  • Prioritizing efficiency and scalability—Memory architectures must address operational cost: time, memory overhead, and update latency, particularly as feedback accumulates.
  • Selective memory retrieval/feedback integration—Noise, redundancy, and context mixing in feedback logs challenge existing approaches; robust filtering and selective memory update are necessary.
  • Foundations for lifelong, heterogeneous continual learning agents—Benchmarks and algorithms must support agents capable of adapting in highly diverse, unpredictable online settings.
  • Open, reproducible platforms—Comprehensive release of data, protocols, and analysis frameworks accelerates progress and reproducibility across the research community.

MemoryBench thus provides a foundation for rigorous, realistic, and broad-based evaluation of LLM memory and continual learning—a critical step for advancing adaptive, user-feedback-efficient, and scalable LLM systems.
