LongMemEval Benchmark
- LongMemEval is a benchmark that defines and evaluates five key memory tasks in LLM chat assistants, including information extraction, multi-session reasoning, knowledge updates, temporal reasoning, and safe abstention.
- It employs a hybrid evaluation protocol with both LLM-based judgment and evidence-grounded retrieval to assess accuracy, recall, and temporal filtering in extended dialogs.
- Benchmark results reveal significant performance gaps in current LLM systems, driving innovations in memory pipeline design and retrieval frameworks for real-world applications.
LongMemEval is a comprehensive benchmark developed to rigorously assess the long-term, interactive memory capabilities of LLM-driven chat assistants. Motivated by the growing deployment of LLM-based agents in real-world applications that require recalling, updating, and reasoning over extended dialog histories across multiple sessions, LongMemEval targets the core technical limitations of memory architectures in sustained, multi-session interactions. It is specifically constructed to expose substantive deficiencies in information extraction, reasoning across conversational timelines, and tracking and updating of user knowledge, while simulating realistic user-assistant interaction patterns with scalable history lengths (Wu et al., 2024).
1. Benchmark Scope, Task Definitions, and Dataset Construction
LongMemEval evaluates five core long-term memory abilities in LLM-driven chat assistants:
- Information Extraction (IE): Recall of facts buried within a session, regardless of which agent introduced them.
- Multi-Session Reasoning (MR): Aggregation or comparison of information scattered over multiple sessions (e.g., "compare my travel plans from two months apart").
- Knowledge Updates (KU): Correct adaptation to dynamic user information (e.g., a changed address superseding old values).
- Temporal Reasoning (TR): Correct interpretation and use of explicit or implicit temporal references (e.g., "the last time we talked about X…").
- Abstention (ABS): The ability to decline to answer when information was never mentioned in the history (false premises).
The LongMemEval dataset comprises 500 question instances, each with detailed evidence decomposition (1–6 evidence snippets per question, with optional timestamps), embedded within user–assistant dialogues. Each dialogue is constructed by LLM simulation and human refinement, producing 8–10-turn sessions per evidence statement. The evidence-bearing sessions are then hidden within a configurable number of unrelated, distractor session "haystacks." Two primary configurations are used: LongMemEval-S (≈115K tokens, ≈50 sessions) and LongMemEval-M (≈1.5M tokens, ≈500 sessions) (Wu et al., 2024).
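A single benchmark instance can be pictured as the following record. The field names here are illustrative, not the released dataset's actual schema; consult the repository for the real format:

```python
# Illustrative structure of one LongMemEval question instance.
# Field names are hypothetical; see the released dataset for the real schema.
instance = {
    "question_id": "ku_0042",
    "question_type": "knowledge-update",   # one of IE, MR, KU, TR, ABS
    "question": "What is my current mailing address?",
    "answer": "42 Elm Street",
    "evidence_sessions": [                  # 1-6 evidence snippets, optional timestamps
        {"session_id": 3,  "timestamp": "2023/04/10", "fact": "I live at 7 Oak Avenue."},
        {"session_id": 41, "timestamp": "2023/09/02", "fact": "I just moved to 42 Elm Street."},
    ],
    "haystack_sessions": 50,                # distractor sessions (~115K tokens in the S config)
}

# For a knowledge-update question, the latest timestamped evidence supersedes older facts.
latest = max(instance["evidence_sessions"], key=lambda e: e["timestamp"])
print(latest["fact"])
```

The knowledge-update case illustrates why evidence carries timestamps: a correct system must prefer the most recent fact rather than any retrieved match.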
2. Evaluation Protocols and Metrics
LongMemEval employs a hybrid evaluation paradigm combining LLM-based judgment with evidence-grounded retrieval for systematic, reproducible assessment:
- Question Answering Accuracy: Model-generated answers are scored by an LLM judge (GPT-4o) to determine semantic correctness. Agreement with human judges exceeds 97%. Formally, accuracy is
$\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^N \mathbbm{1}\{\text{judge}(\hat{a}_i, a_i) = \text{correct}\}$
- Memory Recall: If the system exposes its memory retrievals, Recall@k and NDCG@k are reported, measuring whether ground-truth evidence appears in the retrieved set.
$\mathrm{Recall}@k = \frac{1}{N} \sum_{i=1}^N \mathbbm{1}\{\text{evidence in top-}k\}$
- Temporal Filtering: For TR, auxiliary LLMs determine relevant time windows, boosting temporal recall by up to 11.4% at Recall@5 when filtering histories by inferred date ranges.
This explicit separation of retrieval and answer generation, as well as fine-grained correctness definition (including abstention scoring), distinguishes LongMemEval from prior benchmarks.
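The retrieval metrics above can be sketched as follows. The session IDs and relevance sets are made-up illustrations, and the binary-relevance NDCG here stands in for whatever graded variant a given system reports:

```python
import math

def recall_at_k(retrieved, relevant, k):
    """True if any ground-truth evidence session appears in the top-k retrievals."""
    return len(set(retrieved[:k]) & set(relevant)) > 0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG@k over a single ranked retrieval list."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

# Hypothetical retrieval run: sessions 3 and 41 hold the evidence.
retrieved = [17, 3, 8, 41, 29]
relevant = {3, 41}
print(recall_at_k(retrieved, relevant, 5))            # evidence found within top-5
print(round(ndcg_at_k(retrieved, relevant, 5), 3))    # rank-sensitive score
```

NDCG penalizes evidence that is retrieved but ranked low, which matters when only the top-$k$ values fit into the downstream reading budget.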
3. Baselines, Observed System Performance, and Empirical Gaps
LongMemEval reveals that neither commercial nor open-source LLM systems demonstrate robust long-term memory over extended interaction sequences:
- Commercial Assistants:
- On short histories (3–6 sessions), ChatGPT (GPT-4o) drops from 91.8% accuracy with full (offline) reading to 57.7% in online settings.
- Coze (GPT-4o) achieves only 32.9% online.
- Capability breakdown (approximate): IE (80–100%), MR (15–65%), KU (20–85%), TR (5–65%).
- Common failure: knowledge overwrite/compression of older facts, or omission of facts never directly mentioned on user side (Wu et al., 2024).
- Long-Context LLMs:
- When scaling up to 115K-token histories (LongMemEval-S), observed 30–45% absolute accuracy drops are typical. For GPT-4o, full-context reading yields 60.6–64% (vs. 87–92% oracle).
- Llama 3.1-8B: 42–45% (vs. 71% oracle).
- Phi-3-Medium: 34–38% (vs. 70–72% oracle).
- The drop is especially acute for multi-session reasoning and knowledge updates (Wu et al., 2024).
- Memory-Augmented Architectures:
- Systems such as Zep (temporal knowledge graph), Mem0, MemoryOS, EverMemOS, and TiMem are evaluated extensively (see below).
- Key finding: Optimized memory pipelines (e.g., round-level value granularity, fact-augmented keys, temporal query expansion) can recover 5–10 percentage points of accuracy. However, no system achieves near-oracle performance on multi-session or temporal tasks.
- Ablation studies confirm that engineering each pipeline stage (value, key, query expansion, downstream reading) adds 3–7% individually (Wu et al., 2024).
4. Architectures, Retrieval Frameworks, and Design Insights
LongMemEval facilitated the emergence and assessment of several unified memory frameworks for chat assistants:
- Three-Stage Pipeline: Any memory-augmented assistant is operationalized as a succession of (1) Indexing (value and key design), (2) Retrieval (query expansion, semantic matching), and (3) Reading (answer extraction and aggregation).
- Key Optimizations:
- Round-level value granularity (vs. entire session) improves QA accuracy up to 6% under fixed token budgets.
- Fact-augmented key expansion, where keyphrases or user facts supplement the raw key, increases recall by 4% and QA accuracy by 5%.
- Time-aware query expansion for temporal questions filters irrelevant memory slices, adding up to 11.4% recall gain for those sub-tasks.
- Ablation and Reading Strategies: The use of JSON-structured input and chain-of-note reading protocols yield further improvements (>10%). All reported improvements are statistically significant (paired bootstrap, p<0.01) (Wu et al., 2024).
A synthesized formalization is:
- At time $t$, ingest session $S_t$ as a collection of tuples $(k_i, v_i, \tau_i)$, where $k_i$ is the memory key (possibly fact-augmented), $v_i$ is the recallable value, and $\tau_i$ is the timestamp.
- To answer query $q$, first infer the relevant time window $[\tau_{\mathrm{start}}, \tau_{\mathrm{end}}]$ (for TR), filter to those tuples with $\tau_i \in [\tau_{\mathrm{start}}, \tau_{\mathrm{end}}]$, then score by $\mathrm{sim}(q, k_i)$ and return the top-$K$ values for LLM-based answer synthesis.
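The three-stage pipeline can be sketched as a minimal key-value memory with time filtering and similarity scoring. The bag-of-words similarity and the sample entries are placeholders; a real system would use dense embeddings and LLM-inferred time windows:

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    key: str        # retrieval key, possibly fact-augmented
    value: str      # recallable content (e.g., one dialogue round)
    timestamp: str  # ISO-style date used for temporal filtering

def similarity(query: str, key: str) -> float:
    """Toy bag-of-words Jaccard overlap; stands in for embedding similarity."""
    q, k = set(query.lower().split()), set(key.lower().split())
    return len(q & k) / max(len(q | k), 1)

def retrieve(memory, query, top_k=2, window=None):
    """Filter by the inferred time window (if any), score keys, return top-k values."""
    pool = [m for m in memory
            if window is None or window[0] <= m.timestamp <= window[1]]
    pool.sort(key=lambda m: similarity(query, m.key), reverse=True)
    return [m.value for m in pool[:top_k]]

memory = [
    MemoryEntry("user address move", "I moved to 42 Elm Street.", "2023-09-02"),
    MemoryEntry("user address", "I live at 7 Oak Avenue.", "2023-04-10"),
    MemoryEntry("weekend hiking trip", "We discussed a hike near the lake.", "2023-06-20"),
]

# Time-aware query expansion: restrict recall to the inferred window before scoring,
# so the stale pre-move address cannot outrank the current one.
print(retrieve(memory, "what is the user address now", top_k=1,
               window=("2023-07-01", "2023-12-31")))
```

Without the window filter, both address entries compete on key similarity alone, which is exactly the knowledge-update failure mode the benchmark measures.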
5. Extended System Studies on LongMemEval
Several leading retrieval-based and compositional memory systems have been benchmarked on LongMemEval, revealing persistent bottlenecks and areas for future development:
- Zep/Graphiti: Temporal knowledge graph instance, achieving 71.2% (GPT-4o) overall accuracy with 2.6s latency versus 60.2% for vanilla full-context at 29s. Temporal-reasoning and multi-session QA see large absolute gains (17.3pp and 13.6pp) (Rasmussen et al., 20 Jan 2025).
- TiMem: Temporal hierarchical memory (Temporal Memory Tree) achieves state-of-the-art 76.88% (GPT-4o-mini) on LongMemEval-S, reducing memory footprint by 27% relative to other systems. Gains are most pronounced in knowledge update (+9.49pp) and multi-session reasoning (+12.03pp) (Li et al., 6 Jan 2026).
- EverMemOS: Engram-inspired, three-phase (episodic, semantic, reconstructive) memory achieves 83.0% overall accuracy (outperforming MemOS by +5.2pp), with largest gains on knowledge update (+15.5pp) and assistant-actions (+17.9pp) (Hu et al., 5 Jan 2026).
- RMM: Reflective Memory Management yields +10pp accuracy improvement over baseline retrieval, through a combination of prospective session-level topic summarization and RL-guided reranking for adaptive recall (Tan et al., 11 Mar 2025).
- ENGRAM-R: Typed retrieval with compact fact-card citation achieves 59.8% overall (vs. 38.0% full-context) while reducing input tokens by >95% and reasoning tokens by >77% (Patel et al., 17 Nov 2025).
- Recall-with-Reasoning (RwR): Chain-of-thought distillation for Mamba SSMs boosts long-memory accuracy from 8.0%→11.4% at 100K context length on LongMemEval (Ma et al., 6 May 2025).
6. Key Findings, Analytical Trends, and Open Technical Problems
Empirical analysis of LongMemEval yields several non-obvious findings with industry and research implications:
- High retrieval accuracy does not imply robust editing, state tracking, or temporal awareness. Overreliance on classical needle-in-haystack search overestimates true memory capacity.
- Error modes are highly task-specific—multi-session and temporal questions see pronounced collapse (often <50% accuracy), with persistent false positive and negative biases in retrieval.
- Ablation and cross-system comparisons reveal that naive context truncation and token-budget solutions do not scale; multi-session composition and update-tracking remain open algorithmic problems (Wu et al., 2024, Li et al., 6 Jan 2026, Hu et al., 1 Feb 2026, Hu et al., 5 Jan 2026).
- Memory system efficiency—measured as end-to-end latency and retrieval-context size—emerges as a critical constraint for real-world deployment. Sophisticated memory KGs (e.g., Zep, EverMemOS) deliver both higher accuracy and sub-3 second response time by condensing recall to 1–3K tokens (Rasmussen et al., 20 Jan 2025, Hu et al., 5 Jan 2026).
- Limitations are acute in online and multimodal settings, with real-time update incorporation, cross-modal content, and open-ended knowledge-grounding identified as priorities for future benchmarking (Wu et al., 2024, Tan et al., 11 Mar 2025).
7. Impact, Reproducibility, and Future Directions
LongMemEval is widely adopted as a rigorous standard for evaluating long-term memory in conversational AI, driving innovation in:
- Memory pipeline design: informing best practices for value granularity, fact-based indexing, and time-filtering.
- Real-world system deployment: demonstrating that modern chat assistants (commercial and open-source) can experience ≥30pp accuracy collapse under realistic, large-scale dialog histories when unaugmented by advanced memory modules.
- Algorithmic development: catalyzing new families of memory controllers, RL-based retrievers, temporal/hierarchical consolidation, and compositional reasoning architectures that demonstrably close gaps on diagnostic tasks (Wu et al., 2024, Rasmussen et al., 20 Jan 2025, Li et al., 6 Jan 2026, Hu et al., 5 Jan 2026, Tan et al., 11 Mar 2025).
Ongoing challenges include end-to-end memory architecture co-training (retriever+reader), scaling to multi-modal and web-grounded histories, support for continual learning across user populations, and principled, tractable evaluation protocols under adversarial or noisy conditions. The public availability of LongMemEval (github.com/xiaowu0162/LongMemEval) supports broad community engagement and extension.
References: (Wu et al., 2024, Rasmussen et al., 20 Jan 2025, Li et al., 6 Jan 2026, Hu et al., 5 Jan 2026, Tan et al., 11 Mar 2025, Patel et al., 17 Nov 2025, Ma et al., 6 May 2025, Hu et al., 1 Feb 2026)