LoCoMo and LongMemEval_S Benchmarks
- These benchmarks assess LLM long-term memory retention and multi-session reasoning through detailed QA accuracy and efficiency metrics.
- LoCoMo and LongMemEval_S are defined by structured datasets featuring multi-modal dialogues and diverse reasoning challenges such as temporal ordering and multi-hop inference.
- The benchmarks drive advances in memory modularization, hierarchical retrieval, and event-centric frameworks, highlighting trade-offs between recall and efficiency.
LoCoMo and LongMemEval are leading benchmarks for evaluating the long-term memory and reasoning capabilities of LLMs in extended, multi-session conversational settings. Both benchmarks probe not only factual retention but also complex temporal, multi-hop, and cross-session inference in configurations that stress current memory architectures, retrieval schemes, and answer generation protocols. They have become reference datasets in the study of persistent, agentic memory for LLM-based dialog agents.
1. Dataset Construction and Composition
LoCoMo was initially introduced to address the challenge of evaluating LLMs on very long-term, persona-grounded, multi-modal dialogues (Maharana et al., 27 Feb 2024). Dialogues feature two generative agents with explicit personas and temporal event graphs, spanning up to 35 sessions, 300+ turns (≈9,000 tokens) per conversation, and include text plus image exchanges. Human annotators edit for consistency, grounding, and event coherence. LoCoMo’s fully constructed dataset contains 50 dialogues, averaging 19.3 sessions each.
The associated LoCoMo benchmark (as used in subsequent works) typically comprises between 1,500 and ~2,000 question–answer (QA) pairs (Huang et al., 3 Nov 2025, Zhou, 21 Nov 2025), evenly split among reasoning categories:
- Single-hop: direct factual retrieval.
- Multi-hop: chaining two or more events/facts across sessions.
- Temporal: ordinal or precedence questions, e.g., “When did X do Y?”
- Open-domain: requiring external/world knowledge integrated with dialogue content.
- Adversarial/unanswerable: designed to probe consistency under misleading/distracting context (Maharana et al., 27 Feb 2024).
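For orientation, the snippet below sketches the shape of a LoCoMo-style QA instance as a plain Python structure. The field names and the example content are illustrative assumptions for exposition, not the official dataset schema.

```python
# Illustrative shape of a LoCoMo-style QA instance.
# Field names and content are hypothetical, not the official schema.
qa_example = {
    "question": "When did speaker A start the new job first mentioned in session 3?",
    "category": "temporal",   # one of: single_hop, multi_hop, temporal,
                              #         open_domain, adversarial
    "answer": "Shortly after the conversation in session 3.",
    "evidence": [             # pointers into the multi-session dialogue history
        {"session": 3, "turn": 12},
        {"session": 7, "turn": 4},
    ],
}
```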
LongMemEval_S is the widely adopted designation for the LongMemEval benchmark (Wu et al., 14 Oct 2024), with the subscript "S" read as the “standard” 500-question suite for scalable multi-session dialogue memory assessment (Huang et al., 3 Nov 2025, Zhou, 21 Nov 2025). Each instance comprises ≈115,000 tokens of user–assistant chat history over ∼50 sessions. Questions are distributed across six types: single-session-user, single-session-assistant, single-session-preference, multi-session (cross-session reasoning), temporal reasoning, and knowledge update.
There is no distinct “LongMemEval_S” slice in core benchmark sources; rather, LongMemEval denotes the standard, full 500-item evaluation used in almost all recent experiments (Huang et al., 3 Nov 2025, Zhou, 21 Nov 2025, Patel et al., 17 Nov 2025).
2. Evaluation Protocols and Key Metrics
Benchmarks use LLM-judged answer quality and retrieval metrics tailored to long-horizon dialogue:
Primary metrics (Huang et al., 3 Nov 2025, Wu et al., 14 Oct 2024, Zhou, 21 Nov 2025):
- QA Accuracy (Acc.): fraction of correct answers as determined by a strong LLM grader (e.g., GPT-4o), formalized as $\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} c_i$, where $c_i = 1$ iff the $i$-th answer is judged correct and $c_i = 0$ otherwise.
- Recall@k: fraction of questions for which the ground-truth evidence appears in the top-$k$ retrieved memory units; reported at a fixed $k$ for LoCoMo/LongMemEval and at additional cutoffs (e.g., $k = 10$) in ablations.
- QA F1: harmonic mean of precision and recall computed over answer-span tokens, giving partial credit for token-level overlap with the gold answer.
- BLEU-1 and ROUGE: for summarization and multi-modal tasks (LoCoMo only).
- Efficiency: total input tokens per query, reasoning tokens (chain-of-thought length), and wall-clock query latency.
Category-specific metrics address long-horizon reasoning by isolating multi-session, temporal, and knowledge-update tasks (Wu et al., 14 Oct 2024, Huang et al., 3 Nov 2025).
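As a concrete reference, the sketch below shows one minimal way these metrics are typically computed; the LLM-grader call is abstracted into a boolean flag, and the token-level F1 follows the common SQuAD-style overlap definition rather than any benchmark-specific script.

```python
from collections import Counter

def qa_accuracy(judgments: list[bool]) -> float:
    """Acc = (1/N) * sum(c_i), where each c_i comes from an LLM grader."""
    return sum(judgments) / len(judgments)

def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    """1.0 if any ground-truth evidence unit appears in the top-k retrieved memories."""
    return float(any(r in gold_ids for r in retrieved_ids[:k]))

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over answer spans (partial-match credit)."""
    pred_tokens, ref_tokens = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```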
3. Memory and Retrieval Frameworks
The benchmarks drive research on memory-centric LLM architectures, retrieval, and update methods:
- Flat RAG: Dense retrieval over raw session or chunk memories.
- Hierarchical/Graph-based: Multi-level memory (LiCoMemory’s CogniGraph), event-centric graphs of elementary discourse units (EMem/EMem-G) (Huang et al., 3 Nov 2025, Zhou, 21 Nov 2025).
- Temporal-aware filters: Query expansion, temporal decay weighting, and context compression (e.g., Weibull decay for recency modulation; see the sketch at the end of this subsection).
- Fact-card construction and controlled citation: Inline justifications in the model output that link answers to specific retrieved memory units (ENGRAM-R; Patel et al., 17 Nov 2025).
Optimizations include session decomposition for retrieval granularity, fact-augmented key expansion (fact lists as retrieval keys), time-aware query expansion (temporal constraints), and structured reading (chain-of-note, JSON augmentation) (Wu et al., 14 Oct 2024).
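The temporal-decay idea can be made concrete with a small sketch: each memory unit's dense-similarity score is modulated by a Weibull-style recency term. The shape/scale parameters and the interpolation rule here are illustrative assumptions, not the settings of any particular system.

```python
import math

def weibull_recency(age_days: float, scale: float = 30.0, shape: float = 1.2) -> float:
    """Weibull survival function: weight decays toward 0 as the memory ages."""
    return math.exp(-((age_days / scale) ** shape))

def rerank(candidates, age_of, alpha: float = 0.7):
    """Combine dense similarity with a recency weight.

    candidates: list of (memory_id, dense_similarity) pairs.
    age_of: maps memory_id -> age of that memory in days at query time.
    alpha: interpolation between pure similarity (1.0) and pure recency (0.0).
    """
    scored = [
        (mid, alpha * sim + (1 - alpha) * weibull_recency(age_of(mid)))
        for mid, sim in candidates
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```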
4. Benchmark Results and Baseline Comparison
Result trends across LoCoMo and LongMemEval:
- Event-centric memory (EMem, EMem-G) (Zhou, 21 Nov 2025): QA accuracy of 76–85% (gpt-4o-mini, gpt-4.1-mini backbones), with especially strong gains in temporal and multi-session categories.
- LiCoMemory (Huang et al., 3 Nov 2025): Outperforms other graph and fact-extraction baselines (Mem0, Zep, A-Mem, LoCoMo-RAG) across both benchmarks, achieving 73.8% accuracy/76.6% recall on LongMemEval (GPT-4o-mini backbone) and consistently reducing latency by 10–40%. Largest margin improvements are in multi-session (+26.6 pp Acc.) and temporal (+15.9 pp Acc.) subsets.
- ENGRAM-R (Patel et al., 17 Nov 2025): On LongMemEval, reduces input token budget by 95.5%, reasoning tokens by 77.8%, and improves accuracy by +21.8 pp overall (notably +30.1 pp in multi-session, +13.5 pp in temporal).
- Full-context LLMs: Strong accuracy drop (∼30–55%) when scaling to full dialogue history. RAG over observations lifts QA F1 but saturates quickly with context dilution (Wu et al., 14 Oct 2024, Maharana et al., 27 Feb 2024).
- Best-practice findings from ablations: recall-oriented LLM filters, event-centric memory, hierarchical graph structures, and fact-card citation yield the greatest robustness on long-horizon benchmarks.
| Model/Method | LoCoMo Accuracy | LongMemEval Accuracy | Retrieval Efficiency | Temporal/Multi-Session Gains |
|---|---|---|---|---|
| Full-context LLM | ∼30–37% | ∼38–40% | N/A | Poor |
| Hierarchical (e.g., LiCoMemory) | ∼67–73% | ∼69–76% | ✓ | +15–26 pp |
| Event-centric (EMem-G) | 78–85% | 77–85% | ✓ | ✓ |
| ENGRAM-R | 76% | 60% (overall slice) | Superior | +13–30 pp |
All numbers drawn explicitly from benchmark leaderboards in (Huang et al., 3 Nov 2025, Patel et al., 17 Nov 2025, Zhou, 21 Nov 2025, Wu et al., 14 Oct 2024).
5. Core Challenges and Error Modes
Analysis across studies highlights persistent challenges:
- Temporal Reasoning: LLMs are consistently weakest when temporal disambiguation, event ordering, or recency are required. Even with optimized retrieval or long context, temporal accuracy lags by ≥50 pp versus humans (Maharana et al., 27 Feb 2024, Wu et al., 14 Oct 2024).
- Multi-session Aggregation: Chaining facts or resolving state changes over distant sessions results in substantial error, especially for generic RAG or full-history approaches.
- Update and Consistency: Detecting entity attribute changes and resisting distraction from irrelevant, stale, or contradictory evidence is nontrivial. Memory frameworks with hierarchical and update-tracking logic (LiCoMemory, EMem-G) outperform flat or purely dense retrieval approaches.
- Context “noise” and dilution: Larger retrieval windows or chunk-based RAG lead to accuracy saturation or degradation as more unrelated content is introduced (Wu et al., 14 Oct 2024).
- Efficiency vs. Quality trade-off: Practical inference requires compact, information-rich evidence pools; best results use ≤1,000 tokens for QA input from memory, far below full-history context (Patel et al., 17 Nov 2025, Zhou, 21 Nov 2025).
6. Practical Implications and Methodological Trends
Long-term conversational memory benchmarks have catalyzed methodological advances:
- Memory modularization: Three-stage frameworks that split indexing, retrieval, and reading, with explicit control points for value chunking, keying, and query transformation, are now standard (Wu et al., 14 Oct 2024); a minimal pipeline sketch follows this list.
- Recall-biasing: LLM-based filters tuned for high recall outperform manual or score-based pruning.
- Temporal and event representations: Structured memory arranges atomic facts/events with timestamps and causal/participant links to enable associative and time-sensitive retrieval.
- Structured prompts and answer generation: Forcing citation, structured answer formats (JSON, chain-of-note), and answer rationales improves answer faithfulness and efficiency.
- Ablation studies: Removal of LLM-based filters or structured event representations causes 4–8 percentage point drops in QA accuracy, underscoring their necessity (Zhou, 21 Nov 2025).
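A minimal skeleton of the indexing/retrieval/reading decomposition referenced above is sketched below. The class and method names are illustrative, and the embedding model, vector store, and LLM are passed in as stubs rather than tied to any specific library.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryUnit:
    text: str                  # atomic fact or event summary
    session_id: int
    timestamp: str             # ISO-8601 timestamp for temporal filtering
    keys: list[str] = field(default_factory=list)  # fact-augmented retrieval keys

class MemoryPipeline:
    """Illustrative three-stage pipeline: index -> retrieve -> read."""

    def __init__(self, embed, store, llm):
        # embed: text -> vector; store: vector index; llm: prompt -> answer string
        self.embed, self.store, self.llm = embed, store, llm

    def index(self, session_turns: list[str], session_id: int, timestamp: str) -> None:
        # Stage 1: decompose the session into atomic units and key them for retrieval.
        for turn in session_turns:
            unit = MemoryUnit(text=turn, session_id=session_id, timestamp=timestamp)
            self.store.add(self.embed(unit.text), unit)

    def retrieve(self, question: str, k: int = 10) -> list[MemoryUnit]:
        # Stage 2: time-aware query expansion could be applied to `question` here.
        return self.store.search(self.embed(question), top_k=k)

    def read(self, question: str, units: list[MemoryUnit]) -> str:
        # Stage 3: structured reading with inline citation of retrieved memory units.
        notes = "\n".join(f"[{i}] ({u.timestamp}) {u.text}" for i, u in enumerate(units))
        prompt = (
            "Answer using only the cited notes; reference them as [i].\n"
            f"Notes:\n{notes}\n\nQuestion: {question}"
        )
        return self.llm(prompt)
```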
A plausible implication is that future systems will require highly structured, timestamped, event-oriented persistent memories tightly integrated with dense and symbolic retrieval, along with adaptive answer construction modules.
7. Benchmark Evolution and Limitations
LoCoMo and LongMemEval continue to evolve:
- Multi-modal and world knowledge integration: LoCoMo explicitly incorporates images and open-domain reasoning, but persistent performance gaps remain in multi-modal and adversarial categories (Maharana et al., 27 Feb 2024).
- No true “LongMemEval_S” split: The nomenclature “LongMemEval_S” is best interpreted as the canonical 500-question suite, not a distinct subset or configuration (Huang et al., 3 Nov 2025).
- Distance from saturation: Human-level performance remains a distant target, especially in temporal/multi-session categories (temporal: 20.3 for GPT-3.5-turbo-16K vs. 92.6 for humans; adversarial: 2.1 for LLMs vs. 89.4 for humans (Maharana et al., 27 Feb 2024)).
- Cross-paper consistency: Recent studies have synchronized protocols and categories, ensuring comparability across graph/event-centric and fact-extraction methods.
- Enduring error lines: Long-horizon dialogue reasoning, state tracking, and multi-modal integration are ongoing bottlenecks for current architectures.
These benchmarks are now indispensable for research into scalable, reliable, and temporally sophisticated conversational agents and memory-augmented LLMs. Their impact is evident in the rapid dissemination and benchmarking of agentic memory architectures in the dialog systems literature (Huang et al., 3 Nov 2025, Zhou, 21 Nov 2025, Wu et al., 14 Oct 2024, Patel et al., 17 Nov 2025, Maharana et al., 27 Feb 2024).