LoCoMo Benchmark for Long-Term Memory

Updated 7 December 2025
  • LoCoMo Benchmark is a comprehensive suite assessing long-term conversational memory with rich multi-session dialogues and event-grounded tasks.
  • It employs rigorous evaluation of QA, event summarization, and multimodal dialogue generation to benchmark both LLM and SLM performance.
  • The benchmark drives innovation in memory architectures and neuro-symbolic reasoning, setting standards for temporal inference and efficient context retrieval.

LoCoMo Benchmark (“Long-term Conversational Memory”) is a diagnostic and evaluation suite designed to assess the memory and reasoning capabilities of large language models (LLMs) and small language models (SLMs) over very long, multi-session dialogues. Originating from the need to measure the limits of memory-augmented agents and retrieval-augmented generation (RAG) in realistic, complex conversational settings, LoCoMo emphasizes tasks that require cross-session, long-horizon consistency, temporal reasoning, and multimodal understanding, establishing itself as a comprehensive standard for memory-centric dialogue research. The benchmark is used both as a direct evaluation platform and as a foundational resource for specialized derivative benchmarks (e.g., for temporal reasoning), and has catalyzed the development of numerous state-of-the-art memory architectures and distillation techniques.

1. Dataset Construction and Structure

LoCoMo is predicated on extremely long, multi-session conversation logs, each with rich temporal, personal, and event-driven grounding. Dialogues are generated between LLM-architected virtual agents, seeded with multi-sentence personas and causal, temporally organized event graphs encompassing up to 25 events over 6–12 months. Each dialogue spans up to 32 sessions, with an average of ≈600 turns (≈16,000 tokens), and incorporates images (via web search and captioning) into the conversational context (Bini et al., 4 Dec 2025, Maharana et al., 27 Feb 2024).

Constructed using a hybrid machine–human pipeline, the data undergoes human verification and correction for coherence, causal consistency, and event grounding. Salient facts, preferences, and event observations are distilled throughout, enabling fine-grained annotation across sessions. Each dataset instance records, for each turn: session-level summaries, temporal metadata, long-term episodic memory objects, and images with captions.
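
The per-dialogue record just described can be pictured as a nested structure like the one below; the class and field names are illustrative assumptions, not the dataset's released schema.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One dialogue turn; fields are illustrative, not the official schema."""
    speaker: str
    text: str
    image_url: str | None = None       # shared image, if any
    image_caption: str | None = None   # automatic caption used for grounding

@dataclass
class Session:
    session_id: int
    date: str                          # temporal metadata, e.g. "2023-05-14"
    turns: list[Turn] = field(default_factory=list)
    summary: str = ""                  # session-level summary
    observations: list[str] = field(default_factory=list)  # salient facts/preferences

@dataclass
class Dialogue:
    dialogue_id: str
    personas: dict[str, str]           # multi-sentence persona per speaker
    event_graph: list[dict] = field(default_factory=list)  # causally/temporally ordered events
    sessions: list[Session] = field(default_factory=list)
```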

Core dataset statistics (original release):

| Statistic | Value |
|---|---|
| # Dialogues | 10 (MemLoRA), 50–100 (full/augmented LoCoMo) |
| Avg. sessions per dialogue | ≈19–32 |
| Avg. turns per dialogue | ≈305–600 |
| Avg. tokens per dialogue | ≈9,000–16,000 |
| Avg. images per dialogue | ≈32 |

2. Task Suite and Formal Definitions

LoCoMo comprises three principal evaluation tasks:

  • Question Answering (QA): Given the entire multi-session conversation history, a query referencing (potentially distant) facts or events is posed. QA is subdivided into single-hop (intra-session), multi-hop (cross-session), temporal reasoning (date/order/interval inference), open-domain, and adversarial (unanswerable) categories (Maharana et al., 27 Feb 2024, Bini et al., 4 Dec 2025, Patel et al., 17 Nov 2025).
  • Event Summarization: Models are asked to generate a textual summary of all events in a given time window, necessitating retrieval and aggregation across sessions and event graph nodes.
  • Multi-Modal Dialogue Generation: Requires models to predict the next dialogue turn, including both text and associated images, with metrics for text quality and image-text semantic alignment.

Input formats are tailored per task: full dialogue history for non-memory systems; retrieved timeline summaries (paired with inferred dates) for memory-augmented models; and, for VQA, native images rather than captions.
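
For concreteness, a QA item and its prompt assembly might look like the sketch below; the keys, category names, and helper function are illustrative assumptions rather than the released data format.

```python
# Hypothetical shape of a LoCoMo QA item; keys and example values are illustrative.
qa_item = {
    "question": "How many months after moving did Joanna adopt her dog?",
    "category": "temporal",      # single-hop | multi-hop | temporal | open-domain | adversarial
    "answer": "three months",    # gold answer; adversarial items are unanswerable
    "evidence": [("session_07", 12), ("session_19", 3)],  # cross-session grounding turns
}

def build_qa_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble the QA input: the full dialogue history for non-memory systems,
    or retrieved, date-stamped timeline summaries for memory-augmented models."""
    context = "\n".join(context_chunks)
    return f"{context}\n\nQuestion: {question}\nAnswer concisely:"
```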

3. Evaluation Protocols and Metrics

LoCoMo employs rigorous, multi-faceted evaluation metrics to capture both surface-level and deep semantic performance:

  • QA:
    • Token-level F1: $\mathrm{F1} = \frac{2\cdot\mathrm{Prec}\cdot\mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}}$, computed between predicted and gold answer tokens.
    • LLM-as-Judge (J): Percentage of answers judged CORRECT by a high-capacity LLM (e.g., GPT-OSS-120B) when compared to gold labels.
    • Composite Metric (L): Average of ROUGE-1 recall, METEOR, BERTScore-F1, and SentenceBERT similarity, $L = (R_1 + M + B_1 + S)/4$ (Bini et al., 4 Dec 2025); a computation sketch follows this list.
  • Unanswerable Detection: Precision, recall, and F1 for unanswerable items (SQuAD 2.0-style).
  • Event Summarization: ROUGE scores and FactScore (atomic fact F1).
  • Multimodal QA & Dialogue Generation: BLEU, ROUGE-L, and MM-Relevance metrics for image-text pairs.
  • Efficiency (memory architectures): Token counts (input and reasoning) and latency (median and tail).
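
A minimal sketch of the token-level F1 and the composite metric L defined above; whitespace tokenization and the use of precomputed component scores are simplifying assumptions.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between predicted and gold answers (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def composite_L(rouge1_recall: float, meteor: float, bertscore_f1: float, sbert_sim: float) -> float:
    """Composite metric L = (R1 + M + B1 + S) / 4 over precomputed component scores."""
    return (rouge1_recall + meteor + bertscore_f1 + sbert_sim) / 4
```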

The QA benchmark specifies distinct splits: 70% train, 10% validation, and 20% test, with strict no-leakage protocols.

4. Baseline Architectures and Memory Pipelines

LoCoMo catalyzed the design of multi-stage memory systems, typified by the Mem0 pipeline:

  1. Knowledge Extraction: Extract salient facts, preferences, or events from dialogue history per turn.
  2. Memory Update: Integrate newly extracted elements into persistent memory via explicit ADD/UPDATE/DELETE/NONE commands.
  3. Memory-Augmented Generation: Retrieve the top-k relevant facts for each query and prepend them to the prompt for answer generation (Bini et al., 4 Dec 2025).
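
A minimal sketch of this extract / update / retrieve-and-generate loop, assuming a generic `call_llm` completion function and a `retrieve` similarity helper; the operation format is an illustration, not Mem0's actual API.

```python
def extract_facts(call_llm, turn_text: str) -> list[str]:
    """Stage 1: distill salient facts, preferences, or events from one turn."""
    prompt = f"List the salient facts in this turn, one per line:\n{turn_text}"
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

def update_memory(call_llm, memory: list[str], new_facts: list[str]) -> list[str]:
    """Stage 2: ask the model for an ADD / UPDATE / DELETE / NONE decision per fact."""
    for fact in new_facts:
        listing = "\n".join(f"{i}: {m}" for i, m in enumerate(memory))
        decision = call_llm(
            f"Memory:\n{listing}\n\nNew fact: {fact}\n"
            "Reply with one of ADD, UPDATE:<index>, DELETE:<index>, NONE."
        ).strip()
        if decision == "ADD":
            memory.append(fact)
        elif decision.startswith("UPDATE:"):
            memory[int(decision.split(":")[1])] = fact
        elif decision.startswith("DELETE:"):
            memory.pop(int(decision.split(":")[1]))
    return memory

def answer(call_llm, retrieve, memory: list[str], question: str, k: int = 10) -> str:
    """Stage 3: prepend the top-k retrieved facts and generate the answer."""
    facts = retrieve(memory, question, k)        # e.g. embedding similarity over memory
    context = "\n".join(f"- {f}" for f in facts)
    return call_llm(f"Relevant memories:\n{context}\n\nQuestion: {question}\nAnswer:")
```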

This paradigm supports both LLMs and SLMs, with further specialization through LoRA-adapted models (MemLoRA), compression via expert distillation, and vision-language integration (MemLoRA-V). RAG-based baselines also appear, leveraging fine-grained “observation” retrieval for improved context selection.

Recent memory-augmented inference frameworks (e.g., ENGRAM-R) have advanced efficiency via compressed fact card retrieval and citation-enforced reasoning, achieving up to 85–88% reduction in prompt tokens and 72–75% reduction in reasoning tokens without significant accuracy loss (Patel et al., 17 Nov 2025).
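
The savings come from replacing raw history with compressed memory entries and requiring the answer to cite the entries it used; the snippet below is a speculative illustration of citation-enforced prompting under an assumed relevance function, not ENGRAM-R's actual interface.

```python
def build_cited_prompt(fact_cards: list[str], question: str, k: int, score) -> str:
    """Retrieve the k most relevant compressed fact cards and require citations.
    `score(card, question)` is an assumed relevance function, e.g. embedding similarity."""
    top = sorted(fact_cards, key=lambda card: score(card, question), reverse=True)[:k]
    numbered = "\n".join(f"[{i}] {card}" for i, card in enumerate(top, 1))
    return (
        f"Fact cards:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer concisely and cite the supporting cards as [n]."
    )
```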

5. Specialized Benchmark Extensions

Several derivatives and extensions of LoCoMo have emerged:

  • Temporal Reasoning (TReMu): A LoCoMo-based benchmark dedicated to complex temporal inference in multi-session dialogue. It introduces three core temporal QA tasks (Temporal Anchoring, Temporal Precedence, and Temporal Interval), formalized as multiple-choice questions with human-verified gold answers. TReMu benchmarks memory-augmented and neuro-symbolic (LLM+code) reasoning approaches, demonstrating that time-aware timeline summarization dramatically improves LLM performance, particularly when integrated with code-generated date computations (see the sketch after this list). Temporal tasks remain challenging: standard GPT-4o achieves only ~30% accuracy, while full neuro-symbolic approaches reach ~78%, with major errors persisting for ambiguous relative-time expressions and rare execution failures (Ge et al., 3 Feb 2025).
  • LoCoMo-V: A multimodal extension that introduces native visual question answering (VQA) tasks, requiring direct image-based reasoning rather than caption recall. Three new single-word VQA subtasks (object counting, color identification, unusual object detection) are automatically generated and validated using expert VLMs. MemLoRA-V, which combines SLMs with vision experts, yields VQA accuracy up to 81.3%, outperforming caption-based and non-adapter VLM approaches (Bini et al., 4 Dec 2025).
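
The code-generated date computations used in TReMu's neuro-symbolic pipeline can be pictured as the model emitting a small date-arithmetic program that is executed to resolve the answer; the example below is an illustrative stand-in (question, dates, and values are invented), not TReMu's actual pipeline.

```python
from datetime import date

# Illustrative stand-in for a model-emitted program resolving a Temporal Interval
# question such as "How long after adopting the dog did the speaker start a new job?"
adopted_dog = date(2023, 2, 11)   # anchored from a dated timeline summary
started_job = date(2023, 5, 2)    # anchored from a later session's summary

interval_days = (started_job - adopted_dog).days
print(f"{interval_days} days (~{interval_days // 7} weeks)")  # -> 80 days (~11 weeks)
```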

6. Empirical Results, Usability, and Limitations

Baseline performance underscores the inherent difficulty of LoCoMo:

  • Text QA: SLMs (2B) with MemLoRA adapters, distilled from 27B-120B models, can match or surpass much larger models (2B+MemLoRA J=47.2 vs. 27B J=39.1, 120B J=48.9) (Bini et al., 4 Dec 2025).
  • Temporal QA: CoT and timeline summarization boost accuracy to ~72%, with TReMu’s neuro-symbolic pipeline pushing this to ~78% (Ge et al., 3 Feb 2025).
  • Efficiency: Memory-augmented systems like ENGRAM-R achieve large token and latency reductions while maintaining high retrieval and answer quality (Patel et al., 17 Nov 2025).
  • Multimodal: Caption-only systems score ~23% on VQA, compared to >80% for MemLoRA-V (Bini et al., 4 Dec 2025).

Notable limitations include LoCoMo’s inability (in its base form) to evaluate native visual reasoning (addressed by LoCoMo-V) and the challenge of designing composite metrics that fully capture factual correctness in long-horizon inference. Human performance on core QA tasks remains substantially above current LLM or SLM baselines (human QA F1 ≈88%) (Maharana et al., 27 Feb 2024).

7. Impact and Future Directions

LoCoMo has become a primary testbed for evaluating memory systems and efficient long-context retrieval in conversational agents. The benchmark has directly prompted innovations in memory architectures, on-device model compression/distillation, and neuro-symbolic reasoning. Extensions such as TReMu and LoCoMo-V reflect the benchmark’s adaptability in exploring new frontiers, including complex temporal logic and native vision-language reasoning.

Planned directions include the integration of more challenging adversarial and open-domain queries, scalable VQA with real-world photo albums, direct temporal consistency constraints in memory architectures, dynamic per-query memory budgets, and improved event-graph scaffolding. Broader adoption is enabled by open-sourcing the dataset, evaluation scripts, and annotation protocols under a permissive license (Maharana et al., 27 Feb 2024, Bini et al., 4 Dec 2025).
