LOCOMO Benchmark: Long-Horizon Memory in LLMs

Updated 31 December 2025
  • LOCOMO Benchmark is a comprehensive evaluation suite that tests memory-augmented LLMs on extended multi-session dialogues and multimodal tasks.
  • It employs synthetic, human-verified dialogues with persona-based contexts and temporal event graphs to assess recall, multi-hop reasoning, and structured memory retrieval.
  • The benchmark drives advances in memory system architectures, demonstrating significant gains in judge accuracy, retrieval precision, and input token reduction.

The LOCOMO (LoCoMo) benchmark is a rigorous evaluation suite for testing AI agents, particularly memory-augmented LLMs, on very long-term, multi-session conversational tasks that require storing, retrieving, and reasoning over extended dialogue history. LOCOMO isolates the long-horizon challenges faced by conversational agents, including recall of factual details, reasoning across multiple turns and sessions, and robust retrieval from structured memory. The benchmark spans text-only and multimodal tasks, is built from synthetic but human-verified dialogue, and is central to recent advances in LLM-agent memory systems, retrieval frameworks, and efficient inference mechanisms (Maharana et al., 2024, Bini et al., 4 Dec 2025, Wang et al., 10 Jul 2025, Patel et al., 17 Nov 2025).

1. Dataset Construction and Structure

LOCOMO consists of synthetic, human-verified multi-session dialogues, each grounded in explicit persona descriptions and temporal event graphs. Two agent roles are established per dialogue, each with:

  • Persona summaries (expanded from seed mini-personae through LLM generation)
  • Temporal event graph G = (E, L): |E| ≈ 24 events spanning 6–12 months, causal/temporal links L, each event timestamped
  • Dialogue turns: an average of 600 turns per transcript (≈16,000–26,000 tokens), up to 35 sessions per conversation

Dialogues feature image-sharing and image-reaction routines. Machine generation is followed by human annotation to ensure long-range consistency, event alignment, and removal of contradictions (≈15% of turns and ≈19% of images are edited).
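This structure can be summarized with a small data model. The sketch below is a minimal, hypothetical Python representation (field names are illustrative assumptions, not the benchmark's release schema) showing how personas, the event graph, and multi-session turns relate:

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """One node of the temporal event graph (a timestamped life event)."""
    event_id: str
    description: str
    date: str                                            # ISO date within the 6-12 month span
    caused_by: list[str] = field(default_factory=list)   # causal/temporal links L

@dataclass
class Speaker:
    name: str
    persona_summary: str                                  # expanded from a seed mini-persona
    events: list[Event] = field(default_factory=list)

@dataclass
class Session:
    session_id: int
    date: str
    turns: list[dict]                                     # {"speaker": ..., "text": ..., "image": optional path}

@dataclass
class Conversation:
    speakers: tuple[Speaker, Speaker]
    sessions: list[Session]                               # up to ~35 sessions, ~600 turns overall
```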

Question Types:

Each dialogue is accompanied by ≈200 QA items spanning:

  • Single-hop fact lookup: Direct factual retrieval (“What city did X live in?”)
  • Multi-hop reasoning: Chains two or more facts (“Where did X move from four years ago?”)
  • Temporal ordering: Event ordering (“When did Y occur relative to Z?”)
  • Open-domain inference: Questions requiring background or abstract reasoning (“What might happen if…?”)
  • Adversarial/unanswerable: Intentionally ambiguous or underspecified (excluded in most recent system comparisons)

A subset of the benchmark is extended for multimodal reasoning with images and visual question answering (VQA).
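For concreteness, a single QA item can be pictured as a small record. The keys, IDs, and values below are illustrative assumptions rather than the released schema:

```python
# Hypothetical multi-hop QA item; keys, IDs, and values are illustrative only.
qa_item = {
    "question": "Where did speaker A move from four years ago?",
    "category": "multi-hop",   # single-hop | multi-hop | temporal | open-domain | adversarial
    "answer": "Chicago",
    "evidence": ["session_3:turn_12", "session_9:turn_4"],  # supporting dialogue turns
}
```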

2. Tasks and Evaluation Protocols

LOCOMO is structured into three primary memory-centric tasks:

  • Knowledge Extraction: Identify and output relevant facts (preferences, dates, events) from a dialogue in structured (JSON) form
  • Memory Update: Given new facts and the current memory store, decide for each whether to ADD, UPDATE, DELETE, or NONE
  • Memory-Augmented Generation: Answer user queries with access only to retrieved relevant memory snippets

The multi-task arrangement mirrors practical agent architectures such as Mem0 and MIRIX (Bini et al., 4 Dec 2025, Wang et al., 10 Jul 2025), where each incoming utterance may trigger memory updates and complex retrieval.
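The memory-update task reduces to a per-fact decision over four operations. A minimal sketch follows, assuming a hypothetical `classify_op` helper that stands in for the LLM call (real systems such as Mem0 or MIRIX implement this step with their own prompting and retrieval logic):

```python
from enum import Enum

class MemoryOp(Enum):
    ADD = "ADD"        # fact is new
    UPDATE = "UPDATE"  # fact supersedes an existing entry
    DELETE = "DELETE"  # fact invalidates an existing entry
    NONE = "NONE"      # fact is redundant or irrelevant

def apply_memory_update(memory: dict[str, str], fact_id: str, fact: str, classify_op) -> None:
    """Decide and apply one memory operation for a newly extracted fact.

    `classify_op(memory, fact)` is a hypothetical LLM call returning
    (MemoryOp, target_id); it is not part of any cited system's API.
    """
    op, target_id = classify_op(memory, fact)
    if op is MemoryOp.ADD:
        memory[fact_id] = fact
    elif op is MemoryOp.UPDATE:
        memory[target_id] = fact
    elif op is MemoryOp.DELETE:
        memory.pop(target_id, None)
    # MemoryOp.NONE: leave the store unchanged
```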

Text and Multimodal Extensions:

  • VQA tasks (LOCOMO-V) require reasoning over conversational images, not merely BLIP captions; queries probe counting, color identification, and object recognition.

3. Metrics and Baseline Systems

Evaluation spans both surface-level and semantic measures:

  • Token-overlap F1: precision and recall computed over generated vs. ground-truth tokens; the average F1 across N QA examples is:

\text{F1} = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \cdot \text{Precision}_i \cdot \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}

  • LLM-as-a-Judge Accuracy: for each QA item q, an external LLM judge marks the generated answer as correct or incorrect, giving an indicator δ(q) ∈ {0, 1}; aggregate accuracy over the question set Q is:

\text{Accuracy} = \frac{1}{|Q|} \sum_{q \in Q} \delta(q)

Some reports also aggregate four generation sub-metrics into a single composite score:

L = \frac{1}{4}\,(R + M + B + S)

  • Retrieval Recall@k (for retrieval-based “document” queries): fraction of queries for which the correct memory entry appears among the top-k retrieved (see the code sketch after this list).
  • VQA Accuracy: Percentage of exact-match answers on image-based questions.
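A minimal sketch of the two main automatic metrics, token-overlap F1 and Recall@k. Whitespace tokenization and the gold-ID bookkeeping are assumptions; published evaluations typically also apply SQuAD-style answer normalization:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and the ground truth."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def mean_f1(predictions: list[str], references: list[str]) -> float:
    """Average F1 over N QA examples, as in the formula above."""
    return sum(token_f1(p, r) for p, r in zip(predictions, references)) / len(references)

def recall_at_k(retrieved_ids: list[list[str]], gold_ids: list[str], k: int) -> float:
    """Fraction of queries whose gold memory entry appears in the top-k results."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(retrieved_ids, gold_ids))
    return hits / len(gold_ids)
```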

Baselines include full-context LLMs (GPT-3.5-16K, GPT-OSS-120B), retrieval-augmented generation (RAG), and specialized memory architectures (LangMem, Mem0, MIRIX). Recent approaches introduce adapter-based distillation (MemLoRA), privacy-preserving hybrid inference (Socratic CoT + encrypted search), and memory orchestration for efficiency (ENGRAM-R) (Bini et al., 4 Dec 2025, Bae et al., 19 Jun 2025, Patel et al., 17 Nov 2025).

4. Memory System Architectures and Agent Integration

LOCOMO serves as the standard evaluation for advanced memory agent systems. Examples:

  • MIRIX: Multi-agent memory system organizing six typed memory components (Core, Episodic, Semantic, Procedural, Resource, Knowledge Vault) (Wang et al., 10 Jul 2025). A Meta Memory Manager routes both updates and retrievals; a Chat Agent triggers topic generation, parallel retrieval, and answer grounding. Simplified update and retrieval steps (see the sketch after this list):
    • Update: route the new turn → parallel LLM summarization/extraction → de-duplication → memory modification
    • Retrieval: topic generation T, top-k retrieval per memory component, prompt composition with tagged entries, LLM answer grounded in retrieved snippets
  • ENGRAM-R (Patel et al., 17 Nov 2025): At inference, replaces dialogue context with a fixed set of “fact cards”; explicit citation control forces evidence-based reasoning; achieves substantial token and latency reduction while maintaining or improving accuracy.
  • MemLoRA (Bini et al., 4 Dec 2025): Distills memory adapters for knowledge extraction, memory update, and QA generation; small models with expert adapters outperform much larger models.
  • Socratic CoT + Encrypted Search (Bae et al., 19 Jun 2025): Decomposes QA into sub-queries using remote LLMs; encrypted semantic search is performed locally; final generation combines retrieved documents and chain-of-thought on trusted hardware.
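The sketch below illustrates the generic update-then-retrieve loop shared by these systems. It is an illustrative simplification, not MIRIX's actual implementation; `summarize_turn`, `embed`, and `generate_answer` stand in for LLM and embedding calls:

```python
import numpy as np

class TypedMemory:
    """One memory component (e.g. episodic or semantic) with vector retrieval."""
    def __init__(self, embed):
        self.embed = embed                 # callable: str -> np.ndarray
        self.entries: list[tuple[str, np.ndarray]] = []

    def add(self, text: str) -> None:
        self.entries.append((text, self.embed(text)))

    def top_k(self, query: str, k: int = 5) -> list[str]:
        q = self.embed(query)
        ranked = sorted(self.entries, key=lambda e: -float(np.dot(e[1], q)))
        return [text for text, _ in ranked[:k]]

def handle_turn(turn: str, memories: dict[str, TypedMemory], summarize_turn) -> None:
    """Update step: extract facts from the turn and route them to memory components."""
    for component, fact in summarize_turn(turn):   # e.g. ("episodic", "Went hiking in May")
        memories[component].add(fact)

def answer(query: str, memories: dict[str, TypedMemory], generate_answer, k: int = 5) -> str:
    """Retrieval step: gather top-k snippets per component, then ground the answer."""
    snippets = {name: mem.top_k(query, k) for name, mem in memories.items()}
    return generate_answer(query, snippets)
```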

5. Results, Impact, and Comparison Table

Key findings highlight the difficulty of long-horizon consistency for LLMs:

  • Human F1: 87.9; GPT-3.5-16K: 37.8; best RAG: 41.4 (Maharana et al., 2024)
  • MIRIX achieves 85.38% judge accuracy, +7.95% over the strongest baseline, and nearly matches the full-context result (87.52%) with only retrieved, structured memory (Wang et al., 10 Jul 2025)
  • ENGRAM-R achieves ≈89% input-token and ≈72% reasoning-token reduction, with minimal accuracy drop (Patel et al., 17 Nov 2025)
  • MemLoRA adapters yield up to 47.2% judge accuracy for a 2B SLM, roughly +90% over the base small model; MemLoRA-V boosts VQA accuracy from 22% to 81.3% (Bini et al., 4 Dec 2025)
  • The privacy-hybrid Socratic CoT approach improves F1 by 7.1 percentage points over a GPT-4o baseline (Bae et al., 19 Jun 2025)
  • TReMu enables explicit neuro-symbolic temporal reasoning, raising temporal QA accuracy from 29.83% (SP prompting) to 77.67% (Ge et al., 3 Feb 2025)
System          Overall QA Acc. (%)   Temporal (%)   VQA (%)     Input Token Reduction   Key Feature
GPT-3.5-16K     37.8                  <25            23.7–22.0   --                      Full-context baseline
MIRIX           85.38                 88.39          --          99.9%                   Multi-agent memory
ENGRAM-R        75.6                  69.2           --          88.4%                   Fact-card orchestration
MemLoRA-2B      47.2                  --             --          70–90%                  Adapter distillation
MemLoRA-V-2B    40.3                  --             81.3        70–90%                  Multimodal adapter

These metrics derive directly from system tables in source papers.

6. Extensions: Temporal, Multimodal, and Privacy-Sensitive Benchmarks

LOCOMO is highly extensible for demanding memory scenarios:

  • Temporal Reasoning (TReMu augmentation): LoCoMo-based multiple-choice QA on temporal anchoring, precedence, interval, and unanswerable categories. Timeline summarization and neuro-symbolic code execution improve event ordering and interval calculation (Ge et al., 3 Feb 2025); a toy sketch of the timeline arithmetic follows this list.
  • Multimodal LoCoMo (LOCOMO-V): Visual QA modules integrated with conversational history, adapter-based SVLMs for efficient deployment on small hardware (Bini et al., 4 Dec 2025).
  • Privacy-Preserving LoCoMo: Hybrid encrypted retrieval combined with remote CoT decomposition for secure long-horizon QA tasks (Bae et al., 19 Jun 2025).
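To illustrate the neuro-symbolic idea (an LLM emits timeline facts, and ordinary code performs the date arithmetic), here is a toy sketch; the timeline format and helper names are assumptions, not TReMu's actual interface:

```python
from datetime import date

# Hypothetical timeline extracted by an LLM from the dialogue history:
# event label -> date on which it occurred.
timeline = {
    "adopted a dog": date(2022, 5, 14),
    "moved to Denver": date(2022, 11, 2),
    "started a new job": date(2023, 1, 9),
}

def happened_before(a: str, b: str) -> bool:
    """Temporal precedence resolved by code rather than by the LLM."""
    return timeline[a] < timeline[b]

def months_between(a: str, b: str) -> int:
    """Approximate interval in months between two events."""
    d1, d2 = sorted((timeline[a], timeline[b]))
    return (d2.year - d1.year) * 12 + (d2.month - d1.month)

print(happened_before("adopted a dog", "moved to Denver"))     # True
print(months_between("moved to Denver", "started a new job"))  # 2
```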

7. Controversies, Limitations, and Directions

  • Synthetic Data: LOCOMO dialogues are machine-generated and crowd-edited; transfer to fully real-world user logs remains an open question (Bini et al., 4 Dec 2025, Maharana et al., 2024).
  • Memory Model Limits: Even large (27B–120B) LLMs rarely achieve more than half of human performance on long-horizon memory tests without dedicated memory systems; careful memory structuring and adapter specialization substantially close the gap.
  • Adversarial/Unanswerable Exclusion: Many evaluations exclude adversarial items for comparability, but this leaves robustness to ambiguous or unanswerable queries largely unprobed.
  • Evaluation Bias: Heavy reliance on LLM-as-a-Judge metrics may overlook nuanced errors; sample human evaluation is recommended (Patel et al., 17 Nov 2025).
  • Visual Reasoning: Caption-based approaches poorly model fine-grained details; native SVLM modules are necessary for realistic VQA.

A plausible implication is that advances in memory orchestration, adapter specialization, and retrieval precision are essential for practical long-term conversational agents, as demonstrated by system-level gains on the LOCOMO benchmark suite.
