LOCOMO: Long-Context Memory Benchmark

Updated 3 July 2026

The LOCOMO benchmark is a synthetic evaluation suite that measures LLM long-term conversational memory through multi-session dialogues and multi-modal content.
It uses structured event graphs and human-in-the-loop verification to assess temporal reasoning, event summarization, and retrieval performance.
Quantitative analyses reveal cost-performance tradeoffs and scalability variations across memory architectures, guiding future advances in persistent conversational AI.

The LOCOMO (Long-Context Memory) benchmark is a synthetic evaluation suite and dataset designed to assess the very long-term conversational memory capabilities of LLMs and memory-augmented agents. LOCOMO is characterized by multi-session, long-horizon dialogues with complex event structures and multi-modal content, enabling rigorous evaluation of memory retention, temporal reasoning, and scalability of contemporary LLM architectures and memory systems. The benchmark plays a central role in both academic research and industrial deployment analysis, serving as a stress test for retrieval, memory composition, and cost-performance tradeoffs in persistent conversational AI.

1. Benchmark Design and Dialogue Generation

LOCOMO was introduced to fill a methodological gap in the assessment of long-term memory in dialogue agents (Maharana et al., 2024). The core of the benchmark is a set of synthetic conversations that are machine-generated and then human-curated:

Agent Architecture: Each dialogue features two agent instances (ℒ₁, ℒ₂) built around LLMs (e.g., gpt-3.5-turbo), each initialized with expanded persona profiles and a temporal event graph (G) outlining up to 25 life events across 6–12 months.
Reflect-and-Respond Memory System: During each session, agents utilize short-term (session summaries), long-term (database of observed facts), and persona/event graph memory for contextually grounded generation.
Multi-Modal Integration: Agents can generate, share, caption (via BLIP-2), and react to images, further grounding dialogue in both text and visual channels.
Human-in-the-Loop Verification: Annotators enforce long-range persona and event coherence, edit for event-graph alignment, and remove irrelevant or inconsistent content. Around 15% of dialogue turns and 19% of images are modified or pruned for integrity.
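
The three memory stores described above can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions: the class names (`Event`, `AgentMemory`), the field names, and the word-overlap ranking are hypothetical and are not taken from the LOCOMO generation pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One node of the temporal event graph G (field names are illustrative)."""
    event_id: str
    date: str                      # ISO date string, e.g. "2023-03-14"
    description: str
    caused_by: list[str] = field(default_factory=list)  # ids of parent events


@dataclass
class AgentMemory:
    """Hypothetical container for persona, event graph, and the two memory stores."""
    persona: str
    event_graph: list[Event]
    session_summaries: list[str] = field(default_factory=list)  # short-term memory
    observations: list[str] = field(default_factory=list)       # long-term memory

    def end_session(self, summary: str, new_observations: list[str]) -> None:
        """Store one session summary and the facts observed during that session."""
        self.session_summaries.append(summary)
        self.observations.extend(new_observations)

    def build_context(self, query: str, top_k: int = 2) -> str:
        """Assemble a prompt context: persona, latest summary, top-k observations.

        Observations are ranked by naive word overlap with the query; this is an
        assumed stand-in for whatever retriever a real system would use.
        """
        query_words = set(query.lower().split())
        ranked = sorted(
            self.observations,
            key=lambda obs: len(query_words & set(obs.lower().split())),
            reverse=True,
        )
        latest = self.session_summaries[-1] if self.session_summaries else ""
        return "\n".join([self.persona, latest, *ranked[:top_k]])
```

A generation step would pass the string returned by `build_context` to the LLM together with the current turn; the event graph is kept alongside so that a verifier can check the dialogue against it.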

The resulting dataset comprises 50 dialogues, each with an average of 19.3 chat sessions, 304.9 turns (≈15.8/session), and 9,209 tokens per conversation, with ≈32 images per conversation for multi-modal task support.

2. Task Taxonomy and Evaluation Metrics

LOCOMO is structured around three primary evaluation tasks, each reflecting a different facet of conversational memory:

2.1 Question Answering (QA) over Long Dialogues

Task: Answer factual questions based on complete multi-session dialogue history (≈300 turns, ≈9K tokens).
Categories:
- Single-hop
- Multi-hop
- Temporal
- Open-domain (including world knowledge/persona context)
- Adversarial (unanswerable, requiring safe refusal)
Metric: F1-score at the token level:

$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

with Precision and Recall computed on answer tokens relative to ground-truth.
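
As a minimal sketch of the token-level F1 above, the following function counts shared tokens as a multiset intersection. The normalization in `tokenize` (lowercasing, splitting on non-alphanumeric characters) is an assumption; the official scorer may normalize differently.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphanumeric runs (an assumed normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = tokenize(prediction)
    gold_tokens = tokenize(ground_truth)
    if not pred_tokens or not gold_tokens:
        # Degenerate case: score 1.0 only when both sides are empty.
        return float(pred_tokens == gold_tokens)
    # Each shared token counts at most min(count in prediction, count in gold) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, a verbose but correct prediction such as "in Paris last May" against the reference "Paris" has precision 1/4 and recall 1, which is why token-level F1 penalizes long answers even when they contain the right fact.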

2.2 Event Summarization

Task: Summarize and order all major events as per the ground-truth event graph G.
Metrics:
- ROUGE-1, ROUGE-2, ROUGE-L (the last being a β-weighted F-measure over the longest common subsequence).
- FactScore: Fact extraction and F1 matching on atomic event units, prioritizing factual and causal alignment over n-gram overlap.
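
The β-weighted ROUGE-L score mentioned above can be sketched as follows, using the standard definition $F_\beta = \frac{(1+\beta^2)\,P\,R}{R + \beta^2 P}$ with $P$ and $R$ computed from the longest common subsequence (LCS). Whitespace tokenization and the default `beta=1.0` are assumptions made for illustration, not settings confirmed by the benchmark.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbor.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """ROUGE-L F-measure between a candidate summary and a single reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

Because the LCS respects word order, ROUGE-L rewards summaries that list events in the same sequence as the reference, which is part of why it is used here as a proxy for temporal consistency.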

2.3 Multi-Modal Dialogue Generation

Task: Conditioned on prior context (text and images), generate the next reply with optional image grounding.
Metrics: BLEU-1/2, ROUGE-L, MM-Relevance (semantic alignment on persona, event, and visual content), BERTScore.
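
To make the BLEU-1/2 figures concrete, here is a minimal single-reference sketch: clipped n-gram precisions up to order `max_n`, combined by a geometric mean and multiplied by the brevity penalty. It assumes whitespace tokenization and applies no smoothing, so it will not match a library implementation on every input.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 1) -> float:
    """Cumulative BLEU up to order max_n against one reference (no smoothing)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        total = sum(cand_counts.values())
        # Clipping: an n-gram is credited at most as often as it occurs in the reference.
        clipped = sum((cand_counts & ngrams(ref, n)).values())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty discourages candidates shorter than the reference.
    brevity_penalty = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

Calling `bleu(reply, gold_reply, max_n=1)` and `bleu(reply, gold_reply, max_n=2)` gives the two values usually reported as BLEU-1 and BLEU-2.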

These metrics interrogate memory retention (QA F1), temporal/causal consistency (FactScore, ROUGE-L), grounding/persona fidelity (MM-Relevance, BLEU/ROUGE), and adversarial robustness.

3. System Baselines and Quantitative Results

Results on LOCOMO have been reported for a range of system types (Maharana et al., 2024, Pollertlam et al., 5 Mar 2026, Tiwari et al., 31 Mar 2026, Gadzhiev et al., 13 Apr 2026, Terranova et al., 27 Oct 2025). Key results include:

System	QA F1 / Accuracy (%)	Event Summarization (FactScore F1)	MM-Dialogue (BLEU-1/2)	Adversarial Robustness
Human	87.9	N/A	N/A	89.4 F1
Mistral-7B	13.9	N/A	N/A	~2
Llama-2-Chat-70B	17.9	N/A	N/A	N/A
GPT-3.5-turbo (4K)	22.4	45.9	N/A	~2
GPT-4-turbo (4K)	32.1	45.1	N/A	N/A
GPT-3.5-turbo-16K	37.8	39.9	N/A	~2
RAG (GPT-3.5-16K + obs)	41.4	N/A	N/A	N/A
Mem0 Memory System	57.68	N/A	N/A	Not reported
GPT-5-mini (long context)	92.85	N/A	N/A	Not reported
Synthius-Mem	94.37	N/A	N/A	99.55
MiniGPT-5 (+obs, BLEU-1/2)	N/A	N/A	59.7 / 35.1	N/A

Synthius-Mem reports the highest accuracy in this comparison (94.37%) and near-perfect adversarial robustness (99.55%), exceeding both the human reference score (87.9 F1) and all earlier systems (Gadzhiev et al., 13 Apr 2026). Note that the table mixes token-level F1 and accuracy, so scores reported under different metrics are not strictly comparable. Retrieval-based and structured agentic memory systems are reported to outperform generic RAG and full-context approaches in cost and accuracy as context length and query volume grow (Terranova et al., 27 Oct 2025, Pollertlam et al., 5 Mar 2026).

4. Memory Architectures and Cost-Performance Tradeoffs

LOCOMO is a testbed for a wide range of memory system designs. The principal approaches are:

Full-context LLMs: Directly ingest the entire dialogue. They achieve high recall when the context fits in the LLM window, but compute and token cost scale linearly with context length ($O(L)$ per query, even under caching). GPT-5-mini approaches 93% accuracy at 16–20K tokens; as context grows further, accuracy plateaus while cost keeps rising (Pollertlam et al., 5 Mar 2026).
Fact-Based Memory (e.g., Mem0): Extract and index atomic facts via embedding or flat-typed parsing (compression ratio ≈35:1). Retain fixed per-query cost, but lose multi-hop and temporal cues, yielding ≈58% accuracy (Pollertlam et al., 5 Mar 2026).
Structured Agentic Systems (e.g., Synthius-Mem): Extract, consolidate, and index persona/conversation facts by semantic domain (biography, experiences, social, etc.), retrieve by CategoryRAG or similar routers. Achieve >94% accuracy and superior adversarial robustness (99.55%), with ~5x lower token cost than replay (Gadzhiev et al., 13 Apr 2026).
Multi-Layered Memory: Hierarchical frameworks (working, episodic, semantic) with adaptive retrieval and regularization (Tiwari et al., 31 Mar 2026). These address cross-session semantic drift and context growth through attention-based gating and weighted consolidation. On LOCOMO, the multi-layered memory framework (MLMF) reaches 0.618 F1, improves multi-hop F1, and reduces false-memory rates under constrained context budgets.
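
The idea of indexing facts by semantic domain and routing each query to the relevant domains can be illustrated with a toy example. Everything here is hypothetical: `DomainMemory`, the `DOMAIN_CUES` keyword table, and the word-overlap threshold are stand-ins chosen for readability, and none of it reflects the actual Synthius-Mem or CategoryRAG implementation.

```python
from collections import defaultdict

# Hypothetical keyword cues per domain; a real router would more likely
# use a trained classifier or an LLM call.
DOMAIN_CUES = {
    "biography": {"born", "age", "hometown", "name"},
    "experiences": {"trip", "visited", "adopted", "started"},
    "social": {"friend", "sister", "brother", "colleague"},
}


def words_of(text: str) -> set[str]:
    """Lowercased word set with question marks stripped."""
    return set(text.lower().replace("?", "").split())


class DomainMemory:
    """Toy fact store partitioned by semantic domain."""

    def __init__(self) -> None:
        self._facts: dict[str, list[str]] = defaultdict(list)

    def add(self, domain: str, fact: str) -> None:
        self._facts[domain].append(fact)

    def route(self, query: str) -> list[str]:
        """Return the domains whose cue words appear in the query."""
        query_words = words_of(query)
        return [d for d, cues in DOMAIN_CUES.items() if query_words & cues]

    def retrieve(self, query: str, min_overlap: int = 2) -> list[str]:
        """Search only the routed domains; an empty result signals 'refuse to answer'."""
        query_words = words_of(query)
        hits = []
        for domain in self.route(query):
            for fact in self._facts[domain]:
                if len(query_words & words_of(fact)) >= min_overlap:
                    hits.append(fact)
        return hits
```

The design point is that the per-query cost depends on the size of the routed domains rather than on the full dialogue length, and that an empty retrieval result gives the agent an explicit signal to decline an unsupported question.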

Cost models indicate that at a context length of L ≈ 17K tokens, memory-based approaches break even with full-context replay after N ≈ 15 queries and scale more favorably as L increases (Pollertlam et al., 5 Mar 2026), while the highest-performing semantic/agentic systems further improve efficiency with minimal loss in recall.
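
The break-even reasoning can be written as a simple token-count model. In this sketch, full-context replay costs `cached_fraction * context_tokens` per query, while a memory system pays a one-time `ingest_tokens` cost plus `memory_tokens_per_query` per query. The function name, the parameters, and the example numbers used with it are assumptions for illustration; they are not the cost model or the figures of the cited paper.

```python
import math
from typing import Optional


def break_even_queries(
    context_tokens: float,
    ingest_tokens: float,
    memory_tokens_per_query: float,
    cached_fraction: float = 1.0,
) -> Optional[int]:
    """Smallest query count N at which the memory system is strictly cheaper.

    Full context:  N * cached_fraction * context_tokens
    Memory system: ingest_tokens + N * memory_tokens_per_query
    Returns None when the memory system never becomes cheaper.
    """
    full_per_query = cached_fraction * context_tokens
    saving_per_query = full_per_query - memory_tokens_per_query
    if saving_per_query <= 0:
        return None
    # Need N * saving_per_query > ingest_tokens.
    return math.floor(ingest_tokens / saving_per_query) + 1
```

Under this model the break-even point falls as the context length grows, since the per-query saving increases while the per-query memory cost stays fixed. A discount on cached prompt tokens (a smaller `cached_fraction`) pushes the break-even point later.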

5. Analysis of Capabilities and Limitations

LOCOMO reveals sharp limitations of current LLM-based and RAG memory agents (Maharana et al., 2024, Terranova et al., 27 Oct 2025, Li et al., 11 Feb 2026):

Long-term Factual and Temporal Memory: All tested models exhibit steep performance decay on facts and events located deep in the dialogue history (especially multi-hop and temporal questions), with temporal reasoning remaining the most error-prone category.
Adversarial and Hallucination Resistance: Only explicitly structured knowledge extraction (Synthius-Mem) reliably refuses to answer on unsupported premises and maintains near-zero hallucination rates (Gadzhiev et al., 13 Apr 2026).
Summarization and Event Graph Alignment: Even leading LLMs underperform on recall of atomic events and causal graph structure, with larger window size alone often yielding diminished focus or increased hallucination.
Multi-Modality: Textualized image captions are effective for text-only variants, but robust visual grounding across multi-session timelines remains unsolved (Maharana et al., 2024).
Efficiency and Cost: Memory-augmented approaches reduce average per-query token cost by >90% relative to brute-force context replay, though increased system complexity must be managed with respect to agentic control, index maintenance, and update semantics (Terranova et al., 27 Oct 2025).

6. Extensions and Critiques: Cognitive Memory, Evaluation Scope, and Benchmark Limitations

LOCOMO-Plus (Li et al., 11 Feb 2026) extends LOCOMO to "Level-2 Cognitive Memory," measuring constraint-consistent responses under cue–trigger semantic disconnect, where a model must recall and obey implicit conversational constraints. All current methods, including RAG, A-Mem, and large LLMs, show a 30–45 point gap when moving from factual to cognitive (constraint-consistency) scoring, indicating a fundamental failure to retain and apply latent or non-verbalizable user state.

ATANT v1.1 (Tanguturi, 13 Apr 2026) argues, through structural analysis, that LOCOMO covers at most 2 of the 7 properties required for genuine conversational continuity. Key defects include an "empty-gold" scoring bug (correct refusals in adversarial cases are scored as incorrect) and a string-matching bias against paraphrased answers. Consequently, LOCOMO is a high-value stress test for long-context retrieval but does not suffice as a continuity or update-awareness benchmark.
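
The "empty-gold" defect is easy to reproduce with any overlap-based scorer: when the reference answer is empty, no prediction can share a token with it, so even a correct refusal scores zero. The sketch below illustrates the failure mode and one possible workaround. It is not the LOCOMO scoring code, and the `REFUSAL_MARKERS` list is a hypothetical heuristic.

```python
from collections import Counter

# Hypothetical phrases treated as a refusal; a real evaluator might use an LLM judge.
REFUSAL_MARKERS = ("not mentioned", "no information", "cannot be answered")


def naive_f1(prediction: str, gold: str) -> float:
    """Overlap-based F1 with no special handling of empty references."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0  # an empty gold answer always ends up here
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def refusal_aware_score(prediction: str, gold: str) -> float:
    """Score unanswerable items by whether the model refused, others by F1."""
    if not gold.strip():
        return float(any(marker in prediction.lower() for marker in REFUSAL_MARKERS))
    return naive_f1(prediction, gold)
```

With an empty reference, `naive_f1` gives a refusal the same score as a hallucinated answer, whereas `refusal_aware_score` separates the two cases.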

7. Significance and Future Directions

LOCOMO and its derivatives have catalyzed a generation of research on memory scaling for persistent LLM agents, enabling the development and comparative evaluation of architectural strategies ranging from RAG and compression to multi-layered abstraction and neuroscience-inspired structured memory (Maharana et al., 2024, Gadzhiev et al., 13 Apr 2026, Tiwari et al., 31 Mar 2026). However, unresolved challenges remain, including:

Robustness to evolving user profiles and fact updating
Holistic persona and event graph completion
True cognitive/constraint memory evaluation
Integration of multi-modal (visual, event, psychometric) memory with operational agentic reasoning

Methodologically, future benchmarks will require extended coverage of latent state, update supersession, model-independence, and compositional, cross-domain grounding. The field is moving towards LLM-as-judge protocols, adversarial reliability measurement, and standardized multi-domain testbeds, as exemplified by ATANT and LOCOMO-Plus (Tanguturi, 13 Apr 2026, Li et al., 11 Feb 2026). LOCOMO stands as a foundational reference point for both stress testing and guiding the evolution of long-context memory research in LLM-based conversational agents.