EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval

Published 23 Apr 2026 in cs.CL and cs.AI | (2604.21229v1)

Abstract: LLM assistants are increasingly expected to retain and reason over information accumulated across many sessions. We introduce EngramaBench, a benchmark for long-term conversational memory built around five personas, one hundred multi-session conversations, and one hundred fifty queries spanning factual recall, cross-space integration, temporal reasoning, adversarial abstention, and emergent synthesis. We evaluate Engrama, a graph-structured memory system, against GPT-4o full-context prompting and Mem0, an open-source vector-retrieval memory system. All three use the same answering model (GPT-4o), isolating the effect of memory architecture. GPT-4o full-context achieves the highest composite score (0.6186), while Engrama scores 0.5367 globally but is the only system to score higher than full-context prompting on cross-space reasoning (0.6532 vs. 0.6291, n=30). Mem0 is cheapest but substantially weaker (0.4809). Ablations reveal that the components driving Engrama's cross-space advantage trade off against global composite score, exposing a systems-level tension between structured memory specialization and aggregate optimization.


Authors (1)

Julian Acuna

Summary

The paper introduces EngramaBench, a benchmark for long-term conversational memory that tests factual recall, cross-space integration, temporal reasoning, adversarial abstention, and emergent synthesis.
It runs a controlled evaluation of three memory architectures (GPT-4o full-context, Mem0, and Engrama) that share one answering model, and reports composite and per-family scores, including a cross-space slice.
Full-context prompting leads on the overall composite, while Engrama's graph-based memory scores higher on cross-space integration at lower query-time cost.

Authoritative Summary of "EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval" (2604.21229)

Benchmark Overview and Problem Formulation

The paper introduces EngramaBench, a specialized benchmark for rigorous evaluation of long-term conversational memory systems. EngramaBench is designed to stress-test a range of memory behaviors—factual recall, cross-space integration, temporal reasoning, adversarial abstention, and emergent synthesis—across five canonical personas and 100 multi-session conversations, yielding 150 queries. Each query is posed post hoc, challenging systems with delayed recall rather than next-turn prediction, thereby simulating realistic AI assistant deployment.

The benchmark's structure partitions each persona’s life into five recurring semantic spaces, explicitly supporting queries that require combining information across these domains. Distinct task families—single-space recall, cross-space integration, temporal reasoning, adversarial abstention (unsupported queries), and synthesis—are defined to disentangle memory abilities often conflated in prior benchmarks.

Scoring is organized to balance factual recall, adversarial abstention, and synthesis, with composite metrics encapsulating aggregate performance. Evidence annotations at the conversation and message level provide deterministic auditability.
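To make the idea of message-level evidence annotation concrete, the following is a minimal sketch of how a query record and an evidence-coverage check could be represented. The schema is an assumption made for illustration: the class names, field names, and the `evidence_coverage` helper are hypothetical and are not taken from the paper or its released data.

```python
from __future__ import annotations

from dataclasses import dataclass


# Hypothetical schema: every name below is illustrative, not the benchmark's own.
@dataclass(frozen=True)
class EvidenceRef:
    conversation_id: str  # which conversation holds the supporting message
    message_index: int    # position of that message within the conversation


@dataclass(frozen=True)
class BenchmarkQuery:
    query_id: str
    family: str                        # e.g. "cross_space", "temporal", "abstention"
    question: str
    evidence: tuple[EvidenceRef, ...]  # empty for unsupported (abstention) queries


def evidence_coverage(query: BenchmarkQuery, retrieved: set[EvidenceRef]) -> float | None:
    """Fraction of annotated evidence messages present in the retrieved set.

    Returns None when the query has no annotated evidence, as would be the
    case for an adversarial-abstention query.
    """
    if not query.evidence:
        return None
    hits = sum(1 for ref in query.evidence if ref in retrieved)
    return hits / len(query.evidence)


supported = BenchmarkQuery(
    "q1", "cross_space", "placeholder question",
    (EvidenceRef("c1", 3), EvidenceRef("c7", 12)),
)
unsupported = BenchmarkQuery("q2", "abstention", "placeholder question", ())

print(evidence_coverage(supported, {EvidenceRef("c1", 3)}))
print(evidence_coverage(unsupported, set()))
```

Because each evidence reference is an exact (conversation, message) pair, a check like this is deterministic: the same retrieved set always yields the same coverage value.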

Memory System Architectures and Controlled Evaluation

Three memory architectures are evaluated:

GPT-4o full-context prompting: All historical conversations are concatenated and injected directly into the prompt, leveraging the model’s extended context window but maintaining unstructured transcript inclusion.
Mem0 (Chhikara et al., 28 Apr 2025): An open-source vector-retrieval system seeded with the canonical histories. It retrieves by text embedding through ChromaDB and is run in its off-the-shelf configuration as a flat-memory baseline.
Engrama: A proprietary graph-structured memory system, organizing conversational traces into a dynamic graph mapping entities, semantic spaces, temporal events, and cross-domain associations. Retrieval is query-conditioned, with entity-first activation and structured composition of relevant memory regions for answering.

All architectures interface with the same answering model (GPT-4o), isolating the impact of memory mechanisms from purely modeling effects.
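The controlled setup described above can be illustrated with a toy harness in which three retrieval strategies feed one shared prompt builder. This is only a sketch under stated assumptions: the history, the fixed entity vocabulary, the token-overlap ranking (a stand-in for embedding similarity), and the one-hop expansion rule are all invented for illustration, and none of it reflects Engrama's proprietary implementation or Mem0's actual API. No model is called; `build_prompt` marks the point where the shared answering model would be invoked.

```python
import re

# Toy history of (semantic space, message) pairs. All names and contents are invented.
HISTORY = [
    ("work", "Maya leads the Orion launch in March."),
    ("family", "Maya promised to visit her sister Ines in March."),
    ("health", "The physio recommended swimming twice a week."),
    ("finance", "The Orion bonus will cover the trip to Ines."),
]
KNOWN_ENTITIES = {"Maya", "Orion", "Ines", "March"}  # hypothetical entity vocabulary


def entities(text):
    """Crude stand-in for entity extraction: match against a fixed vocabulary."""
    return KNOWN_ENTITIES & set(re.findall(r"[A-Za-z]+", text))


def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def full_context(query):
    """Return every stored message, as in transcript concatenation."""
    return list(HISTORY)


def flat_top_k(query, k=2):
    """Rank messages by token overlap with the query and keep the top k."""
    ranked = sorted(
        HISTORY,
        key=lambda item: len(tokens(query) & tokens(item[1])),
        reverse=True,
    )
    return ranked[:k]


def graph_entity_first(query, hops=1):
    """Activate messages that mention the query's entities, then expand by `hops`."""
    index = {}  # entity -> indices of messages that mention it
    for i, (_, msg) in enumerate(HISTORY):
        for ent in entities(msg):
            index.setdefault(ent, set()).add(i)
    active, frontier = set(), entities(query)
    for _ in range(hops + 1):
        hit = set().union(*(index.get(e, set()) for e in frontier))
        new = hit - active
        active |= new
        frontier = set().union(*(entities(HISTORY[i][1]) for i in new))
    return [HISTORY[i] for i in sorted(active)]


def build_prompt(query, context):
    """Shared prompt builder: the only thing that varies across systems is `context`."""
    lines = [f"[{space}] {msg}" for space, msg in context]
    return "Context:\n" + "\n".join(lines) + f"\n\nQuestion: {query}"


QUERY = "How does Orion affect the visit to Ines?"
for retrieve in (full_context, flat_top_k, graph_entity_first):
    context = retrieve(QUERY)
    print(retrieve.__name__, [space for space, _ in context])
```

In this toy case the entity-first strategy gathers messages from three spaces while leaving out the unrelated health message, which is the kind of behavior a cross-space query is meant to probe. Holding `build_prompt` fixed mirrors the design choice of keeping the answering model constant so that differences are attributable to memory.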

Key Results and Comparative Analysis

GPT-4o full-context achieves the highest composite score (0.6186), followed by Engrama at 0.5367 and Mem0 at 0.4809. However, Engrama is the only system to score higher than full-context prompting on cross-space integration (0.6532 vs. 0.6291, n=30). The margin is small, but it suggests an advantage for graph-structured memory on compositional reasoning tasks that require aggregating evidence from multiple domains.

Mem0 is the cheapest system ($0.36 per 150 queries) but is substantially weaker in factual recall and temporal reasoning, suggesting that flat vector retrieval alone is insufficient for these tasks. Engrama is competitive at reduced query-time cost ($0.67 per 150 queries): it reaches 86.8% of GPT-4o full-context's composite score at ~20% of the serving cost, while also leading on cross-space reasoning.
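The relative-quality figure can be reproduced from the reported composite scores. The short sketch below uses only numbers quoted in this summary; the dictionary and function names are illustrative. The full-context serving cost is not stated here, so the ~20% cost ratio is not recomputed.

```python
# Composite scores and query-time costs as quoted in this summary.
COMPOSITE = {"gpt4o_full_context": 0.6186, "engrama": 0.5367, "mem0": 0.4809}
COST_USD_PER_150 = {"engrama": 0.67, "mem0": 0.36}
N_QUERIES = 150


def relative_composite(system, baseline="gpt4o_full_context"):
    """Composite score of `system` as a fraction of the baseline's composite."""
    return COMPOSITE[system] / COMPOSITE[baseline]


def cost_per_query(system):
    """Average query-time cost in USD per query."""
    return COST_USD_PER_150[system] / N_QUERIES


for name in ("engrama", "mem0"):
    print(
        f"{name}: {relative_composite(name):.1%} of full-context composite, "
        f"${cost_per_query(name):.4f} per query"
    )
# engrama: 86.8% of full-context composite, $0.0045 per query
# mem0: 77.7% of full-context composite, $0.0024 per query
```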

Ablation studies show that components driving Engrama’s cross-space strength (entity-first activation, typed answer layers) trade off against global composite score, exposing a systems-level tension between specialized memory behaviors and aggregate task optimization. Structured retrieval mechanisms offer targeted benefits but do not uniformly translate to overall score improvements.

Theoretical Implications and Architectural Insights

EngramaBench separates conversational memory abilities into explicit task families, extending prior work (e.g., LongMemEval (Wu et al., 2024), LoCoMo (Maharana et al., 2024), MemGPT (Packer et al., 2023), GraphRAG (Edge et al., 2024), HippoRAG (Gutiérrez et al., 2024)) by treating cross-space reasoning as its own benchmark slice and by foregrounding compositional retrieval. In these results, flat vector retrieval trails both alternatives on factual and compositional queries, and full transcript inclusion yields the best aggregate score without being the best on cross-space reasoning.

Graph-based memory surfaces enable activation over relational neighborhoods, supporting answer synthesis from distributed evidence across sessions and domains. Ablations confirm that entity-centric retrieval and typed-answer composition are not universal quality multipliers; their utility manifests specifically in cross-space query contexts.

Empirically, temporal reasoning remains an unsolved challenge, with all systems—including full-context prompting—underperforming (maximum 0.3902), signaling unaddressed architectural limitations for long-horizon temporal dynamics.

Practical Impacts and Future Directions

The practical implications are twofold: (1) graph-structured memory architectures have distinct advantages for compositional cross-domain conversational queries; (2) memory system design must reconcile specialized behaviors with global task optimization, as naive addition of structural retrieval levers incurs tradeoffs.

The results delineate a cost–quality trade-off: Engrama's graph memory scores higher on cross-space queries at lower serving cost, while full-context prompting remains ahead on simple factual recall and adversarial abstention.

Future research directions entail:

Benchmark extension with finer evidence provenance and robust synthesis evaluation.
Architecture refinement to ensure that structural strengths consistently produce global gains across task families.
Introduction of more naturalistic conversational baselines to chart the boundary of brute-force inclusion versus structurally effective long-term memory.

Limitations

EngramaBench utilizes synthetic, blueprint-constrained conversations for benchmarking, ensuring interpretability but limiting ecological validity relative to production settings. Evidence recovery is annotated rather than audited at chunk-level granularity for all systems. The emergent synthesis scorer is token-level rather than judgment-based, demanding caution in interpreting results. Sample sizes per family (n=30) restrict statistical power for per-metric pairwise comparisons; directional trends should be confirmed with larger datasets.
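A rough calculation shows why n=30 limits pairwise comparisons. The sketch below uses a normal approximation for the mean of paired per-query score differences. The standard deviation of 0.30 is a purely hypothetical value chosen for illustration, since per-query variances are not given in this summary; only the cross-space scores (0.6532 and 0.6291) come from the text.

```python
import math


def ci_half_width(sd, n, z=1.96):
    """Half-width of an approximate 95% confidence interval for a mean difference."""
    return z * sd / math.sqrt(n)


def n_for_half_width(sd, target, z=1.96):
    """Smallest n whose approximate 95% half-width is at most `target`."""
    return math.ceil((z * sd / target) ** 2)


observed_gap = 0.6532 - 0.6291  # cross-space scores quoted in this summary
ASSUMED_SD = 0.30               # hypothetical per-query SD of paired differences

print(f"observed gap:         {observed_gap:.4f}")
print(f"half-width at n=30:   {ci_half_width(ASSUMED_SD, 30):.4f}")
print(f"n to resolve the gap: {n_for_half_width(ASSUMED_SD, observed_gap)}")
```

Under this assumed spread, the interval half-width at n=30 is several times larger than the observed cross-space gap, which is consistent with treating that gap as a directional trend rather than an established difference.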

Technical details of Engrama’s proprietary architecture are disclosed only to the extent needed for reproducibility and comparative analysis.

Conclusion

The paper delivers a nuanced empirical assessment of memory architectures in long-term conversational assistant contexts. Structured graph memory yields distinct compositional benefits on cross-space queries, while full-context prompting leads in composite metrics and simple recall. The architectural tension between specialized structural memory and aggregate performance is elucidated with careful ablation. EngramaBench establishes a reproducible regime for evaluating these contrasts and opens avenues for memory architecture optimization and broader benchmarking.
