StructMem: Structured Memory for Long-Horizon Behavior in LLMs

Published 23 Apr 2026 in cs.CL, cs.AI, cs.IR, cs.LG, and cs.MA | (2604.21748v1)

Abstract: Long-term conversational agents need memory systems that capture relationships between events, not merely isolated facts, to support temporal reasoning and multi-hop question answering. Current approaches face a fundamental trade-off: flat memory is efficient but fails to model relational structure, while graph-based memory enables structured reasoning at the cost of expensive and fragile construction. To address these issues, we propose StructMem, a structure-enriched hierarchical memory framework that preserves event-level bindings and induces cross-event connections. By temporally anchoring dual perspectives and performing periodic semantic consolidation, StructMem improves temporal reasoning and multi-hop performance on LoCoMo, while substantially reducing token usage, API calls, and runtime compared to prior memory systems. See https://github.com/zjunlp/LightMem.


Authors (8)

Summary

The paper presents a novel hierarchical memory using event-level binding and cross-event consolidation to improve temporal and multi-hop reasoning in LLMs.
It achieves superior performance on long-horizon queries with efficiency gains demonstrated by reduced token usage and API calls compared to baseline methods.
The approach synthesizes relational abstractions effectively, minimizing hallucination and error propagation while enabling robust, temporally-grounded responses.

Structured Memory for Long-Horizon Behavior in LLMs: An Expert Technical Analysis

Motivation and Positioning: The Challenge of Long-Term Reasoning

Long-horizon conversational agents require memory architectures that support not only factual recall but also reasoning over temporally extended, causally entangled episodes. Prior systems fall into two main classes: flat memory approaches, which enable scalable storage and retrieval but operate as unstructured repositories, and graph-based approaches, which impose relational structure but incur high construction costs, cascaded inference latency, and susceptibility to error propagation. This efficiency–structure trade-off has constrained agentic LLMs’ capacity for robust long-range, multi-hop temporal reasoning.

Figure 1: Three paradigms of Memory systems.

The StructMem Architecture: Hierarchical, Temporally-Grounded Memory

StructMem introduces a hierarchical memory framework centered on event-based units, targeting the preservation of both rich context and cross-temporal relationships without explicit symbolic graph schemas or heavy-weight entity disambiguation.

Figure 2: StructMem's hierarchical memory organization. Event-Level Binding constructs event-level structure by extracting dual perspectives and anchoring them temporally. Cross-Event Consolidation constructs cross-event structure through semantic retrieval, event reconstruction, and consolidation synthesis.

Event-Level Binding leverages a dual-perspective extraction protocol, employing targeted prompts to parse each utterance into:

Factual entries: atomic event content descriptors.
Relational entries: interpersonal dynamics, causality, and dependencies.

All extracted entries are temporally anchored, producing fine-grained, timestamped event units. This design maintains explicit linkages between content and context at memory formation time.
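The event-level binding step can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the names MemoryEntry and bind_event, the field layout, and the two extractor callables (stand-ins for the targeted LLM prompts) are all assumptions, and the utterance and date are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, List


@dataclass(frozen=True)
class MemoryEntry:
    kind: str            # "factual" or "relational" (the two perspectives)
    text: str            # extracted content
    speaker: str
    timestamp: datetime  # temporal anchor shared by entries of one utterance


# Hypothetical extractor type: in a real system this would wrap an LLM prompt.
Extractor = Callable[[str], List[str]]


def bind_event(utterance: str, speaker: str, timestamp: datetime,
               extract_factual: Extractor,
               extract_relational: Extractor) -> List[MemoryEntry]:
    """Extract both perspectives and attach the same timestamp to each entry."""
    entries = [MemoryEntry("factual", t, speaker, timestamp)
               for t in extract_factual(utterance)]
    entries += [MemoryEntry("relational", t, speaker, timestamp)
                for t in extract_relational(utterance)]
    return entries


# Stub extractors so the sketch runs without a model (illustrative data only).
entries = bind_event(
    "I went to the pride festival with Melanie.",
    "Caroline",
    datetime(2023, 7, 1),
    extract_factual=lambda u: ["Caroline attended a pride festival"],
    extract_relational=lambda u: ["Caroline attended together with Melanie"],
)
```

Because both entries carry the same timestamp and speaker, the factual content and its relational context stay linked from the moment the memory is formed, which is the property the paragraph above describes.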

Cross-Event Consolidation implements a periodic synthesis stage: after accumulating events over a temporal buffer, semantically similar historical entries are retrieved using embedding similarity, forming event clusters for consolidation. Rather than lossy text summarization or rigid graph merging, consolidation synthesizes new relational hypotheses—structuring episodic clusters and generating abstractions critical for multi-hop and temporal reasoning.
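The buffer-then-consolidate loop can be sketched as follows. This is a simplified sketch under stated assumptions: the buffer is flushed by entry count rather than by a temporal window, the similarity threshold and top-k values are arbitrary, the class and function names are hypothetical, and synthesize stands in for the LLM call that would produce a relational abstraction.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Item = Tuple[str, Vector]  # (entry text, embedding)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_similar(query: Vector, history: List[Item],
                     top_k: int = 3, min_sim: float = 0.5) -> List[str]:
    """Return texts of the most similar historical entries above a threshold."""
    scored = [(cosine(query, vec), text) for text, vec in history]
    scored = [s for s in scored if s[0] >= min_sim]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [text for _, text in scored[:top_k]]


class ConsolidationBuffer:
    """Accumulates entries and consolidates them in batches (assumed design)."""

    def __init__(self, buffer_size: int,
                 synthesize: Callable[[List[str]], str]) -> None:
        self.buffer_size = buffer_size
        self.synthesize = synthesize      # hypothetical stand-in for an LLM call
        self.buffer: List[Item] = []
        self.history: List[Item] = []
        self.consolidated: List[str] = []

    def add(self, text: str, vec: Vector) -> None:
        self.buffer.append((text, vec))
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self) -> None:
        # Each buffered entry is clustered with similar *historical* entries.
        for text, vec in self.buffer:
            related = retrieve_similar(vec, self.history)
            if related:
                self.consolidated.append(self.synthesize([text] + related))
        self.history.extend(self.buffer)
        self.buffer.clear()


buf = ConsolidationBuffer(2, synthesize=lambda cluster: " | ".join(cluster))
for text, vec in [("a", [1.0, 0.0]), ("b", [0.0, 1.0]),
                  ("c", [1.0, 0.1]), ("d", [0.0, 1.0])]:
    buf.add(text, vec)
```

In this toy run the first flush finds no history to consolidate against, while the second pairs "c" with "a" and "d" with "b". Consolidation work happens only at flush time, not on every incoming entry, which is the batching idea the section attributes to StructMem.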

Experimental Methodology and Comparative Evaluation

The evaluation targets the LoCoMo benchmark, designed for very long-term, multi-session dialogues (average: 588 turns, 16.6K tokens). StructMem is compared with RAG-based retrieval, flat memory, and multiple structural/graph memory paradigms, using a controlled backbone (gpt-4o-mini, text-embedding-3-small) and automatic LLM-as-a-judge evaluation.

Metrics:

Effectiveness: Overall, Multi-hop, Open-domain QA, Single-hop, and Temporal reasoning accuracy.
Efficiency: Token usage, API calls, runtime during memory construction.
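The effectiveness metrics above reduce to per-category accuracy over judge verdicts. The sketch below assumes each question has a category label and a boolean verdict from the judge model, and it assumes that "Overall" is a micro-average over all questions; the source does not state how the overall figure is aggregated, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_category(judged: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (category, judge_verdict) pairs into accuracy per category."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for category, is_correct in judged:
        totals[category] += 1
        correct[category] += int(is_correct)
    report = {c: correct[c] / totals[c] for c in totals}
    n = sum(totals.values())
    # Assumption: overall accuracy is computed over all questions pooled together.
    report["overall"] = sum(correct.values()) / n if n else 0.0
    return report


# Illustrative verdicts only, not results from the paper.
report = accuracy_by_category([
    ("temporal", True),
    ("temporal", False),
    ("multi-hop", True),
])
```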

Empirical Results: Performance, Efficiency, and Fidelity

StructMem achieves the highest overall performance (76.82% on LoCoMo, gpt-4o-mini judge) compared to all baseline categories; critically, it outperforms both flat and relational graph memories on demanding temporal and multi-hop tasks. For temporal reasoning, StructMem attains an 81.62% score—substantially higher than flat memory or graph memory paradigms.

Moreover, StructMem delivers this effectiveness with exceptional resource efficiency: token consumption and API calls for memory construction are minimized (1.937M tokens, 1056 calls), dramatically lower than graph-based approaches (e.g., Mem0$^\text{g}$: 35.8M tokens, 53K calls). The key technical advantage is the amortization of consolidation cost via event buffering and batch processing, rather than per-event cascading operations.
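To put the construction-cost figures quoted above in proportion, the ratios can be worked out directly. The only assumption is that the rounded "53K calls" is treated as exactly 53,000.

```python
# Memory-construction costs as quoted in the text.
structmem = {"tokens": 1.937e6, "calls": 1056}
mem0_g = {"tokens": 35.8e6, "calls": 53_000}  # "53K" taken as 53,000 (rounded)

token_ratio = mem0_g["tokens"] / structmem["tokens"]
call_ratio = mem0_g["calls"] / structmem["calls"]

print(f"tokens: {token_ratio:.1f}x fewer, API calls: {call_ratio:.1f}x fewer")
# prints "tokens: 18.5x fewer, API calls: 50.2x fewer"
```

On these numbers, StructMem uses roughly 18 times fewer tokens and roughly 50 times fewer API calls than the graph-based baseline during memory construction.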

Figure 3: Analysis of efficiency across memory paradigms and internal mechanisms of StructMem.

Ablation and internal mechanism analysis demonstrate that cross-event consolidation yields genuine multi-hop inference benefits, synthesizing relational abstractions inaccessible by scaling up flat retrieval alone. Flat systems plateau as retrieval increases, while StructMem’s hierarchical synthesis reconstructs implicit, distributed dependencies.

Fidelity analysis shows a hallucination rate for extracted entries of only 2.36% and <1–4% for cross-event consolidations under constrained synthesis. Disabling temporal/concrete grounding constraints raises spurious association rates to >7–20%, empirically confirming StructMem's advantage in grounded abstraction.

Case Example: Joint Participation Temporal Reasoning

StructMem’s advantage is evident in compositional reasoning tasks. In a temporal co-participation query (“When did Caroline and Melanie go to a pride festival together?”), both flat and graph memory fail due to lack of event alignment or explicit co-reference. StructMem’s dual-perspective extraction and synthesis, however, reconstruct the implicit relational context and produce the correct, temporally grounded answer, which the non-hierarchical memories in this comparison fail to reach.

Methodological Considerations, Limitations, and Future Prospects

StructMem exemplifies a shift in the core memory unit: from isolated facts or entity-relation triples to temporally indexed, relationally rich events. The system thereby sidesteps the rigidity and error accumulation endemic to graph-based memory while avoiding the contextual dilution that afflicts flat retrieval.

Nonetheless, prompt sensitivity in dual-perspective extraction and lack of conflict resolution mechanisms currently limit robustness in dynamic, evolving dialogue scenarios. Extensions in automated prompt optimization, memory updating, and decay strategies represent immediate research trajectories.

The paradigm also scales favorably within agentic settings (multi-agent, multi-session, long-horizon), given its efficiency and resilience to knowledge drift/hallucination—a prerequisite as conversational LLM agents move toward open-ended, autonomous operation.

Conclusion

StructMem sets a new benchmark for memory architectures supporting long-horizon agent reasoning. By unifying event-centric representation, dual-perspective contextualization, and batch cross-event consolidation, it delivers both superior effectiveness on complex, temporally-structured queries and highly efficient resource utilization. These advances underline a theoretical and practical reorientation toward episodic, hierarchical organization in agentic LLMs, promising more tractable and semantically robust agent memory for open-ended, real-world deployment (2604.21748).
