Nemori: Adaptive Memory for LLM Agents
- Nemori is a self-organizing agent memory architecture designed to overcome the 'agent amnesia' of LLM-based systems by integrating episodic segmentation and proactive semantic memory distillation.
- It employs a Two-Step Alignment Principle and a Predict-Calibrate Principle to structure and refine both episodic and semantic memories in real-time conversational environments.
- Nemori demonstrates superior long-range contextual understanding and token efficiency on benchmarks like LoCoMo and LongMemEval, outperforming traditional static memory systems.
Nemori is a self-organizing agent memory architecture for LLM-based agents, designed to address the persistent memory limitations inherent in current systems. Drawing inspiration from cognitive science, Nemori integrates episodic segmentation and proactive knowledge distillation through two foundational principles: the Two-Step Alignment Principle, informed by Event Segmentation Theory, and the Predict-Calibrate Principle, grounded in the Free-Energy Principle. Nemori equips autonomous agents with both persistent and adaptively evolving memory, suitable for long-term, streaming conversational settings, and demonstrates state-of-the-art empirical performance on benchmarks demanding long-range contextual understanding (Nan et al., 5 Aug 2025).
1. Motivation and Problem Setting
LLM agents, despite strong performance in isolated sessions, exhibit "agent amnesia": all conversational history is lost between sessions. This limitation is rooted in two technical factors—quadratic attention scaling, restricting feasible context length, and the "Lost in the Middle" phenomenon, which degrades retrieval and reasoning over lengthy contexts. Retrieval-Augmented Generation (RAG) and early Memory-Augmented Generation (MAG) systems have made partial progress, but are fundamentally limited:
- They operate on static, pre-indexed knowledge rather than true online conversational streams.
- They segment data using heuristics or fixed-size chunks, not semantically coherent units.
- Their fact extraction is passive and rule-based, precluding any adaptation or evolution of agent knowledge.
For truly autonomous, long-term interactive agents, memory requires both cross-session persistence and the ability to evolve through interaction, paralleling human episodic and semantic memory formation processes. Nemori is designed to realize these dual cognitive processes in LLM-based agents (Nan et al., 5 Aug 2025).
2. Core Architectural Modules
Nemori consists of three interacting modules, backed by a unified vector-based retrieval engine:
- Topic Segmentation (Boundary Alignment): Buffers incoming messages, applying an LLM-based boundary detector to determine episodic boundaries.
- Episodic Memory Generation (Representation Alignment): Upon boundary detection, converts a segment into a structured episodic memory—comprising a concise title and a third-person narrative—supporting event chunking.
- Semantic Memory Generation (Predict-Calibrate Cycle): Employs the episode title to retrieve related semantic memories, predict episode content, identify prediction gaps, and distill novel knowledge, fostering a continually adapting semantic memory base.
Both episodic and semantic stores are indexed for high-efficiency vector retrieval, facilitating rapid, contextually relevant access during subsequent reasoning tasks (Nan et al., 5 Aug 2025).
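The dual-store layout described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than Nemori's actual implementation: the `MemoryStore` class, the toy `embed` function (a deterministic bag-of-words hash standing in for a real embedding model), and the sample entries are all hypothetical.

```python
import math
from dataclasses import dataclass, field


def embed(text: str, dim: int = 64) -> list[float]:
    """Toy hashed bag-of-words embedding; a stand-in for a real encoder."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # Deterministic bucket choice (avoids Python's per-process hash salt).
        vec[sum(ord(ch) for ch in token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class MemoryStore:
    """Hypothetical in-memory vector index over text entries."""
    entries: list[str] = field(default_factory=list)
    vectors: list[list[float]] = field(default_factory=list)

    def add(self, text: str) -> None:
        self.entries.append(text)
        self.vectors.append(embed(text))

    def search(self, query: str, top_k: int = 3) -> list[str]:
        """Return the top_k entries ranked by cosine similarity to the query."""
        q = embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), i)
            for i, v in enumerate(self.vectors)
        ]
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [self.entries[i] for _, i in scored[:top_k]]


# One store per memory type, both served by the same retrieval code.
episodic = MemoryStore()
semantic = MemoryStore()
episodic.add("Trip planning: the user discussed a hiking trip to Kyoto in spring")
episodic.add("Work update: the user described a new job at a robotics startup")
semantic.add("The user enjoys hiking")
semantic.add("The user works in robotics")

print(episodic.search("hiking trip", top_k=1))
print(semantic.search("robotics", top_k=1))
```

In a real deployment the hashing embedding would be replaced by a learned encoder and the linear scan by an approximate nearest-neighbour index; the sketch only shows that episodic and semantic entries can share one retrieval path.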
3. Two-Step Alignment Principle
The Two-Step Alignment Principle operationalizes cognitive Event Segmentation Theory for agent memory systems:
- Boundary Alignment: Formally, for a message buffer $B$, a boundary detector computes a boolean indicator $b$ and a confidence $c$. Segmentation is triggered if a high-confidence event boundary is detected (confidence above a threshold $\sigma$) or the buffer reaches a predefined maximum length $\beta_{\max}$:
$$\text{trigger} = (b \land c > \sigma) \lor (|B| \ge \beta_{\max})$$
Upon $\text{trigger} = \text{true}$, $B$ is emitted as an episode and the buffer is reset.
- Representation Alignment: The raw segment $B$ is passed to an episode generator to produce $e = (t, n)$, where $t$ is a title summarizing the segment and $n$ a third-person narrative, both inserted into the Episodic Memory Database. The title $t$ serves as a key for triggering semantic memory learning (Nan et al., 5 Aug 2025).
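The boundary-alignment rule can be sketched as a small buffering class. This is a minimal illustration under stated assumptions: the `Segmenter` class, the `threshold` and `max_len` defaults, and the keyword-based `toy_detector` (which stands in for the LLM-based boundary detector) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical detector signature: buffered messages -> (is_boundary, confidence).
Detector = Callable[[list[str]], tuple[bool, float]]


@dataclass
class Segmenter:
    detect: Detector
    threshold: float = 0.7   # assumed confidence threshold
    max_len: int = 25        # assumed maximum buffer length
    buffer: list[str] = field(default_factory=list)

    def push(self, message: str) -> Optional[list[str]]:
        """Buffer a message; return the finished segment if a boundary fires."""
        self.buffer.append(message)
        is_boundary, confidence = self.detect(self.buffer)
        confident_boundary = is_boundary and confidence > self.threshold
        buffer_full = len(self.buffer) >= self.max_len
        if confident_boundary or buffer_full:
            segment, self.buffer = self.buffer, []  # emit and reset
            return segment
        return None


def toy_detector(messages: list[str]) -> tuple[bool, float]:
    """Stand-in for the LLM call: treats a thank-you as the close of a topic."""
    closed = "thanks" in messages[-1].lower()
    return closed, (0.9 if closed else 0.1)


seg = Segmenter(detect=toy_detector, max_len=4)
stream = [
    "Can you book a flight?", "Sure, for which date?", "Friday. Thanks!",
    "What's the weather?", "Sunny.", "And tomorrow?", "Rain.",
]
episodes = [s for m in stream if (s := seg.push(m)) is not None]
print([len(e) for e in episodes])
```

Here the first segment is closed by the detector and the second by the length cap, covering both branches of the trigger condition. Each emitted segment would then be handed to the episode generator to produce a title and narrative.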
4. Predict-Calibrate Principle
Inspired by the Free-Energy Principle, Nemori's Predict-Calibrate Principle comprises a multistage process:
- Prediction Stage: Given a new episode $e = (t, n)$ with title $t$ and narrative $n$, a dense vector search retrieves relevant semantic memories $K_{\text{rel}}$ from the Semantic Memory Database $K$. An LLM-based predictor $f_{\text{pred}}$ forecasts the episode content $\hat{e}$ using $t$ and $K_{\text{rel}}$.
$$K_{\text{rel}} = \text{Retrieve}(e, K)$$
$$\hat{e} = f_{\text{pred}}(t, K_{\text{rel}})$$
- Calibration Stage: The predicted episode $\hat{e}$ is compared not to the narrative $n$, but to the original message buffer $B$. A semantic knowledge distiller $f_{\text{distill}}$ extracts novel facts representing the prediction gap:
$$K_{\text{new}} = f_{\text{distill}}(\hat{e}, B)$$
- Integration Stage: Newly distilled facts $K_{\text{new}}$ are merged into the Semantic Memory Database $K$. While Nemori does not directly use gradient descent to update LLM parameters, this external knowledge refinement mirrors variational free-energy minimization strategies in cognitive science (Nan et al., 5 Aug 2025).
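The three stages can be sketched as a single function that takes the retriever, predictor, and distiller as pluggable callables. This is an illustrative sketch only: `predict_calibrate` and the three `toy_*` helpers are hypothetical names, and the helpers use simple string matching where Nemori would call a vector search and an LLM.

```python
from typing import Callable


def predict_calibrate(
    title: str,
    raw_messages: list[str],
    semantic_db: list[str],
    retrieve: Callable[[str, list[str]], list[str]],
    predict: Callable[[str, list[str]], str],
    distill: Callable[[str, list[str]], list[str]],
) -> list[str]:
    """Run one cycle; return the facts newly merged into semantic_db."""
    relevant = retrieve(title, semantic_db)   # prediction stage: recall
    predicted = predict(title, relevant)      # prediction stage: forecast
    gap = distill(predicted, raw_messages)    # calibration against raw messages
    new_facts: list[str] = []
    for fact in gap:                          # integration stage, deduplicated
        if fact not in semantic_db and fact not in new_facts:
            new_facts.append(fact)
    semantic_db.extend(new_facts)
    return new_facts


# Toy stand-ins (assumptions): word overlap for retrieval, string
# concatenation for prediction, substring checks for distillation.
def toy_retrieve(title: str, db: list[str]) -> list[str]:
    words = set(title.lower().split())
    return [fact for fact in db if words & set(fact.lower().split())]


def toy_predict(title: str, relevant: list[str]) -> str:
    return " ".join([title, *relevant])


def toy_distill(predicted: str, raw_messages: list[str]) -> list[str]:
    known = predicted.lower()
    return [m for m in raw_messages if m.lower() not in known]


db = ["user likes hiking"]
messages = ["user likes hiking", "user bought new boots"]
added = predict_calibrate(
    "hiking plans", messages, db, toy_retrieve, toy_predict, toy_distill
)
print(added)
print(db)
```

Only the message that the prediction failed to anticipate is stored, which is the sense in which distillation targets the prediction gap rather than re-extracting everything from the episode.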
5. Empirical Evaluation
Nemori's efficacy is evaluated on two primary benchmarks:
- LoCoMo: Consists of 10 dialogues averaging 24K tokens with 1,540 questions over four reasoning types.
- LongMemEval$_S$: Consists of 500 conversations averaging 105K tokens, focused on scalability testing.
Methods compared include FullContext (upper-bound LLM context), RAG-4096 (static chunking), and MAG systems such as LangMem, Zep, and Mem0. Key metrics are LLM-Score (GPT-4 judges), F1, and BLEU-1. The backbone LLMs used are GPT-4o-mini and GPT-4.1-mini, with retrieval of top-$k$ episodic and top-$m$ semantic memories at their default settings.
Nemori achieves:
- On LoCoMo (GPT-4o-mini), Nemori scores 0.744 LLM-Score, surpassing FullContext (0.723), with only 12% of the token usage.
- On LongMemEval$_S$, average accuracy is 64.2% vs. 55.0% (FullContext), requiring only 3.7–4.8K tokens, a 95% reduction (Nan et al., 5 Aug 2025).
| Method | LLM-Score | F1 | BLEU-1 |
|---|---|---|---|
| FullContext | 0.723 | 0.462 | 0.378 |
| LangMem | 0.513 | 0.358 | 0.294 |
| Mem0 | 0.613 | 0.415 | 0.342 |
| RAG-4096 | 0.302 | 0.208 | 0.164 |
| Zep | 0.585 | 0.375 | 0.309 |
| Nemori | 0.744 | 0.495 | 0.385 |
Nemori's advantage is especially pronounced in temporal reasoning, where accurate episode-level organization and semantic memory retrieval are essential.
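The token-efficiency figures reported above can be sanity-checked with a few lines of arithmetic. The sketch assumes that FullContext consumes roughly the average conversation length of each benchmark (105K tokens for the long-conversation benchmark, 24K for LoCoMo); the paper's exact accounting may differ.

```python
def reduction(context_tokens: float, used_tokens: float) -> float:
    """Fraction of context tokens saved relative to the full context."""
    return 1.0 - used_tokens / context_tokens


# Long-conversation benchmark: 3.7K to 4.8K tokens against ~105K of context.
full_context = 105_000  # assumed equal to the average conversation length
for used in (3_700, 4_800):
    print(f"{used} tokens -> {reduction(full_context, used):.1%} reduction")

# LoCoMo: 12% of a ~24K-token dialogue.
locomo_budget = 0.12 * 24_000
print(f"LoCoMo budget: about {locomo_budget:.0f} tokens")
```

Under these assumptions both ends of the 3.7–4.8K range correspond to a saving of at least 95%, consistent with the reported reduction, and the LoCoMo figure works out to a budget of roughly 2.9K tokens per query.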
6. Ablation, Analysis, and Implications
Ablation studies on LoCoMo demonstrate:
- Removing both episodic and semantic memory ("w/o Nemori") leads to near-zero performance.
- Direct semantic extraction (Nemori-s) is inferior (0.518 LLM-Score) to the full Predict-Calibrate approach (0.615 for "w/o e"), confirming that proactive distillation outperforms passive methods.
- Dropping episodic memory ("w/o e") degrades performance (to 0.615) more than removing semantic memory ("w/o s": 0.705), indicating their complementarity.
Performance plateaus beyond top-3 episodic retrieval, suggesting diminishing returns for larger retrieval sets. Qualitative examples show Nemori's episodic segmentation successfully groups semantically related conversational turns that baseline chunking methods fragment, yielding better factual coherence on retrieval and reasoning tasks. On the more challenging LongMemEval$_S$, Nemori maintains a lead over FullContext, particularly for user-preference questions. Slight accuracy declines on assistant-focused queries suggest challenges in preserving highly granular information over extremely long spans, a candidate for future investigation (Nan et al., 5 Aug 2025).
7. Future Directions
Planned developments for Nemori include end-to-end fine-tuning of Predict-Calibrate components and further integration with memory-efficient LLM architectures. Extension to multimodal episodic memory—incorporating sensory or visual data—remains an area for future research. The architectural framework established by Nemori lays groundwork for truly autonomous, self-evolving agents with human-like, long-term memory organization and adaptive knowledge evolution (Nan et al., 5 Aug 2025).