Memori: Scalable Memory for LLM Agents

Updated 2 July 2026

Memori is a structured memory framework for LLM agents that converts unstructured interactions into semantic triples and summaries for precise context retrieval.
It employs a hybrid retrieval method combining vector search and BM25 keyword matching to drastically compress context windows and reduce token usage.
On the LoCoMo benchmark, Memori reports the highest accuracy among the retrieval-based systems compared and supports multi-turn temporal reasoning, while using less than 5% of the full-context token footprint.

Memori denotes a class of memory-structuring and retrieval architectures for LLM agents, emphasizing persistent, context-aware, and cost-efficient recall across extended interactions. In both academic and applied settings, Memori systems supersede naïve transcript concatenation by adopting explicit data structuring—primarily semantic triples and conversation summaries—thus allowing highly scalable, precise, and LLM-agnostic context retrieval during inference. Such architectures are increasingly favored for advanced conversational AI, personal agent design, and memory evaluation research.

1. Definition, Motivation, and Theoretical Principles

Memori addresses the intrinsic limitations of prior memory approaches for LLM agents, where memory management was treated as linear transcript growth or proprietary storage integration. Traditional methods result in "token explosion" (prompt context ballooning up to 26,000 tokens per turn), "context rot" (loss of salient facts as windows increase), and "vendor lock-in" (tightly coupled memory–LLM solutions). Memori reframes agent memory as a data-structuring problem: noisy, unstructured interactions are transformed into compact, high-signal artifacts—semantic triples and summaries—facilitating efficient retrieval and coherent reasoning even under strict token and cost constraints (Borro et al., 20 Mar 2026).

Key theoretical underpinnings include:

LLM-Agnostic Memory Layer: Decoupling memory management from any particular model or API. Memory is maintained outside the LLM, ensuring modularity and portability.
Structured Representations: Use of atomic, semantically rich triples (subject, predicate, object) and concise conversation summaries.
Persistent, Time-stamped Storage: All memory units are indexed with timestamps, session identifiers, and linkage pointers, supporting robust temporal and multi-turn reasoning.
Context Compression: Substantially reducing the token footprint required for effective retrieval—compressing to ≈5% of the raw context window without significant loss in accuracy.

2. System Architecture and Data Pipeline

Memori is implemented as an SDK at the API layer, sitting between application logic and the underlying LLM. All prompts pass through Memori, which allows systematic ingestion, structuring, and selective context injection. The primary pipeline consists of:

Advanced Augmentation
- Semantic Triple Extraction: Each utterance is parsed to yield triples (t = (subject, predicate, object)), e.g., (“User”, “prefers”, “vegan dinner”).
- Conversation Summarization: Sessions are periodically condensed to summaries (s_S), tracking intent evolution and decision points; each triple carries a pointer to the associated summary.
Memory Storage and Indexing
- Vector Index: Semantic triples are embedded as vectors {e_t} (e.g., with all-MiniLM or text-embedding-ada-002) and indexed via FAISS or an equivalent library for efficient approximate nearest-neighbor retrieval.
- Text Index: Keyword-driven summary retrieval is enabled via BM25 or similar, over the set of session summaries {s_S}.
Retrieval and Fusion
- For a given query q:
  1. Compute e_q = E(q) (embedding).
  2. Retrieve top-k triples by cosine similarity.
  3. Retrieve top-m summaries using keyword search.
  4. Merge and score candidates, optionally reranking with hybrid (cosine + BM25) score.
  5. Inject the final selection into the LLM prompt (Borro et al., 20 Mar 2026).
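The five retrieval steps above can be sketched in a few lines. This is a minimal illustration, not Memori's implementation: the bag-of-words `embed` function stands in for a real embedding model E(·), `keyword_overlap` stands in for BM25, and all function names and sample data are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    # Stand-in for a real embedding model E(.): a bag-of-words count vector.
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def keyword_overlap(query, text):
    # Stand-in for BM25: number of distinct query terms present in the text.
    return len(set(tokenize(query)) & set(tokenize(text)))


def retrieve(query, triples, summaries, k=2, m=1):
    e_q = embed(query)                                           # step 1
    top_triples = sorted(triples, key=lambda t: cosine(e_q, embed(t)),
                         reverse=True)[:k]                       # step 2
    top_summaries = sorted(summaries, key=lambda s: keyword_overlap(query, s),
                           reverse=True)[:m]                     # step 3
    return top_triples, top_summaries                            # step 4, without reranking


def build_prompt(query, triples, summaries):
    # Step 5: inject only the selected memory into the prompt.
    facts = "\n".join(f"- {t}" for t in triples)
    context = "\n".join(f"- {s}" for s in summaries)
    return f"Known facts:\n{facts}\nSession summaries:\n{context}\nUser: {query}"


# Illustrative toy data (not from the paper).
triples = ["User prefers vegan dinner", "User visited Paris", "User loves hiking"]
summaries = ["User planned a trip to Europe and booked flights to Paris.",
             "User asked for vegan dinner recipes for the week."]
query = "What dinner does the user prefer?"
top_t, top_s = retrieve(query, triples, summaries)
prompt = build_prompt(query, top_t, top_s)
print(prompt)
```

In a real deployment the linear scans would presumably be replaced by an approximate nearest-neighbor index for triples and an inverted index for summaries; the control flow would stay the same.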

Internal Data Representation

Component	Example Entry	Purpose
Triple	(“User”, “visited”, “Paris”), timestamp, session ID, summary	Factual, atomic knowledge unit
Summary	“User planned a trip to Europe ...”, timestamp range	Narrative, temporal context
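One way to represent the two record types in the table is a pair of small dataclasses. This is a sketch under assumptions: the field names (`summary_id`, `obj`, `start`, `end`) and the timestamps are hypothetical and are not taken from the Memori SDK.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Summary:
    summary_id: str
    session_id: str
    text: str
    start: datetime  # beginning of the covered time range
    end: datetime    # end of the covered time range


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str
    timestamp: datetime
    session_id: str
    summary_id: Optional[str] = None  # back-pointer to the linked summary

    def as_text(self) -> str:
        """Render the triple as a short string for embedding or prompting."""
        return f"{self.subject} {self.predicate} {self.obj}"


# Illustrative values only.
summary = Summary("sum-1", "sess-1", "User planned a trip to Europe ...",
                  datetime(2026, 1, 5, 9, 0), datetime(2026, 1, 5, 9, 30))
triple = Triple("User", "visited", "Paris",
                datetime(2026, 1, 5, 9, 12), "sess-1", summary.summary_id)
print(triple.as_text(), "->", triple.summary_id)
```

Storing the summary identifier on each triple, rather than the summary text, keeps triples compact and lets several triples share one narrative record.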

3. Retrieval Algorithms and Token Efficiency

The Memori retrieval module ensures that only the most salient information is presented to the LLM at inference time, which sharply reduces prompt size. The strategy runs two retrieval paths in parallel and then merges them:

Semantic Querying: Cosine similarity in embedding space between the query and stored triples, returning the k-nearest triples.
Keyword-based Retrieval: BM25 (or equivalent) keyword search over summaries, efficiently surfacing relevant narrative context.
Hybrid Scoring and Merging: Final selection leverages a tunable hybrid score, e.g.:

$\mathrm{score}_{\mathrm{hybrid}}(t) = \lambda \cdot \mathrm{sim}_{\cos}(e_q, e_t) + (1-\lambda) \cdot \mathrm{BM25}(q, \mathrm{link}_t)$
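The hybrid score can be illustrated with a small pure-Python Okapi BM25 and a weighted sum. This is a sketch under assumptions: the BM25 parameters `k1` and `b`, the rescaling of BM25 to [0, 1], the value of lambda, and the cosine similarities are all illustrative choices, and link_t is read here as the summary linked to triple t.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document in docs for the query terms."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5)
                           / (doc_freq[term] + 0.5) + 1.0)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avg_len))
        scores.append(score)
    return scores


def hybrid_score(cos_sim, bm25, lam=0.5):
    """lam * cosine similarity + (1 - lam) * BM25 of the linked summary."""
    return lam * cos_sim + (1.0 - lam) * bm25


# Two candidate triples, each linked to one (tokenized) summary.
linked_summaries = [
    "user planned a trip to europe".split(),
    "user asked about vegan dinner recipes".split(),
]
bm25 = bm25_scores("vegan dinner".split(), linked_summaries)
top = max(bm25) or 1.0
bm25_scaled = [s / top for s in bm25]  # assumption: rescale BM25 to [0, 1]
cos_sims = [0.10, 0.80]                # made-up cosine similarities
scores = [hybrid_score(c, s, lam=0.7) for c, s in zip(cos_sims, bm25_scaled)]
print(scores)
```

Raw BM25 scores are unbounded while cosine similarity lies in [-1, 1], so some rescaling is usually needed before mixing them; the source does not say which normalization, if any, Memori applies.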

This dual mechanism ensures both factual precision and context continuity without monolithic history scans. On the LoCoMo benchmark, Memori compresses context to an average of 1,294 tokens per query (~5% of the 26,031-token full history), yielding token and cost reductions of approximately 20× versus full-context methods (Borro et al., 20 Mar 2026).

4. Empirical Performance and Comparative Evaluation

On the LoCoMo long-term memory benchmark, Memori reports the highest accuracy among the retrieval-based methods compared, at the smallest token footprint:

Method	Overall Accuracy (%)	Tokens/query	Footprint (%) (*)
Memori	81.95	1,294	4.97
Zep	79.09	3,911	15.02
LangMem	78.05	10,863	41.73
Full-Context	87.52	26,031	100.00

*Footprint: proportion of the full-history context window.

Memori outperforms the other retrieval-based systems on overall accuracy (by 2.86 percentage points over Zep) and approaches the full-context baseline while using less than 5% of the context budget (Borro et al., 20 Mar 2026). Cost scales with token count: at $0.8 per 1M tokens, Memori costs about $0.001 per query, compared to about $0.021 for full-context.
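The footprint and cost figures follow directly from the token counts in the table. The short check below recomputes them; the helper name `footprint_and_cost` is hypothetical, and the price is the $0.8 per 1M tokens quoted above.

```python
def footprint_and_cost(tokens, full_tokens, usd_per_million=0.8):
    """Return (footprint in % of full context, cost per query in USD)."""
    footprint_pct = 100.0 * tokens / full_tokens
    cost_usd = tokens * usd_per_million / 1_000_000
    return footprint_pct, cost_usd


tokens_per_query = {"Memori": 1294, "Zep": 3911,
                    "LangMem": 10863, "Full-Context": 26031}
full = tokens_per_query["Full-Context"]

for method, tokens in tokens_per_query.items():
    pct, cost = footprint_and_cost(tokens, full)
    print(f"{method:13s} footprint={pct:6.2f}%  cost=${cost:.4f}")

# Ratio of full-context tokens to Memori tokens (the "approximately 20x" figure).
reduction = full / tokens_per_query["Memori"]
print(f"reduction factor: {reduction:.1f}x")
```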

5. Multi-Session, Multi-Turn, and Temporal Reasoning

By associating every triple with timestamps and summary-level back-pointers, Memori supports:

Longitudinal Reasoning: Recall of facts/events spanning days, weeks, or indefinite agent deployments.
Dynamic User Profiles: Automatic user model updates across multiple sessions, supporting personalization and evolving preferences.
Temporal Resolution: Support for temporal and multi-hop queries via summary linkage and explicit timestamp sorting.

Example:
- Session 1: “I love hiking.” → store (“User”, “loves”, “hiking”), summary_1.
- Session 2: “Do you remember what I love?” → retrieval recomposes the triple and summary_1 → LLM reply: “You love hiking.”
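The hiking example can be reproduced with a toy store that persists triples across sessions and sorts matches by timestamp. This is only a sketch: the `TripleStore` class, its methods, and the timestamp are hypothetical, and exact matching on subject and predicate stands in for the embedding-based retrieval described earlier.

```python
from datetime import datetime


class TripleStore:
    """Toy persistent store: keeps triples across sessions with timestamps."""

    def __init__(self):
        self._rows = []

    def add(self, subject, predicate, obj, session_id, timestamp, summary_id=None):
        self._rows.append({"subject": subject, "predicate": predicate,
                           "object": obj, "session_id": session_id,
                           "timestamp": timestamp, "summary_id": summary_id})

    def recall(self, subject, predicate):
        """Return matching triples, newest first (explicit timestamp sorting)."""
        hits = [r for r in self._rows
                if r["subject"] == subject and r["predicate"] == predicate]
        return sorted(hits, key=lambda r: r["timestamp"], reverse=True)


store = TripleStore()

# Session 1: "I love hiking."
store.add("User", "loves", "hiking", "session_1",
          datetime(2026, 1, 1, 10, 0), summary_id="summary_1")

# Session 2: "Do you remember what I love?"
hits = store.recall("User", "loves")
reply = f"You love {hits[0]['object']}."
print(reply)
```

Because results are ordered newest first, a later statement of a changed preference would take precedence over an older one without deleting the history.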

6. Implementation Best Practices and Architectural Insights

Implementation best practices highlighted by Memori include:

Use of a decoupled SDK to separate memory management from LLM backbones.
Priority for structured, atomic facts (triples) over raw text chunks to reduce noise and enhance retrieval accuracy.
Batch compaction of older memories via summarization to prevent index bloat as history grows.
Hybrid retrieval (cosine + BM25) for balancing semantic and keyword matches.
Persistent linkage between triples and conversation summaries ensures grounding in narrative context, pivotal for both factual and multi-turn reasoning.
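Batch compaction, as listed above, might look like the following. This is a sketch under stated assumptions: the source does not describe the compaction procedure, so grouping by session, the cutoff rule, and the record layout are hypothetical, and `summarize` is a placeholder for an LLM summarization call.

```python
from collections import defaultdict
from datetime import datetime


def compact(triples, cutoff, summarize):
    """Replace triples older than cutoff with one summary per session.

    triples:   list of dicts with 'text', 'session_id', 'timestamp'.
    summarize: callable taking a list of strings and returning one string
               (a stand-in for an LLM summarization call).
    """
    recent, old_by_session = [], defaultdict(list)
    for t in triples:
        if t["timestamp"] < cutoff:
            old_by_session[t["session_id"]].append(t)
        else:
            recent.append(t)

    summaries = []
    for session_id, old in old_by_session.items():
        old.sort(key=lambda t: t["timestamp"])
        summaries.append({
            "session_id": session_id,
            "text": summarize([t["text"] for t in old]),
            "start": old[0]["timestamp"],  # covered time range
            "end": old[-1]["timestamp"],
        })
    return recent, summaries


# Illustrative data; joining the facts is a trivial placeholder summarizer.
memory = [
    {"text": "User loves hiking", "session_id": "s1", "timestamp": datetime(2026, 1, 1)},
    {"text": "User visited Paris", "session_id": "s1", "timestamp": datetime(2026, 1, 2)},
    {"text": "User prefers vegan dinner", "session_id": "s2", "timestamp": datetime(2026, 3, 1)},
]
recent, summaries = compact(memory, datetime(2026, 2, 1), lambda facts: "; ".join(facts))
print(len(recent), summaries[0]["text"])
```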

7. Broader Impact and Future Trajectories

Memori’s paradigm articulates the broader trend of treating memory as a critical, structured asset for LLM-backed agents, with practical and theoretical implications:

Enables scalable, cost-efficient deployment of personalized, multi-session AI agents across industry applications.
Suggests a modular blueprint for interoperable and LLM-agnostic memory design, aligning with both privacy-preserving requirements and industry standards.
Positions memory structuring and information retrieval—not model scale or transcript length—as the bottleneck and key determinant for agentic coherence and utility (Borro et al., 20 Mar 2026).

Empirically, Memori bridges the performance gap between simple retrieval (RAG) and full-context methods, indicating diminishing returns for raw context window expansion compared to principled memory structuring.

References

"Memori: A Persistent Memory Layer for Efficient, Context-Aware LLM Agents" (Borro et al., 20 Mar 2026)
