
Memori: A Persistent Memory Layer for Efficient, Context-Aware LLM Agents

Published 20 Mar 2026 in cs.LG | (2603.19935v1)

Abstract: As LLMs evolve into autonomous agents, persistent memory at the API layer is essential for enabling context-aware behavior across LLMs and multi-session interactions. Existing approaches force vendor lock-in and rely on injecting large volumes of raw conversation into prompts, leading to high token costs and degraded performance. We introduce Memori, an LLM-agnostic persistent memory layer that treats memory as a data structuring problem. Its Advanced Augmentation pipeline converts unstructured dialogue into compact semantic triples and conversation summaries, enabling precise retrieval and coherent reasoning. Evaluated on the LoCoMo benchmark, Memori achieves 81.95% accuracy, outperforming existing memory systems while using only 1,294 tokens per query (~5% of full context). This results in substantial cost reductions, including 67% fewer tokens than competing approaches and over 20x savings compared to full-context methods. These results show that effective memory in LLM agents depends on structured representations instead of larger context windows, enabling scalable and cost-efficient deployment.

Summary

  • The paper presents a decoupled, LLM-agnostic persistent memory layer that achieves 81.95% accuracy with significantly reduced token usage compared to full-context methods.
  • The paper details an Advanced Augmentation pipeline that transforms noisy dialogues into structured semantic triples and summaries, enhancing precise information retrieval.
  • The paper demonstrates substantial cost efficiency, reducing token consumption by up to 95% and enabling scalable, enterprise-level deployment of LLM agents.


Motivation and Problem Statement

LLM-based agents increasingly require persistent memory to enable context-aware, cross-session behavior, especially as deployment scenarios demand continuity and adaptation. Memory in this paradigm is not merely a matter of storing historical data, but a challenge of efficiently structuring and retrieving salient information to maximize reasoning capacity while minimizing token and operational costs. Existing solutions inject raw conversation history into prompts, which drives excessive token consumption and degrades performance through context rot and instability, while their tight coupling to specific providers creates vendor lock-in.

Architectural Overview

Memori proposes a decoupled, LLM-agnostic persistent memory layer integrated at the API level. The core innovation is the Advanced Augmentation pipeline, which transforms noisy, unstructured dialogue into structured semantic triples (subject–predicate–object) and succinct conversation summaries. The memory assets are indexed and managed using a hybrid retrieval system: embeddings (Gemma-300) facilitate semantic similarity search, while BM25 aids in keyword-based retrieval. The integration is seamless, as the Memori SDK wraps LLM clients, intercepts conversational exchanges, and manages memory updates and retrieval natively.
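The hybrid retrieval described above can be sketched as a score fusion between semantic (cosine) and lexical (BM25) signals. This is an illustrative toy implementation, not Memori's actual code; the function names, the min-max-style normalization, and the `alpha` blending weight are assumptions for demonstration.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two dense embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    # Minimal Okapi BM25: score each tokenized doc against the query terms.
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def hybrid_rank(query_vec, query_terms, doc_vecs, doc_tokens, alpha=0.5):
    # Blend semantic and lexical scores; alpha weights the semantic side.
    sem = [cosine(query_vec, v) for v in doc_vecs]
    lex = bm25_scores(query_terms, doc_tokens)
    m = max(lex) or 1.0  # normalize BM25 into [0, 1] before blending
    fused = [alpha * s + (1 - alpha) * (l / m) for s, l in zip(sem, lex)]
    return sorted(range(len(fused)), key=lambda i: -fused[i])
```

In a real deployment the dense vectors would come from the embedding model (the paper uses Gemma-300) and the index from a vector store, but the fusion logic stays this simple.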

Semantic Triple Extraction and Summarization

  • Semantic Triple Generation: Dialogues are parsed and distilled into atomic facts, user preference evolutions, and constraints, yielding low-noise, high-signal representations for efficient retrieval.
  • Conversation Summarization: Summaries preserve the narrative and temporal progression, contextualizing the triples and enabling temporal reasoning. Each triple is directly linked to its originating summary, creating an interconnected memory substrate.

This dual memory structure is central: triples optimize direct retrieval, while summaries reconstruct temporal and contextual dependencies needed for advanced reasoning tasks.
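The link from each triple back to its originating summary can be modeled with two record types. This is a minimal sketch of the dual structure as described; the class and field names are invented for illustration and do not reflect Memori's internal schema.

```python
from dataclasses import dataclass

@dataclass
class Summary:
    # A conversation summary: narrative and temporal context for one session.
    session_id: str
    text: str

@dataclass
class Triple:
    # An atomic subject-predicate-object fact, linked to its source summary
    # so retrieval of a fact can also recover the surrounding narrative.
    subject: str
    predicate: str
    obj: str
    source: Summary

s = Summary("sess-1", "User planned a team dinner and settled on a venue.")
t = Triple("Ava", "prefers", "vegetarian pizza", source=s)
```

Retrieving `t` for a direct question costs a handful of tokens, while following `t.source` recovers the temporal context needed for multi-hop or time-based reasoning.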

Experimental Evaluation

Benchmarking Methodology

Experiments utilize the LoCoMo benchmark, which evaluates an agent’s ability to track, recall, and synthesize information across extended, noisy chat histories with complex state tracking and temporal reasoning requirements. Memori’s pipeline is empirically validated against established memory systems: Zep, LangMem, Mem0, and a Full-Context upper bound. The evaluation employs an LLM-as-a-Judge protocol using GPT-4.1-mini, ensuring consistent grading across the reasoning categories.

Numerical Results

Memori achieves 81.95% overall accuracy, outperforming Zep (79.09%), LangMem (78.05%), and Mem0 (62.47%). The system utilizes only 1,294 tokens per query—just 5% of the full context—yielding approximately 67% fewer tokens than Zep and over 20× savings compared to full-context methods.
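The reported ratios can be sanity-checked with simple arithmetic; the Zep figure below is derived from the "67% fewer tokens" claim rather than reported directly, so treat it as an approximation.

```python
# Reported: 1,294 tokens per query, ~5% of the full conversational context.
memori_tokens = 1294
context_fraction = 0.05

# Implied full-context budget and the resulting savings factor.
full_context_tokens = memori_tokens / context_fraction  # ~25,880 tokens
savings_factor = full_context_tokens / memori_tokens    # ~20x

# "67% fewer tokens than Zep" implies Zep's approximate per-query footprint.
zep_tokens = memori_tokens / (1 - 0.67)                 # ~3,900 tokens
```

The 5% context footprint and the "over 20x" savings claim are two views of the same ratio: 1 / 0.05 = 20.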

Reasoning Categories

  • Single-Hop (87.87%): Superior performance driven by structured, noise-free context, facilitating precise recall.
  • Temporal (80.37%): Strong results, with room for optimization; summaries partially mitigate the inherent limitations of triples in temporal narrative reconstruction.
  • Multi-Hop (72.70%): Robust evidence chaining enabled by linked triples and summaries.
  • Open-Domain (63.54%): Remains challenging due to lack of explicit retrieval anchors, indicating limitations of atomic triple compression for broad synthesis requirements.

Cost Efficiency

Operational efficiency is rigorously measured. Memori’s token footprint is an order of magnitude smaller than full-context deployments, and API cost per query falls in direct proportion to the token savings, with practical implications for enterprise-scale agents. The structured memory design prevents unbounded context growth, minimizes hallucination risk, and improves inference stability.

Implications and Future Directions

Memori establishes that structured memory architectures, rather than enlarged context windows, are key to high-fidelity, scalable LLM agents. The system’s LLM-agnostic integration and efficient memory representation support practical, multi-session deployment. The compact, high-quality context provided by semantic triples and summaries implies a new standard for persistent agent memory, paving the way for:

  • Optimized memory extraction pipelines for improved temporal reasoning and open-domain synthesis.
  • Research on automated methods for dynamically refining memory structure and retrieval strategies.
  • Enhanced LLM-agent interoperability independent of model vendors, fueling persistent, adaptive agent platforms.

The results challenge assumptions in RAG and traditional memory injection, emphasizing structured retrieval as both a theoretical and practical advance in conversational agent design.

Conclusion

Memori delivers a persistent memory layer that achieves state-of-the-art accuracy among retrieval-based systems with minimal token footprint, decoupling performance from context size. The Advanced Augmentation pipeline’s structured memory assets support exact recall and coherent reasoning, enabling cost-effective, scalable LLM agent deployment. This architectural approach redefines persistent memory as a structuring, not storage, problem—eliminating trade-offs between reasoning quality and cost. Future work should address further memory structuring improvements, promoting robust, context-aware agents for complex, multi-session environments.


Explain it Like I'm 14

What this paper is about (in plain words)

This paper introduces Memori, a “memory layer” for AI chatbots and assistants. Instead of stuffing the whole chat history into an AI’s prompt every time (which is slow, expensive, and confusing), Memori turns past conversations into neat, compact notes the AI can quickly look up. The goal: help AIs remember important things across many chats, stay accurate, and keep costs low.

The main questions the authors asked

  • Can we store conversation history in a smarter, more structured way so the AI remembers what matters without rereading everything?
  • Will this “structured memory” work with any LLM (not just one brand)?
  • Can we keep answers accurate while using far fewer “tokens” (the AI’s reading budget that costs money)?
  • How does this approach compare to other memory systems on a tough, long-conversation test?

How Memori works (with simple analogies)

Think of an AI assistant as a student. If you give the student the whole textbook (every past message) before every question, they’ll waste time and might miss the key points. Memori helps by making:

  • Tiny fact cards
  • Short chapter summaries

Here’s the pipeline in everyday terms:

  • Semantic triples = fact cards
    • “Ava — prefers — vegetarian pizza”
    • “Project — deadline — June 15”
    • Each card links back to the original chat where it came from. That way, the AI can find the source if needed.
  • Conversation summaries = chapter summaries
    • These give a quick story of what happened in a session: the user’s goal, how it changed, and why decisions were made.

By linking fact cards to chapter summaries, the AI gets both precise details and the bigger story—without wading through pages of chat logs.
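The fact-card idea above can be shown with plain dictionaries. Everything here (the card contents, the `lookup` helper, the chat IDs) is made up purely to illustrate the linking, not taken from the paper's implementation.

```python
# Chapter summaries: one short story per chat session.
summaries = {
    "chat-42": "Planned a pizza night; Ava asked for vegetarian options.",
}

# Fact cards: each card stores one fact plus a pointer to its summary.
fact_cards = [
    {"fact": ("Ava", "prefers", "vegetarian pizza"), "from_chat": "chat-42"},
    {"fact": ("Project", "deadline", "June 15"), "from_chat": "chat-42"},
]

def lookup(keyword):
    # Find cards mentioning the keyword, then pull in each card's
    # chapter summary so the AI gets the fact AND the bigger story.
    hits = [c for c in fact_cards
            if keyword.lower() in " ".join(c["fact"]).lower()]
    return [(c["fact"], summaries[c["from_chat"]]) for c in hits]
```

Asking `lookup("pizza")` returns only the matching fact card and its summary: a few dozen tokens instead of the whole chat log.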

How the system searches and answers:

  • It stores the cards and summaries in a fast index (like a super-organized filing cabinet).
  • When a question comes in, Memori searches by meaning and by keywords to fetch the most relevant cards and summaries.
  • It then gives only those to the AI model to answer the question.
  • Another AI model checks whether the answer is correct (this is called “LLM-as-a-judge”: one model grades another’s answer, like a second reviewer double-checking the first one’s work).

What are “tokens”? Tokens are small chunks of text (like word pieces). More tokens = more cost and more chance the AI gets overwhelmed. Memori’s goal is to use far fewer tokens by sending only the most important info.

What they tested

They used a benchmark called LoCoMo (Long Conversation Memory), which is designed to test if an AI can remember details across long, messy, multi-session chats. It includes:

  • Simple fact questions (Single-Hop)
  • Questions needing multiple pieces of info (Multi-Hop)
  • Time-based questions (Temporal)
  • Open-ended questions (Open-Domain)

They ran Memori on this benchmark and compared it to other memory systems. They also measured how many tokens each system adds to the prompt and how much that costs.

Main results (and why they matter)

  • Overall accuracy: 81.95%
    Memori beat other memory systems like Zep (79.09%), LangMem (78.05%), and Mem0 (62.47%). The only thing higher was giving the AI the entire conversation history (“Full-Context”), which scored 87.52% but is very expensive and impractical.
  • Strong at direct facts and multi-step questions
    • Single-Hop: 87.87% (great at pulling exact facts)
    • Multi-Hop: 72.70% (good at connecting pieces)
    • Temporal: 80.37% (solid, but there’s room to improve tracking changes over time)
    • Open-Domain: 63.54% (hard for everyone because it’s broad and less tied to specific facts)
  • Huge token and cost savings
    • Memori used about 1,294 tokens per query—only about 5% of the “Full-Context” method.
    • That’s more than 20× cheaper than passing the entire chat history.
    • Compared to another memory system (Zep), Memori used about 67% fewer tokens while being more accurate.

Why it matters: Using fewer tokens lowers costs and reduces confusion. By sending only relevant facts and summaries, Memori helps the AI stay focused and consistent, even across long-term, multi-session use.

What this means going forward

  • Smarter structure beats bigger context windows
    The paper shows that organizing memory into small fact cards plus short summaries can deliver high accuracy without stuffing prompts full of text. That makes AI assistants faster, cheaper, and more reliable over time.
  • Works with many AI models
    Memori is “LLM-agnostic,” meaning it’s built to work with different AI providers. That avoids vendor lock-in and makes it easier to deploy in real apps.
  • Practical for real-world assistants
    For personal assistants, customer support, or research helpers, Memori can remember user preferences and past decisions across many sessions—without breaking the bank.
  • Future improvements
    The team notes that time-based reasoning (tracking changes over time) can improve. Better modeling of timelines could push performance even higher.

In short: Memori shows that giving AIs a clean, well-organized notebook—rather than the whole messy diary—helps them remember better, answer accurately, and keep costs low.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The paper leaves the following issues unresolved or insufficiently explored:

  • External validity is limited to a single benchmark (LoCoMo), with the adversarial category excluded; generalization to real-world, long-horizon, multi-session deployments remains untested.
  • The “LLM-agnostic” claim is not empirically supported: evaluation uses a single generator (GPT‑4.1‑mini) and a single embedding model (Gemma‑300); cross‑LLM, cross‑vendor, and cross‑embedding robustness are unknown.
  • Using GPT‑4.1‑mini as both the answer generator and the LLM judge risks bias and overestimation; no human evaluation or cross‑model judging is reported.
  • Baseline comparability is weakened by importing results from other papers; controlled, head‑to‑head re‑runs with shared prompts, models, and infrastructure are missing.
  • No ablation studies quantify the contribution of each component (semantic triples vs summaries vs hybrid retrieval vs BM25 only vs embeddings only) to accuracy and token savings.
  • Temporal reasoning underperforms: there is no concrete mechanism for time‑aware indexing (e.g., event graphs, temporal predicates, decay functions) or experiments that isolate where temporal failures occur.
  • Open‑domain questions remain a weakness; criteria and policies for selectively retrieving longer raw text when semantic compression harms performance are unspecified.
  • The triple extraction process lacks error analysis: sensitivity to negation, uncertainty, hypotheticals, sarcasm, and speaker attribution (who said what) is unquantified.
  • The schema/ontology for triples (predicate set, normalization, disambiguation rules) is unspecified; how the schema evolves across domains and prevents predicate explosion is unclear.
  • Conflict resolution is ad hoc (e.g., “prefer most recent” in prompts); formal policies for contradiction detection, reconciliation, and canonicalization across sessions are missing.
  • Memory lifecycle management is undefined: strategies for forgetting, deduplication, pruning, and long‑term drift prevention are not described or evaluated.
  • Scalability at production scale is untested: retrieval latency, throughput, and index size growth across millions of triples and multi‑tenant settings are not measured.
  • Cost accounting excludes the ongoing compute cost of Advanced Augmentation (LLM calls for triple/summarization), storage costs, and indexing overhead; total cost of ownership is unknown.
  • Token efficiency is measured only for context tokens; output tokens, augmentation tokens, and vendor tokenization differences are not included, potentially skewing cost claims.
  • Security considerations are absent: risks of prompt‑injection/memory‑poisoning via stored memories and mitigation (sanitization, trust scores, provenance checks) are not addressed.
  • Privacy and compliance gaps: handling of PII in extracted triples/summaries, consent, data minimization, encryption, retention/deletion (GDPR/CCPA), and auditability are unspecified.
  • Identity and coreference resolution across sessions/users (e.g., multiple people, shared devices) is not discussed; mechanisms to prevent cross‑user memory leakage are unclear.
  • Multilingual robustness is not evaluated; it is unknown whether triple extraction, summarization, and retrieval maintain quality across languages and code‑switching.
  • Multimodal inputs (images, audio, tool outputs) are not handled; how to represent and retrieve non‑textual memory with triples/summaries remains open.
  • Integration with external knowledge (RAG over documents) is not explored; policies for reconciling conversation memory with external KB facts are missing.
  • Retrieval stack design space is underexplored: no tests with cross‑encoder re‑ranking, hybrid graph+vector retrieval, or query‑dependent top‑k/threshold tuning.
  • Knowledge graph operations over triples (e.g., path queries, reasoning, constraints) are not leveraged; storing triples only in a vector index may limit reasoning fidelity.
  • Robustness to noisy or adversarial inputs (e.g., contradictory statements, user trolling) and long‑horizon “context rot” is not stress‑tested beyond LoCoMo.
  • Statistical rigor is limited: aside from n=3 variance for Memori, there are no significance tests, confidence intervals, or per‑category error breakdowns for failure analysis.
  • Reproducibility details (random seeds, exact hyperparameters, retrieval k/thresholds, index settings, and full prompts) are partially provided but not sufficient to ensure faithful replication.
  • Human-in-the-loop mechanisms are absent: APIs for inspecting, editing, or deleting specific memories and for developer/operator overrides are not described.
  • Cold‑start behavior and learning curves (performance vs. amount of interaction history) are not measured; how quickly useful memory accumulates is unknown.
  • Deployment considerations (latency budgets, concurrency, SDK overhead, fault tolerance) are not quantified, leaving operational feasibility uncertain.

Practical Applications

Immediate Applications

Below are practical use cases that can be deployed now using the paper’s LLM‑agnostic persistent memory layer and Advanced Augmentation (semantic triples + conversation summaries). Each item notes likely sectors, potential tools/products/workflows, and feasibility assumptions/dependencies.

  • Cost-optimized customer support and CRM copilots — sectors: software, retail/e-commerce, telecom
    • What: Multi-session support agents that remember past issues, device models, preferences, and resolutions without re-injecting full histories.
    • Tools/workflows: Memori SDK wrapping existing LLM clients; hybrid retrieval (cosine + BM25); FAISS or managed vector DB; memory audit dashboard; ticket linking by timestamps.
    • Why now: 67% fewer tokens than Zep and >20× vs. full-context while retaining strong accuracy (overall 81.95%).
    • Assumptions/dependencies: PII redaction, consent flows, RBAC; memory poisoning defenses; latency budget for augmentation pipeline.
  • Account-based sales assistants — sectors: sales, marketing, B2B SaaS
    • What: Copilots that recall account history, decision makers, objections, contract terms, and renewal risk across meetings and emails.
    • Tools/workflows: Connectors to CRM (Salesforce, HubSpot), calendar, email; meeting-to-triple extraction; opportunity timeline summaries; cost dashboard for per‑account savings.
    • Assumptions/dependencies: Accurate timestamp alignment; compliance with email/calendar data policies; user opt-in.
  • Meeting, inbox, and workspace memory for productivity — sectors: software/productivity, enterprise IT
    • What: Assistants that summarize threads and persist action items, commitments, and follow-ups across Slack/Teams/Gmail/Docs.
    • Tools/workflows: Channel-level conversation summaries; atomic triples for tasks/owners/dates; hybrid search for retrieval into replies; governance console.
    • Assumptions/dependencies: API access scopes; organization data retention policies; per‑workspace rate limits and throughput for augmentation.
  • Developer copilots with project memory — sectors: software engineering, DevOps
    • What: Coding assistants that remember design decisions, coding conventions, module responsibilities, and incident postmortems across sprints.
    • Tools/workflows: Git/PR/issue tracker connectors; triple extraction for APIs/owners/deprecations; build an indexed “project memory” store; CI bots retrieving precise context.
    • Assumptions/dependencies: Repository privacy and on-prem deployment options; domain adaptation to code entities; guardrails to avoid leaking secrets.
  • Healthcare intake and care-coordination chat — sectors: healthcare
    • What: Agents that recall longitudinal patient preferences, allergies, care plans, and social determinants across encounters.
    • Tools/workflows: EHR and portal integration; memory provenance via timestamps; concise context injection to reduce cost for high-volume triage.
    • Assumptions/dependencies: HIPAA/GDPR compliance, auditability, clinician-in-the-loop; safety validation; medical terminology adaptation.
  • Personalized education tutors with longitudinal memory — sectors: education/EdTech
    • What: Tutors that track misconceptions, mastery milestones, learning goals, and motivation across semesters.
    • Tools/workflows: LMS connectors; per-student triple store (concept → status → evidence); lesson plan summaries; targeted retrieval into explanations and quizzes.
    • Assumptions/dependencies: Parental consent and FERPA compliance; bias monitoring; content moderation.
  • Financial advisor and support copilots — sectors: finance, fintech, insurance
    • What: Agents that recall KYC facts, product holdings, risk tolerance, and recent service events to personalize guidance.
    • Tools/workflows: CRM + policy engines; timestamped audit trails; structured retrieval to minimize context and cost in regulated settings.
    • Assumptions/dependencies: Compliance review (e.g., FINRA/SEC), strict hallucination control, explainability of retrieved facts.
  • Legal case and contract memory — sectors: legal, compliance
    • What: Assistants that persist matter timelines, clause preferences, and precedent references; enable quick retrieval with provenance.
    • Tools/workflows: DMS connectors; triple extraction for parties/obligations/dates; matter-scoped indices; eDiscovery-ready audit logs.
    • Assumptions/dependencies: Privilege and confidentiality; legal hold integration; clear deletion/retention policies.
  • Internal knowledge assistants with controlled context budgets — sectors: enterprise IT, shared services
    • What: Replace full-history prompts with structured memories to keep inference costs predictable while improving recall quality.
    • Tools/workflows: Memori as a drop-in memory layer for existing RAG assistants; cost/accuracy dashboards; memory quality evaluations (LLM-as-a-Judge packaged).
    • Assumptions/dependencies: Throughput and scaling of augmentation jobs; heterogeneous data connectors; change management.
  • Government/civic digital service bots — sectors: public sector, policy
    • What: Citizen-facing agents that persist case details (benefits, forms submitted, appointments) across sessions with audit trails.
    • Tools/workflows: Right-to-be-forgotten APIs; FOIA-compliant logging with timestamps; cost-limited operations at scale via structured retrieval.
    • Assumptions/dependencies: Strong privacy and consent, accessibility requirements, multilingual support.
  • Consumer “life OS” assistants — sectors: consumer apps, smart home
    • What: Personal AI that remembers preferences, routines, shopping lists, travel constraints, and family schedules across apps.
    • Tools/workflows: Cross-app connectors; “personal memory vault” using triples + summaries; export/import for provider portability.
    • Assumptions/dependencies: On-device or encrypted storage options; granular consent; safety features for minors.

Long-Term Applications

The following applications are high-impact but require further research, scaling, or ecosystem development (e.g., temporal reasoning advances, multimodal extraction, standards, or regulation).

  • Standardized portable AI memory vaults — sectors: consumer, policy, cloud platforms
    • What: Open schemas and APIs for exporting/importing personal memory (triples + summaries) across assistants to prevent vendor lock‑in.
    • Tools/workflows: Memory portability protocols; consent and provenance layers; “forget/retain” policy engines.
    • Dependencies: Industry standards, regulatory guidance on data portability and deletion.
  • Organization-wide “memory bus” and inter-agent knowledge graph — sectors: enterprise software, security
    • What: A shared, permissioned memory graph powering many agents (support, HR, finance) with per‑resource access controls and provenance.
    • Tools/workflows: Multi-tenant graph stores; zero-trust memory APIs; lineage tracking and anomaly detection.
    • Dependencies: Fine-grained governance, performance isolation, defenses against memory poisoning.
  • Robust temporal reasoning modules — sectors: academia, software, legal, healthcare
    • What: Event calculus/temporal KGs to model state changes, recency, and contradictions more precisely than summaries alone.
    • Tools/workflows: Temporal constraints solvers; decay/update operators; time-aware retrieval scoring.
    • Dependencies: Research on temporal inference benchmarks and methods; integration with existing triple stores.
  • Multimodal memory extraction (text + voice + images + video) — sectors: healthcare, retail, robotics, field service
    • What: Convert multimodal logs (e.g., call recordings, inspection photos) into triples and timeline summaries.
    • Tools/workflows: ASR, OCR, VLMs feeding the augmentation pipeline; multimodal embeddings; provenance across modalities.
    • Dependencies: Model accuracy for domain jargon; compute and latency budgets; privacy for audio/video.
  • On-device/private memory for edge agents — sectors: mobile, IoT, robotics
    • What: Local memory layer (quantized embeddings + FAISS) for low-latency, private assistants and home/industrial robots.
    • Tools/workflows: Incremental sync to cloud; encrypted at-rest/on-device; energy-aware augmentation.
    • Dependencies: Efficient on-device models, storage constraints, reliable sync/conflict resolution.
  • Secure memory sanitization and provenance — sectors: security, compliance
    • What: Tooling to detect, quarantine, and roll back poisoned or adversarially inserted memories, with signed entries and WORM logs.
    • Tools/workflows: Content anomaly detectors; cryptographic signing/provenance; memory diffing and rollback.
    • Dependencies: Threat modeling, standards for signing/verifying memory artifacts.
  • Autonomous research assistants with evolving memory — sectors: pharma, materials, academia
    • What: Agents that track hypotheses, methods, datasets, and results over months, enabling cumulative scientific workflows.
    • Tools/workflows: Literature-to-triple pipelines; hypothesis and evidence graphs; automated lab notebook integration.
    • Dependencies: High precision extraction; attribution/copyright; domain adaptation.
  • Healthcare longitudinal digital twins — sectors: healthcare/biomedical
    • What: Patient/stateful digital twins driven by structured clinical memories for care-path optimization and personalized interventions.
    • Tools/workflows: Interop with FHIR; clinician-facing provenance views; safety-graded retrieval.
    • Dependencies: Clinical validation, regulatory approvals, bias/safety audits.
  • Mastery graphs and learning analytics at scale — sectors: education
    • What: District-wide student knowledge graphs that guide adaptive curricula and targeted interventions across years.
    • Tools/workflows: Cross-institution data sharing with consent; teacher dashboards; outcome-driven memory pruning.
    • Dependencies: Longitudinal efficacy studies; privacy statutes; equitable access.
  • Compliance engines with policy-aware retention/forgetting — sectors: finance, government, legal
    • What: Automated enforcement of retention schedules and “right to be forgotten” across shared memory stores, with audit-ready traces.
    • Tools/workflows: Policy DSLs; automated erasure/reindex; regulator-facing evidence packages.
    • Dependencies: Legal acceptance, verifiable deletion semantics, interop across vendors.
  • Enterprise knowledge graph auto-construction from unstructured comms — sectors: software/enterprise data
    • What: Continuous ETL from chats/emails/tickets into a curated knowledge graph for analytics, search, and reasoning.
    • Tools/workflows: Augmentation → KG pipelines (e.g., Neo4j); entity resolution; change-data capture.
    • Dependencies: Data quality, privacy filtering, human-in-the-loop curation.
  • Memory asset marketplaces and federated querying — sectors: data platforms, marketplaces
    • What: Privacy-preserving exchanges of domain memory (e.g., anonymized support patterns) enabling cross-org benchmarking.
    • Tools/workflows: Federated queries; differential privacy; licensing and usage enforcement.
    • Dependencies: Standards, incentives, privacy tech maturity.

Cross-cutting assumptions and dependencies

  • Extraction quality: Accurate semantic triple and summary generation underpins retrieval performance; domain adaptation may be needed.
  • Embeddings and retrieval: Choice and tuning of embedding models, hybrid search, and index scaling affect accuracy and latency.
  • Privacy/compliance: PII redaction, consent management, encryption, RBAC, and auditability are prerequisites in regulated sectors.
  • Safety and robustness: Guarding against hallucinations, prompt injection, and memory poisoning is essential for production.
  • Temporal reasoning: Current gaps (e.g., open-domain and complex temporal questions) may require specialized modules and schemas.
  • Operational economics: Augmentation compute costs must remain below the prompt-token savings; throughput/latency SLOs need monitoring.
  • Vendor neutrality: LLM-agnostic design enables portability but requires standardized memory schemas and migration tooling.

Glossary

  • Adversarial category: An evaluation subset designed to challenge models with intentionally difficult or misleading cases. "we excluded the adversarial category from the evaluation"
  • Advanced Augmentation: A pipeline that distills raw dialogue into structured, retrievable memory assets. "Advanced Augmentation functions as an automated cognitive filter."
  • BM25 keyword matching: A classical information retrieval scoring function used to rank documents by keyword relevance. "combines cosine similarity over embeddings with BM25 keyword matching."
  • Cognitive filter: A mechanism that selectively extracts high-signal information from noisy input. "Advanced Augmentation functions as an automated cognitive filter."
  • Compression layer: A component that reduces verbose information into compact representations for efficient storage and retrieval. "Second, it functions as a compression layer."
  • Conversation Summaries: Concise overviews of conversational threads that capture intent, chronology, and context. "the pipeline simultaneously generates Conversation Summaries."
  • Context degradation: The phenomenon where large prompts reduce a model’s ability to use relevant information effectively over time. "context degradation over time."
  • Context footprint: The proportion of the full conversation added to the prompt during retrieval. "Context Footprint (\%)"
  • Context rot: A failure mode where relevant information exists in the prompt but is not effectively used. "suffer from what is commonly referred to as context rot, in which relevant information is present but not effectively used"
  • Context window: The maximum number of tokens an LLM can consider at once. "leads to rapidly growing context windows."
  • Cosine similarity: A metric that measures similarity between embedding vectors based on the angle between them. "combines cosine similarity over embeddings with BM25 keyword matching."
  • FAISS: A library for efficient similarity search and clustering of dense vectors. "All generated memories were indexed and stored locally using FAISS to support fast similarity search."
  • Foundation-model-powered systems: Applications built on large, general-purpose pretrained models. "These foundation-model-powered systems perform well in research, software engineering, and scientific discovery"
  • Full-Context (Ceiling): A benchmark condition where the entire conversation history is provided to the model. "Full-Context (Ceiling)"
  • Gemma-300 embedding model: A specific embedding model used to convert text into semantic vectors. "embedded using the Gemma-300 embedding model"
  • Hybrid search: A retrieval approach that combines multiple matching strategies (e.g., semantic and lexical). "Triples were retrieved using a hybrid search approach"
  • Knowledge base: An organized repository of structured information for retrieval and reasoning. "shifting the system's memory from mere text storage to an organized knowledge base."
  • LLM-as-a-Judge: An evaluation method where an LLM grades the correctness of another model’s outputs. "We employ an LLM-as-a-Judge methodology"
  • LoCoMo benchmark: A dataset designed to test long-term conversational memory and reasoning. "Evaluated on the LoCoMo benchmark, Memori achieves 81.95\% accuracy"
  • Lost in the middle: A retrieval failure where important information in the middle of long context is overlooked. "risk of \"lost in the middle\" hallucinations."
  • Memory creation pipeline: An automated sequence that extracts, compresses, and organizes conversation data into memories. "It is a background memory creation pipeline"
  • Multi-Hop Reasoning: Answering questions that require chaining multiple facts across different parts of memory. "Multi-Hop Reasoning (72.70\%): Memori performs strongly"
  • Open-Domain Reasoning: Handling broad, open-ended questions that require synthesis across diverse information. "Open-Domain Reasoning (63.54\%): This category remains challenging"
  • Persistent memory: Long-term storage enabling agents to retain information across sessions and models. "persistent memory at the API layer is essential"
  • Retrieval-Augmented Generation (RAG): Systems that enhance generation by fetching relevant context from external sources. "as is standard in traditional RAG architectures"
  • Semantic triples: Structured facts represented as subject–predicate–object tuples for precise retrieval. "structuring them into semantic triples (subject–predicate–object)."
  • Similarity search: Finding vectors or items in a database that are most similar to a query vector. "to support fast similarity search."
  • Single-Hop Reasoning: Answering questions with a direct fact lookup without chaining multiple pieces of evidence. "Single-Hop Reasoning (87.87\%): Memori excels in direct fact retrieval"
  • Subject–predicate–object: The three-part structure of a semantic triple encoding factual relationships. "semantic triples (subject–predicate–object)."
  • Temporal Reasoning: Understanding and reasoning about time-dependent information and changes over sessions. "Temporal Reasoning (80.37\%): Memori outperforms Mem0"
  • Token footprint: The number of tokens added to a prompt to ground a response. "This token footprint represents just 4.97\% of the full conversational context"
  • Vector search: Retrieving items based on the proximity of their embedding vectors. "improves vector search retrieval accuracy."
  • Vector space: The high-dimensional space in which embeddings represent text semantics. "the resulting vector space becomes heavily cluttered."

