
SimpleMem: Efficient Lifelong Memory System

Updated 10 January 2026
  • SimpleMem is an efficient lifelong memory architecture that minimizes token bloat and redundancy for LLM agents in long-horizon tasks.
  • It features a three-stage pipeline—semantic structured compression (with multi-view indexing), recursive consolidation, and adaptive query-aware retrieval—to streamline context processing.
  • Adaptive, query-aware retrieval dynamically adjusts context assembly, achieving significant token compression and enhanced retrieval precision.

SimpleMem is an efficient lifelong memory architecture for LLM agents operating in complex, long-horizon environments. It is designed to maximize information density and minimize redundant token usage in memory storage and retrieval, directly addressing inefficiencies inherent in both passive long-context extension and costly active iterative reasoning. SimpleMem uses a structured, three-stage pipeline—semantic lossless compression with multi-view atomic indexing, recursive consolidation into higher-level abstractions, and adaptive, query-aware retrieval—attaining a favorable balance of retrieval precision, compression, and scaling (Liu et al., 5 Jan 2026).

1. Architectural Overview and Motivation

SimpleMem targets the problem that LLM agents incur substantial inefficiency and redundancy when tasked with persistent memory: storing entire interaction histories leads to significant token bloat, while aggressive iterative reasoning to cull irrelevant context increases compute cost. The architecture operates as a loop with three principal stages—semantic structured compression, recursive memory consolidation, and adaptive query-aware retrieval—in which raw multi-turn dialogues are transformed into compact, indexed memory units, abstracted into higher-level representations, and then dynamically retrieved in response to task queries. Each cycle closes with new experience ingestion, completing the loop: ingestion → indexing → consolidation → retrieval → ingestion (Liu et al., 5 Jan 2026).

2. Semantic Structured Compression

The first stage reduces raw interaction transcripts to minimal, self-contained "memory units" via entropy-aware filtering, coreference/temporal normalization, and atomistic segmentation.

Entropy-Aware Filtering

Incoming dialogue is divided into overlapping windows $W_t$ (default: $W = 10$ turns, stride $= 5$). For each window, an entropy-based gate quantifies the utility of the content:
$$H(W_t) = \alpha\,\frac{|\mathcal{E}_{\mathrm{new}}|}{|W_t|} + (1-\alpha)\bigl[1 - \cos\bigl(E(W_t),\, E(H_{\mathrm{prev}})\bigr)\bigr],$$
where $\mathcal{E}_{\mathrm{new}}$ is the set of newly mentioned entities, $E(\cdot)$ denotes dense embeddings, and $\alpha$ tunes the balance between entity-level novelty and semantic divergence. Windows with $H(W_t) < \tau_{\mathrm{redundant}}$ (default $\tau = 0.35$) are dropped, minimizing the accumulation of redundant or low-salience content.
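As a concrete illustration, the gate can be sketched in a few lines of Python; `embed` and `extract_entities` here are assumed helper callables standing in for an embedding model and an entity extractor, not part of any published API:

```python
import numpy as np

def window_entropy(window_turns, prev_history_text, known_entities,
                   embed, extract_entities, alpha=0.5):
    """H(W_t): blend of entity novelty and semantic divergence.

    embed: text -> dense np.ndarray (assumed helper)
    extract_entities: text -> set of entity strings (assumed helper)
    """
    text = " ".join(window_turns)
    new_entities = extract_entities(text) - known_entities      # E_new
    novelty = len(new_entities) / max(len(window_turns), 1)     # |E_new| / |W_t|

    v_win, v_prev = embed(text), embed(prev_history_text)
    cos = float(v_win @ v_prev /
                (np.linalg.norm(v_win) * np.linalg.norm(v_prev) + 1e-8))

    return alpha * novelty + (1 - alpha) * (1.0 - cos)

def keep_window(h_score, tau_redundant=0.35):
    # Windows scoring below the redundancy threshold are dropped.
    return h_score >= tau_redundant
```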

Memory Unit Segmentation and Normalization

The retained windows are processed by a neural prompt/model function $\mathcal{F}_\theta$:
$$m_k = \mathcal{F}_\theta(W_t) = \Phi_{\mathrm{time}}\bigl(\Phi_{\mathrm{coref}}\bigl(\Phi_{\mathrm{extract}}(W_t)\bigr)\bigr),$$
where candidate facts are extracted, coreferences resolved, and temporal expressions normalized. The resulting $m_k$ are context-independent, minimal units annotated with explicit timestamps, entity lists, and optional salience/topic tags.
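One plausible realization of $\mathcal{F}_\theta$ is a composition of LLM-prompted passes; the sketch below only illustrates the composition order, with `llm` standing in for any completion function and the prompts being illustrative rather than the paper's:

```python
def make_memory_units(window_text, llm, reference_date):
    """m_k = Φ_time(Φ_coref(Φ_extract(W_t))) as three prompted passes."""
    # Φ_extract: pull self-contained candidate facts out of the raw window.
    facts = llm(f"List each self-contained fact in this dialogue:\n{window_text}")
    # Φ_coref: replace pronouns and aliases with canonical entity names.
    facts = llm(f"Rewrite these facts with all references resolved:\n{facts}")
    # Φ_time: normalize relative expressions ("last Tuesday") to absolute dates.
    return llm(f"Normalize time expressions relative to {reference_date}:\n{facts}")
```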

Multi-View Indexing

Each $m_k$ is indexed in a tri-view structure comprising:

  • Semantic Layer: $\mathbf{v}_k = E_{\mathrm{dense}}(m_k)$
  • Lexical Layer: $\mathbf{h}_k = \mathrm{BM25}(m_k)$
  • Symbolic Layer: $\mathcal{R}_k = \{\text{key}:\text{val}\}$ (metadata: timestamps, entities)

This enables hybrid retrieval using both fuzzy semantic similarity and exact symbolic or lexical filtering.
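A tri-view record might look like the following hypothetical dataclass; the field names are illustrative, and the token `Counter` is a stand-in for a real sparse BM25 index entry:

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class MemoryUnit:
    text: str          # normalized, self-contained fact
    timestamp: str     # explicit time annotation
    entities: list     # entity list for symbolic filtering
    embedding: list = field(default_factory=list)      # semantic view v_k
    tokens: Counter = field(default_factory=Counter)   # lexical view h_k

    @property
    def metadata(self):
        # Symbolic view R_k: key/value pairs for exact filters.
        return {"timestamp": self.timestamp, "entities": self.entities}
```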

3. Recursive Memory Consolidation

To further compress and abstract the active memory store, SimpleMem periodically clusters and merges memory units using semantic and temporal affinity.

Affinity-Based Clustering

Pairwise affinity between units $m_i$ and $m_j$ is computed as
$$\omega_{ij} = \beta\,\cos(\mathbf{v}_i, \mathbf{v}_j) + (1-\beta)\,\exp\bigl[-\lambda\,|t_i - t_j|\bigr],$$
with $\beta$ (typically $0.7$) controlling the weight of semantic vs. temporal proximity and $\lambda$ (e.g., $0.1$) setting the temporal decay rate. Units whose affinity exceeds the clustering threshold $\tau_{\mathrm{cluster}}$ (default $0.85$) form a cluster $\mathcal{C}$.
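In code, the affinity and a simple grouping pass might look as follows; the greedy single-pass grouping policy is an assumption for illustration, not the paper's stated algorithm:

```python
import numpy as np

def affinity(v_i, v_j, t_i, t_j, beta=0.7, lam=0.1):
    # ω_ij = β·cos(v_i, v_j) + (1 − β)·exp(−λ|t_i − t_j|)
    cos = float(v_i @ v_j / (np.linalg.norm(v_i) * np.linalg.norm(v_j) + 1e-8))
    return beta * cos + (1 - beta) * float(np.exp(-lam * abs(t_i - t_j)))

def cluster_units(units, tau_cluster=0.85):
    """Group (embedding, timestamp) pairs whose mutual affinity clears τ."""
    clusters = []
    for v, t in units:
        for c in clusters:
            if all(affinity(v, cv, t, ct) >= tau_cluster for cv, ct in c):
                c.append((v, t))
                break
        else:
            clusters.append([(v, t)])
    return clusters
```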

Abstract Representation Synthesis

Each cluster $\mathcal{C}$ is synthesized via $\mathcal{G}_{\mathrm{syn}}$ into a single abstract memory $M_{\mathrm{abs}}$:
$$M_{\mathrm{abs}} = \mathcal{G}_{\mathrm{syn}}(\{m_i : i \in \mathcal{C}\}),$$
with the original $m_i$ archived into cold storage. The abstract replaces granular units in the active index, maintaining a compact memory footprint while preserving the ability to recover details as needed. This compression is linear in the number of abstractions and mitigates both redundancy and context inflation.
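A minimal sketch of the synthesis step, assuming a generic `llm` completion function and an illustrative merge prompt:

```python
def consolidate(cluster_texts, llm, cold_storage):
    """G_syn: merge granular memory texts into one abstract unit M_abs."""
    joined = "\n".join(cluster_texts)
    abstract = llm(
        "Merge these related memories into one concise, self-contained fact:\n"
        + joined
    )
    cold_storage.extend(cluster_texts)  # originals remain recoverable
    return abstract                     # replaces the cluster in the active index
```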

4. Adaptive Query-Aware Retrieval

At inference time, SimpleMem employs a hybrid, query-sensitive retrieval mechanism to construct the context for LLM prompting.

Hybrid Relevance Scoring

Given a query $q$, each candidate $m_k$ is scored as
$$\mathcal{S}(q, m_k) = \lambda_1 \cos(\mathbf{e}_q, \mathbf{v}_k) + \lambda_2\,\mathrm{BM25}(q_{\mathrm{lex}}, m_k) + \gamma\,\mathbb{I}(\mathcal{R}_k \models \mathcal{C}_{\mathrm{meta}}),$$
where $\mathbf{e}_q$ is the query embedding and the indicator $\mathbb{I}$ enforces strict symbolic filters (e.g., timestamps in range).
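A sketch of the scorer; the weight values `lam1`, `lam2`, `gamma` are illustrative placeholders, since the paper's settings for $\lambda_1$, $\lambda_2$, $\gamma$ are not reproduced here:

```python
import numpy as np

def hybrid_score(q_emb, v_k, lexical_score, meta_ok,
                 lam1=0.6, lam2=0.3, gamma=0.1):
    """S(q, m_k): dense similarity + lexical score + symbolic indicator.

    lexical_score: precomputed BM25(q_lex, m_k).
    meta_ok: whether the unit's metadata R_k satisfies the filters C_meta.
    """
    dense = float(q_emb @ v_k /
                  (np.linalg.norm(q_emb) * np.linalg.norm(v_k) + 1e-8))
    return lam1 * dense + lam2 * lexical_score + gamma * float(meta_ok)
```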

Query Complexity and Dynamic Retrieval Depth

A small classifier predicts query complexity $C_q \in [0,1]$ from query features (length, syntactic complexity, abstraction). The number of candidates retrieved is dynamically adjusted:
$$k_{\mathrm{dyn}} = \bigl\lfloor k_{\mathrm{base}}\,(1 + \delta C_q) \bigr\rfloor,$$
clipped so that $k_{\min} \leq k_{\mathrm{dyn}} \leq k_{\max}$ (typical values: $k_{\min} = 3$, $k_{\max} = 20$, $k_{\mathrm{base}} = 10$, $\delta = 1$). Simple queries retrieve a minimal number of abstracted units; complex queries expand retrieval to a deeper context window.
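The depth rule itself is a one-liner; with the typical values above, a trivial query ($C_q = 0$) retrieves 10 units while a maximally complex one ($C_q = 1$) hits the cap of 20:

```python
import math

def dynamic_k(c_q, k_base=10, delta=1.0, k_min=3, k_max=20):
    # k_dyn = floor(k_base · (1 + δ·C_q)), clipped to [k_min, k_max].
    return max(k_min, min(k_max, math.floor(k_base * (1 + delta * c_q))))
```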

Final Context Assembly

The top-$k_{\mathrm{dyn}}$ ranked units are assembled as
$$\mathcal{C}_{\mathrm{final}} = \bigoplus_{m \in \mathrm{Top}\text{-}k_{\mathrm{dyn}}} [t_m : \mathrm{content}(m)],$$
and prepended to the generation prompt, optimizing context relevance and capacity usage.
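Assembly then reduces to concatenating the ranked units with their timestamp tags; the `[t_m] content` formatting below is an assumed rendering of the $\bigoplus$ operator:

```python
def assemble_context(ranked_units, k_dyn):
    """ranked_units: (timestamp, text) pairs already sorted by S(q, m_k)."""
    return "\n".join(f"[{t}] {text}" for t, text in ranked_units[:k_dyn])
```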

5. High-Level Information Flow and System Diagram

The pipeline integrates three operational stages and background consolidation, as summarized in the following block diagram:

┌──────────────┐          ┌────────────────────────┐
│ Raw Dialogue │─Stage 1─►│ Semantic Structured    │
│ Streams      │          │ Compression (Atomizer) │
└──────────────┘          └────────────────────────┘
                                  │
                                  ▼
                   ┌────────────────────────────┐
                   │ Multi-View Index (Active)  │
                   │  • Dense Embeddings (v_k)  │
                   │  • Lexical Sparse Index    │
                   │  • Symbolic Metadata (R_k) │
                   └────────────────────────────┘
                                  │  (async background)
                                  ▼
                   ┌────────────────────────────┐
                   │ Recursive Consolidation    │
                   │  • Compute ω_ij affinity   │
                   │  • Cluster & Abstract      │
                   └────────────────────────────┘
                                  │
                                  ▼
                   ┌────────────────────────────┐
                   │ Active Memory Store (slim) │
                   └────────────────────────────┘
                                  │
                        Query ───►│
                                  ▼
             ┌─────────────────┐   ┌────────────────┐
             │ Complexity Est. │   │ Hybrid Scoring │
             │ (compute k_dyn) │   │ S(q, m_k)      │
             └─────────────────┘   └────────────────┘
                                  │
                                  ▼
                   ┌────────────────────────────┐
                   │ Adaptive Retrieval (Top-k) │
                   └────────────────────────────┘
                                  │
                                  ▼
                   ┌────────────────────────────┐
                   │ Answer Generator (LLM)     │
                   └────────────────────────────┘

6. Key Hyperparameters and Operational Trade-Offs

SimpleMem's effectiveness is governed by several critical hyperparameters:

| Stage | Parameter | Typical value / function |
|---|---|---|
| 1 | Window size $W$, stride | $W = 10$ turns, stride $= 5$ |
| 1 | Novelty balance $\alpha$ | $0.5$ |
| 1 | Redundancy drop $\tau_{\mathrm{redundant}}$ | $0.35$ |
| 2 | Cluster threshold $\tau_{\mathrm{cluster}}$ | $0.85$ |
| 2 | Semantic-vs-temporal $\beta$ | $0.7$ |
| 2 | Temporal decay $\lambda$ | $0.1$ |
| 3 | Base retrieval $k_{\mathrm{base}}$ | $10$ |
| 3 | Retrieval bounds $[k_{\min}, k_{\max}]$ | $[3, 20]$ |
| 3 | Retrieval scale $\delta$ | $1.0$ |

These parameters instantiate explicit trade-offs: a low $\tau_{\mathrm{redundant}}$ filters little and risks retaining noise, while a high threshold can drop marginally relevant content. Similarly, tuning $\tau_{\mathrm{cluster}}$ too low induces over-broad abstraction, whereas excessive strictness fragments memory and loses generalization. Retrieval bounds balance token efficiency against the risk of omitting critical details (Liu et al., 5 Jan 2026).

7. Performance Characteristics and Applications

Experimental benchmarks show that SimpleMem consistently achieves superior accuracy, retrieval efficiency, and reduced inference-time token usage relative to baselines, with an average F1 improvement of 26.4% and up to 30-fold inference token compression (Liu et al., 5 Jan 2026). By unifying high-density compression, multi-faceted retrieval, and dynamic context assembly, SimpleMem is suited to LLM agents that require scalable, lifelong memory with minimal resource overhead, supporting advanced multi-turn reasoning and complex environment interactions. The codebase is available at https://github.com/aiming-lab/SimpleMem.
