
LatentMem Framework

Updated 25 February 2026
  • LatentMem is a framework that condenses historical interactions into compact, continuous representations to efficiently inject memory into LLM-based systems.
  • FlashMem leverages a frozen backbone and cross-attention to extract latent memory, achieving up to 5× inference speedup while maintaining competitive accuracy.
  • In multi-agent settings, role-aware latent memory reduces token usage by 50% and boosts performance by up to 19.36 percentage points, enhancing coordinated reasoning.

LatentMem refers to a series of frameworks for learnable, token-efficient memory in LLM-based systems. It appears in two primary formulations: as a mechanism for intrinsic dynamic memory in single-agent LLM reasoning, and as a generalized, role-aware memory schema for multi-agent LLM systems. Both aim to avoid context window exhaustion, redundant computation, and loss of relevant historical information by condensing past interactions into continuous, fixed-length vectors ("latent memories") that are efficiently injected back into the model’s computation. FlashMem and multi-agent LatentMem frameworks exemplify these approaches and introduce key architectural and algorithmic innovations to maximize both efficiency and adaptability (Hou et al., 9 Jan 2026, Fu et al., 3 Feb 2026).

1. Motivations and Challenges of Latent Memory

LLMs, operating under a stateless paradigm, $\pi_\theta(x_1,\ldots,x_t)$ with static parameters $\theta$, recompute attention from scratch on each step. This design causes quadratic or worse growth in key-value (KV) cache size, repeated processing of identical contexts, and inevitable exhaustion of finite context windows when the full history must be replayed. In multi-agent settings, additional bottlenecks emerge: memory stores are often homogenized across agent roles (leading to correlated errors and poor specialization), and excessively fine-grained or textual memory quickly leads to information overload, obstructing critical context (Hou et al., 9 Jan 2026, Fu et al., 3 Feb 2026).

Latent memory frameworks mitigate these issues by condensing relevant past experiences into compact, continuous representations (typically matrix-valued latent tokens) for downstream reuse. Injecting these memories enables efficient recall of high-utility information without overwhelming context buffers or requiring parameter modifications.

2. Core Architectural Components

FlashMem (Single-Agent)

The FlashMem architecture extracts memory directly from the backbone LLM’s frozen KV cache. The backbone produces the last hidden state $h_t = f_\theta(x_{1:t})$, treated as a sufficient statistic for all preceding history $\tau_{<t}$. Memory consolidation begins by projecting $h_t$ into an initial memory seed $m_0 = \mathrm{MLP}_{\mathrm{proj}}(h_t)$, which is then refined through cross-attention against the backbone’s cached $K \in \mathbb{R}^{t \times d}$ and $V \in \mathbb{R}^{t \times d}$:

$$\mathrm{Attn}(x, K, V) = \mathrm{softmax}\!\left(\frac{x W_Q K^\top}{\sqrt{d}}\right) V$$

No new key or value projections are introduced; the consolidator reuses the backbone’s live cache (“Shared-KV”). A small set ($K \ll t$) of latent memory vectors $M = \{m_1, \ldots, m_K\}$ is autoregressively decoded from this cache-enabled consolidator.
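The consolidation step above can be sketched numerically. This is a minimal illustration, not the paper's implementation: it uses a single attention head, replaces the MLP projection with one linear map, and the function name `consolidate` and all shapes are assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def consolidate(h_t, K, V, W_proj, W_Q, num_latents=4):
    """Project the last hidden state into a memory seed m_0, then refine it by
    cross-attending against the backbone's frozen KV cache (Shared-KV: no new
    key/value projections), decoding a small set of latent memory vectors."""
    d = K.shape[1]
    m = h_t @ W_proj                      # m_0 = MLP_proj(h_t), linear here for brevity
    latents = []
    for _ in range(num_latents):          # autoregressively decode latent vectors
        attn = softmax((m @ W_Q) @ K.T / np.sqrt(d))
        m = attn @ V                      # Attn(m, K, V) over the cached context
        latents.append(m)
    return np.stack(latents)              # M = {m_1, ..., m_K}, with K << t

rng = np.random.default_rng(0)
t, d = 128, 16                            # cached context length and hidden size
h_t = rng.normal(size=d)
K_cache = rng.normal(size=(t, d))
V_cache = rng.normal(size=(t, d))
M = consolidate(h_t, K_cache, V_cache,
                rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(M.shape)  # (4, 16): four latent vectors summarize 128 cached positions
```

The key property the sketch preserves is that the consolidator only reads the existing cache; nothing in the backbone is recomputed or retrained.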

LatentMem (Multi-Agent)

LatentMem for MAS consists of:

  • Experience Bank $\mathcal{B}$: stores raw sequences of agent names, prompts, and outputs, with no summaries or engineered features (Fu et al., 3 Feb 2026).
  • Memory Composer $\mathcal{C}_\phi$: a trainable Transformer that, conditioned on each agent’s role embedding $\gamma$ and retrieved trajectories $\mathcal{T}_q$, synthesizes a compact, agent-specific latent memory $m_j = \sigma_\phi(\gamma_k, \mathcal{T}_q) \in \mathbb{R}^{L' \times D}$.
  • Memory Injection: for each agent, the latent memory $m_j$ is concatenated to its input token embeddings, yielding an augmented hidden state passed to the frozen policy.

Memory Retrieval and Update

To generate context-relevant memories, queries and historical trajectories are embedded and their cosine similarity is computed for retrieval:

$$\mathcal{T}_q = \operatorname{top\text{-}K}_{\tau_i \in \mathcal{B}} \left\{ \cos\!\left( \mathbf{v}(q), \mathbf{v}(\tau_i) \right) \right\}$$

Following each episode, new trajectories are appended to $\mathcal{B}$, supporting continual online adaptation.
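The top-$K$ retrieval step is plain cosine-similarity nearest-neighbour search over the experience bank. A minimal sketch, assuming trajectory embeddings are already computed (the function name `top_k_trajectories` and the toy embeddings are illustrative, not from the paper):

```python
import numpy as np

def top_k_trajectories(q_vec, bank_vecs, k=3):
    """Return indices of the k trajectories whose embeddings are most
    cosine-similar to the query embedding: T_q = top-K cos(v(q), v(tau_i))."""
    q = q_vec / np.linalg.norm(q_vec)
    B = bank_vecs / np.linalg.norm(bank_vecs, axis=1, keepdims=True)
    sims = B @ q                          # cosine similarity per stored trajectory
    return np.argsort(-sims)[:k]          # indices into the experience bank

rng = np.random.default_rng(1)
bank = rng.normal(size=(10, 8))           # 10 stored trajectory embeddings
query = bank[7] + 0.01 * rng.normal(size=8)   # a query close to trajectory 7
idx = top_k_trajectories(query, bank, k=3)
print(idx[0])  # 7: the nearest neighbour is the matching trajectory
```

Online adaptation then amounts to appending each new episode's embedding to `bank` after the rollout finishes.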

3. Formalization and Optimization

Sufficient-Statistic Principle

Both FlashMem and multi-agent LatentMem formalize the final hidden state or the composed memory as a sufficient statistic:

$$P(a_t \mid \tau_{<t}, o_t) \approx P(a_t \mid h_t)$$

where $h_t$ encapsulates all predictive information for future actions. In FlashMem, the following information-theoretic constraint is enforced:

$$\mathrm{KL}\left[ P(a_t \mid \tau_{<t}) \,\|\, P(a_t \mid h_t) \right] \rightarrow 0$$

and, equivalently, $I(h_t; a_t) \approx I(\tau_{<t}; a_t)$ (Hou et al., 9 Jan 2026).
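The KL constraint above can be made concrete with a toy numeric check; the two action distributions below are invented for illustration and are not from either paper:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (nats)."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical action distributions over 4 actions: conditioning on the
# compressed state h_t should barely change the policy relative to
# conditioning on the full history tau_<t.
p_full = np.array([0.70, 0.15, 0.10, 0.05])    # P(a_t | tau_<t)
p_latent = np.array([0.68, 0.16, 0.11, 0.05])  # P(a_t | h_t)
print(round(kl(p_full, p_latent), 4))  # ≈ 0.0011, close to zero as required
```

When $h_t$ is a good sufficient statistic this divergence is driven toward zero, which is exactly the training signal FlashMem's constraint expresses.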

Latent Memory Policy Optimization (LMPO)

For MAS, memory representations are optimized end-to-end. Given simulated agent rollouts $\{\hat{\tau}_i\}$ with rewards $R(\hat{\tau}_i)$, agent-specific latent memories $m_j$ are differentiably injected, allowing gradients to flow from downstream objectives through the composer. The LMPO objective mirrors PPO with group-based advantages:

$$\mathcal{J}_{\mathrm{LMPO}}(\phi) = \mathbb{E}_{q, \mathcal{T}_q} \left[ \frac{1}{\sum_{i,j} T_{i,j}} \sum_{i=1}^{G} \sum_{j=1}^{H} \sum_{t=1}^{T_{i,j}} \operatorname{clipSur}(\phi; i, j, t) \right]$$

where

$$\operatorname{clipSur} = \min\left( r_{i,j,t}\, \hat{A}_i,\ \mathrm{clip}(r_{i,j,t},\, 1-\varepsilon,\, 1+\varepsilon)\, \hat{A}_i \right)$$

and $r_{i,j,t}$ is a token-level likelihood ratio (Fu et al., 3 Feb 2026).
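The clipped surrogate is the standard PPO construction applied per token; a scalar sketch (the function name `clip_sur` is illustrative):

```python
def clip_sur(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the gain from raising the ratio is capped at 1 + eps.
print(clip_sur(1.5, 2.0))   # 2.4 = 1.2 * 2.0
# Negative advantage: min keeps the unclipped, more pessimistic term.
print(clip_sur(1.5, -2.0))  # -3.0 = 1.5 * -2.0
```

The clipping keeps composer updates conservative: the memory cannot be pushed far from the behaviour that generated the rollouts in a single step.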

4. Control and Adaptation Mechanisms

Cognitive Monitoring in FlashMem

A parameter-free cognitive monitor assesses the predictiveness of the current context using attention entropy as a proxy for epistemic uncertainty. For each attention head, Shannon entropy is computed after masking out “sink” tokens. Aggregated entropy $H_t$ triggers memory consolidation only if it exceeds a threshold $\tau$, typically set to the $85^{\text{th}}$ percentile of held-out entropy values. This ensures consolidation occurs only during high-uncertainty phases, reducing unnecessary computation.
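The gating logic can be sketched as follows. This is a simplified single-head illustration: the calibration distribution, the `attention_entropy` helper, and the toy attention vectors are assumptions, not the paper's setup.

```python
import numpy as np

def attention_entropy(attn, sink_mask=None):
    """Shannon entropy of one head's attention distribution, after dropping
    'sink' positions and renormalising (parameter-free uncertainty proxy)."""
    w = np.array(attn, dtype=float)
    if sink_mask is not None:
        w[sink_mask] = 0.0              # mask out attention-sink tokens
    w = w / w.sum()
    w = w[w > 0]
    return float(-(w * np.log(w)).sum())

rng = np.random.default_rng(3)
# Calibrate tau as the 85th percentile of held-out entropy values.
held_out = [attention_entropy(rng.dirichlet(np.ones(64))) for _ in range(200)]
tau = np.percentile(held_out, 85)

peaked = np.zeros(64)                   # confident head: mass on one token
peaked[0] = 0.9
peaked[1:] = 0.1 / 63
uniform = np.ones(64) / 64              # maximally uncertain head

print(attention_entropy(peaked) > tau)   # False: skip consolidation
print(attention_entropy(uniform) > tau)  # True: trigger memory consolidation
```

Because the monitor only reads attention weights the model already produces, the gate adds no parameters and negligible compute.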

Role Conditioning in Multi-Agent Memory

LatentMem explicitly incorporates an agent’s role embedding $\gamma$ into memory generation, ensuring that distilled vectors are discriminative and agent-specific. Ablation studies confirm that omitting role conditioning leads to substantial accuracy degradation (e.g., $-6.45$ percentage points on MacNet), indicating its necessity for coordination and specialization (Fu et al., 3 Feb 2026).

5. Integration and Inference Workflow

LatentMem memory is injected into the LLM policy or agent as a continuous vector, not via context concatenation or parameter modification. In FlashMem, ‘soft injection’ occurs by running the backbone LLM on the $K$ latent vectors to obtain their KV pairs, appending these to the live cache, and continuing generation without re-encoding prior tokens.

In MAS, each agent’s $m_j$ is concatenated to its token embeddings, $\tilde{h}_j = \mathrm{concat}(h_j, m_j)$, resulting in:

  • No modification to policy parameters,
  • No loss of differentiability for downstream optimization,
  • Composable, plug-and-play augmentation adaptable to any framework or agent backbone.
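The MAS injection path is a single concatenation at the embedding level; a minimal sketch (the function name `inject_memory` and the shapes are illustrative assumptions):

```python
import numpy as np

def inject_memory(token_embeds, latent_memory):
    """Prepend the agent-specific latent memory to the token embeddings;
    the frozen policy consumes the augmented sequence, so no policy
    parameters change and gradients can still flow into latent_memory."""
    assert token_embeds.shape[1] == latent_memory.shape[1]  # same hidden dim D
    return np.concatenate([latent_memory, token_embeds], axis=0)

rng = np.random.default_rng(4)
h_j = rng.normal(size=(32, 16))   # agent j's input embeddings (T x D)
m_j = rng.normal(size=(8, 16))    # composed latent memory (L' x D), L' ≈ 8
h_aug = inject_memory(h_j, m_j)
print(h_aug.shape)  # (40, 16): only 8 extra positions vs. replaying full history
```

This is what makes the augmentation plug-and-play: any backbone that accepts input embeddings can consume the concatenated sequence unchanged.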

6. Empirical Results and Ablation Analyses

FlashMem Results

On benchmarks such as GSM8K, MATH, GPQA, KodCode, BookSum, and GovReport, FlashMem achieves task accuracy at or near parity with strong latent memory baselines (e.g., vs. MemGen: 70.54% vs 70.09% on GSM8K; 46.55% vs 50.16% on MATH for Qwen 2.5 1.5B), while reducing end-to-end inference latency by approximately $5\times$, consuming ~31.4 GB peak VRAM in 64k-token contexts and delivering ~20.9 tok/s throughput (Hou et al., 9 Jan 2026).

LatentMem in MAS

Across knowledge QA, code, reasoning, and planning tasks, LatentMem provides up to $+19.36$ percentage points accuracy on out-of-domain PopQA and consistent mean gains over vanilla single- and multi-agent memory schemas. LMPO-trained memory uses 50% fewer tokens and roughly $2/3$ the inference time relative to textual baselines, and outperforms multi-agent fine-tuning approaches (MARTI) by up to $+11.73$ points on TriviaQA and $+2.60$ on KodCode under matched compute (Fu et al., 3 Feb 2026).

Table 1: LatentMem: Key Empirical Performance Highlights

Scenario | FlashMem Speedup | LatentMem MAS Gain
Reasoning quality (vs. MemGen) | ~5× latency reduction | +19.36 pp (PopQA/DyLAN)
Context compression | O(1) injection | 50% token reduction
Role-dependency ablation | Not applicable | −6.45 pp (MacNet, no-role)

Ablation studies indicate performance plateaus for latent memory length at $L' \approx 8$, and LatentMem remains robust to larger $K$ for trajectory retrieval, unlike text-based methods that collapse beyond $K > 3$ (Fu et al., 3 Feb 2026).

7. Positioning within the Memory-Augmented LLM Landscape

LatentMem frameworks—through frozen backbone reuse, entropy-based gating, role-conditioned memory composition, and policy-driven optimization—stand in contrast to architectures reliant on auxiliary encoders, text replay, or parameter retuning. Their design provides efficient, dynamically customized memory for both single-agent cognitive longevity and multi-agent continual adaptation. This positions LatentMem as a foundational methodology for scalable, memory-augmented LLM reasoning and coordination without architectural modification or memory-induced context collapse (Hou et al., 9 Jan 2026, Fu et al., 3 Feb 2026).
