
MemoryLLM: Augmenting LLMs with Persistent Memory

Updated 28 January 2026
  • MemoryLLM is a framework that augments transformers with explicit, persistent memory pools to enable continual knowledge integration and extended context reasoning.
  • It employs a shift–append update mechanism with exponential forgetting, ensuring high retention of injected facts and maintaining operational stability.
  • Hybrid extensions like SuMem combine short-term latent memory with scalable external memory to achieve effective context lengths beyond 160k tokens.

MemoryLLM encompasses a distinct class of approaches and architectures that endow LLMs with explicit, persistent, and updatable memory beyond standard transformer parameters. These methods are engineered to support continual knowledge integration, long-context reasoning, and efficient adaptation to new information, while maintaining operational stability and scalability throughout extended deployment periods. The foundations and evolution of MemoryLLM are illustrated through advances in latent-space memory mechanisms, hybrid retrieval systems, and rigorous evaluation frameworks.

1. Architectural Principles of MemoryLLM

MemoryLLM introduces an explicit memory pool, θ, distributed within the latent space of a base transformer model φ (e.g., LLaMA-2-7B). For each transformer layer ℓ, memory is instantiated as a parameter matrix:

\theta_\ell \in \mathbb{R}^{N \times d}

where N is the number of memory tokens per layer and d is the hidden dimension. In the canonical 7B-parameter implementation, N = 7,680 across L = 32 layers, yielding a ∼10⁹-parameter memory superstructure. Each forward pass fuses input token representations h_ℓ with θ_ℓ via cross-attention, tightly coupling session context with persistent latent memory. The semantics and dynamics of this architecture are defined by explicit read (attention) and write (update) interfaces, both embedded within the transformer computation graph (Wang et al., 2024).
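The read interface can be sketched as a per-layer memory pool whose tokens are fused with session hidden states by cross-attention. The class and parameter names below are illustrative (with toy sizes), not the authors' released code:

```python
import torch
import torch.nn as nn

class LatentMemoryLayer(nn.Module):
    """Sketch of one layer's persistent memory pool theta_l in R^{N x d}.
    Session hidden states attend over the memory tokens (the "read" interface);
    names and sizes are illustrative, not the MemoryLLM implementation."""

    def __init__(self, n_mem: int, d_model: int, n_heads: int = 8):
        super().__init__()
        # Persistent memory pool, one parameter matrix per transformer layer
        self.theta = nn.Parameter(torch.randn(n_mem, d_model) * 0.02)
        self.read_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d) session hidden states; memory is read-only here
        mem = self.theta.unsqueeze(0).expand(h.size(0), -1, -1)
        fused, _ = self.read_attn(query=h, key=mem, value=mem)
        return h + fused  # residual fusion of context with latent memory

layer = LatentMemoryLayer(n_mem=240, d_model=64)  # toy sizes, not N=7,680
out = layer(torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```

The write interface (the update operator U) modifies `theta` between forward passes, as described in the next section.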

2. Self-Update Dynamics and Retention Mechanism

Knowledge injection into MemoryLLM is realized through a deterministic update operator U(θ, x_c), where x_c is the new text context. The update procedure unfolds as:

  • At each layer ℓ, extract the last K memory tokens, e_θ^ℓ ← θ_ℓ[N−K, …, N−1].
  • Concatenate e_θ^ℓ with the new token embeddings h_ℓ(x_c).
  • Run the transformer sublayer to derive the updated ê_θ^ℓ.
  • Update the memory by discarding the oldest K tokens (“shift left”) and appending ê_θ^ℓ, so:

\theta_\ell' = [\,\theta_\ell[0 \ldots N-K-1]\,;\,\hat{e}_\theta^\ell\,] \in \mathbb{R}^{N \times d}

Chained updates across t steps yield exponential retention: any fact injected t updates ago decays with retention ratio (1 − K/N)^t. This enforces a smooth exponential forgetting curve, with global capacity governed by (N, K). For the standard hyperparameters (N = 7,680, K = 256), approximately e⁻¹ of any injected information persists after 30 updates (Wang et al., 2024).
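The shift–append step and the retention arithmetic above can be sketched in a few lines. The function name is illustrative, not the authors' API:

```python
import torch

def shift_append_update(theta: torch.Tensor, e_hat: torch.Tensor) -> torch.Tensor:
    """Drop the oldest K memory tokens and append the K freshly written ones,
    keeping the pool at its fixed size N. theta: (N, d); e_hat: (K, d).
    (Illustrative sketch of the per-layer update, not the released code.)"""
    k = e_hat.size(0)
    return torch.cat([theta[k:], e_hat], dim=0)  # shape stays (N, d)

theta = torch.arange(12.0).reshape(6, 2)   # toy pool: N=6, d=2
new = torch.full((2, 2), -1.0)             # K=2 freshly written tokens
theta = shift_append_update(theta, new)
print(theta.shape)                         # torch.Size([6, 2])

# Retention arithmetic from the text: with N=7680, K=256, a fact injected
# t=30 updates ago survives with ratio (1 - K/N)^t, close to 1/e.
retention = (1 - 256 / 7680) ** 30
print(round(retention, 3))  # 0.362, vs. 1/e ~ 0.368
```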

3. Long-Term Memory Extension and Retrieval (M+ / SuMem)

Basic MemoryLLM compresses and updates its fixed memory pool efficiently up to context lengths near 16k tokens; however, retention of knowledge beyond 20k tokens is empirically limited due to the random eviction of most-recently injected tokens. The M+ or SuMem extension addresses this by hybridizing short-term latent memory with a scalable externalized long-term memory (LTM) per layer, denoted Θ_ℓ (Wang et al., 1 Feb 2025).

Dropped tokens from θ_ℓ are appended to Θ_ℓ, indexed along with monotonic ages. Each retrieval step dynamically extracts relevant vectors from Θ_ℓ via a co-trained two-tower retriever (key and query projection heads), selects the top K₀ by similarity, sorts them chronologically, and fuses them, alongside θ_ℓ, into the cross-attention pattern. This hybrid approach increases the effective memory horizon by an order of magnitude (>160k tokens with M = 150k entries per layer), while GPU memory usage remains within practical bounds via CPU offload and selective GPU transfer. Retriever precision is maintained through joint training with a contrastive loss over relevant/irrelevant token splits (Wang et al., 1 Feb 2025).
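A minimal sketch of the retrieval step follows. For illustration it scores entries by raw cosine similarity; the actual system uses co-trained key/query projection heads, which are omitted here, and all names are hypothetical:

```python
import torch
import torch.nn.functional as F

def retrieve_ltm(q_hidden, ltm_keys, ltm_values, ltm_ages, k0=4):
    """Sketch of the M+/SuMem retrieval step (illustrative names): score every
    long-term-memory key against a query vector, take the top-K0 entries,
    then restore chronological order before fusing them in cross-attention.
    The learned two-tower projections are replaced by cosine similarity here."""
    # q_hidden: (d,) pooled query; ltm_keys/ltm_values: (M, d); ltm_ages: (M,)
    scores = F.cosine_similarity(q_hidden.unsqueeze(0), ltm_keys, dim=-1)
    top = torch.topk(scores, k=k0).indices   # most similar K0 entries
    order = torch.argsort(ltm_ages[top])     # re-sort chronologically
    picked = top[order]
    return ltm_values[picked]                # (K0, d), ready to fuse with theta_l

d, M = 8, 100
vals = torch.randn(M, d)
out = retrieve_ltm(torch.randn(d), torch.randn(M, d), vals, torch.arange(M))
print(out.shape)  # torch.Size([4, 8])
```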

4. Empirical Evaluation: Knowledge Editing, Retention, and Long-Context QA

MemoryLLM and its variants are evaluated across editing, retention, and reasoning protocols:

  • Model Editing Benchmarks (ZsRE, CounterFact): After direct fact injection, MemoryLLM-7B achieves harmonic post-edit scores of 79.2 (ZsRE) and 75.3 (CounterFact), outperforming established methods such as FT, ROME, and IKE (Wang et al., 2024).
  • Long-Context Question Answering: On LongBench (512–65,536 tokens), MemoryLLM’s F1 rises monotonically with context length, outperforming LongLLaMA, OpenLLaMA, and LongLoRA on four out of six datasets (Wang et al., 2024).
  • Customized Retention: When injected facts are repeatedly diluted with up to t = 20 distractor updates, accuracy decays in line with (1 − K/N)^(t−1), preserving ≈99% correctness after 20 updates (Wang et al., 2024).
  • Operational Integrity: No measurable degradation or drift is observed after ∼10⁶ online memory updates, including full-regime cycling over 2,250 SQuAD and 1,004 NaturalQA injections (tracked up to 650,000 steps) (Wang et al., 2024).
  • Scalable Retention: SuMem achieves >50% accuracy out to >160k tokens, whereas MemoryLLM and other approaches drop to zero beyond 30k tokens. On LongBook-QA and Event-QA, SuMem exceeds the F1/accuracy of all tested Llama- and retrieval-based competitors (Wang et al., 1 Feb 2025).

5. Algorithmic and Resource Optimization

MemoryLLM’s update and retrieval procedures operate entirely within the transformer’s architecture, requiring no auxiliary gradient heads or external memory orchestrators. The shift–append logic is architectural, ensuring O(1) update per injection and predictable memory scaling. SuMem’s external LTM is efficiently managed by keeping (token, embedding) pairs and keys on CPU, transferring only retrieved slices to GPU per generation step, with ≤3% latency overhead observed at 128k context sizes (Wang et al., 1 Feb 2025). Ablations confirm the necessity of co-trained retrievers (vs. simple attention scans) for high recall at large memory sizes.
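The CPU-offload pattern described above can be sketched as a store that grows on the host and copies only the retrieved rows to the accelerator. The class is a hypothetical illustration, not the released management code:

```python
import torch

class OffloadedLTM:
    """Illustrative sketch of keeping the long-term memory on CPU and
    transferring only retrieved slices to the accelerator per step."""

    def __init__(self, d: int, device: str = "cpu"):
        self.store = torch.empty(0, d)  # grows on CPU as tokens are evicted
        self.device = device            # target device for retrieved slices

    def append(self, dropped: torch.Tensor) -> None:
        # Evicted short-term memory tokens land here, oldest first.
        self.store = torch.cat([self.store, dropped.cpu()], dim=0)

    def gather(self, idx: torch.Tensor) -> torch.Tensor:
        # Transfer only the selected rows, never the whole pool.
        return self.store[idx].to(self.device)

ltm = OffloadedLTM(d=16)
ltm.append(torch.randn(32, 16))
slice_ = ltm.gather(torch.tensor([0, 5, 9]))
print(slice_.shape)  # torch.Size([3, 16])
```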

6. Design Trade-Offs and Limitations

  • Memory Size: Even with optimal N/K trade-offs, current implementations maintain ∼1B-parameter memory pools per model, corresponding to GPU requirements of ≥48 GB for pure latent approaches (Wang et al., 2024).
  • Retention vs. Compression: A lower K increases per-update retention but may reduce per-injection capacity for high-value data.
  • Long-Term vs. Short-Term Memory: SuMem offloads most history for cost-efficiency, but retrieval latency and large-scale memory management remain open engineering challenges (Wang et al., 1 Feb 2025).
  • Extension to Multimodal/Instructional Paradigms: MemoryLLM and SuMem are primarily text-only; future work is required for vision or audio (Wang et al., 1 Feb 2025, Wang et al., 2024).

7. Implementation and Reproducibility

MemoryLLM is implemented in PyTorch, using the HuggingFace Transformers API. The update mechanism is a minimal, non-intrusive patch: last-K memory tokens are extracted, transformer layers are run forward, memory is shifted and appended per layer, and embeddings are stored for subsequent retrieval (Wang et al., 2024). SuMem’s hybrid memory requires CPU–GPU transfer logic and trained retrieval towers, but otherwise reuses the MemoryLLM foundation (Wang et al., 1 Feb 2025). Open-source codebases are provided for end-to-end training, self-update, and evaluation.


Together, the MemoryLLM framework and its extensions illustrate a principled approach to augmenting transformers with fully differentiable, updatable, and persistent memory pools, enabling both high-fidelity knowledge injection and long-context temporal reasoning far beyond the standard transformer context window (Wang et al., 2024, Wang et al., 1 Feb 2025).

