MemoryLLM Architecture: Persistent Memory in LLMs
- MemoryLLM is an architectural framework that incorporates a fixed-size, learnable memory pool into transformer layers, enabling efficient dynamic knowledge integration.
- The system fuses memory tokens with current activations via self-attention, supporting real-time updates and controlled, exponential forgetting without retraining.
- Experimental ablations show that the number of memory tokens per layer and the update batch size critically affect long-term retention and GPU memory cost.
MemoryLLM is an architectural framework for LLMs that infuses latent-space persistent memory directly within the multi-layer transformer backbone. It is designed to address the static nature of conventional LLMs post-deployment by enabling efficient, large-scale self-updatable memory, facilitating rapid assimilation of new knowledge and controllable forgetting. Developed around a memory pool concept interleaved with standard transformer computations, MemoryLLM advances mechanisms for information integration, retention, and update at inference without extensive retraining or backpropagation steps (Wang et al., 7 Feb 2024).
1. Architectural Foundation: Latent-Space Memory Integration
MemoryLLM extends a standard decoder-only transformer, such as Llama2-7B or Llama-3.1-8B, with a fixed-size learnable memory pool θ. For a transformer of L layers and hidden size d, each layer l is associated with a memory pool θₗ ∈ ℝ^{N×d}, where N is the number of memory tokens per layer. The architectural configuration in the canonical MemoryLLM instance is L=32, d=4096, and N=7680 (for Llama2-7B), resulting in a total memory pool footprint of approximately 1B parameters. Each memory token θₗ,ᵢ captures compressed knowledge at layer l and is updated dynamically as the model ingests new text (Wang et al., 7 Feb 2024, Wang et al., 1 Feb 2025).
The system operates in two principal phases:
- Generation (read) phase: For an input sequence, per-layer token hidden states Hₗ ∈ ℝ^{n_x×d} are concatenated with θₗ, and self-attention is performed over the combined sequence, enabling model states to attend both to current activations and the persistent layer-wise memory.
- Self-update (write) phase: New knowledge chunks are injected by extracting K recent slots from θₗ, concatenating them with new token hidden states, processing via a forward pass, and appending the resulting memory excerpts back into θₗ while discarding K old slots selected at random.
This approach embeds memory across all transformer layers, providing both hierarchical and distributed knowledge storage.
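As a concrete illustration, the following PyTorch sketch shows how such a per-layer pool might be held and read during generation; the class and parameter names (`LayerMemoryPools`, `mem_tokens`, and so on) are our illustrative assumptions, not the released implementation.

```python
# A minimal PyTorch sketch (our illustration, not the released implementation)
# of the per-layer memory pool theta and the generation-time read path.
import torch
import torch.nn as nn

class LayerMemoryPools(nn.Module):
    def __init__(self, num_layers: int = 32, mem_tokens: int = 7680, hidden: int = 4096):
        super().__init__()
        # One learnable pool theta_l of shape (N, d) per transformer layer.
        self.theta = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(mem_tokens, hidden)) for _ in range(num_layers)]
        )

    def read(self, layer: int, hidden_states: torch.Tensor) -> torch.Tensor:
        # Generation (read) phase: self-attention later runs over [H_l ; theta_l],
        # letting tokens attend to current activations and persistent memory alike.
        return torch.cat([hidden_states, self.theta[layer]], dim=0)
```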
2. Memory Pool Update, Compression, and Forgetting Mechanism
MemoryLLM's self-update is governed by an explicit compression/overwrite algorithm:
- Extract last K entries eₗ = θₗ[N−K:N] ∈ ℝ^{K×d};
- Concatenate eₗ with new hidden states Hₗ(x_c), forming Iₗ = [eₗ; Hₗ(x_c)];
- Process Iₗ through φₗ, yielding outputs Oₗ = φₗ(Iₗ), with updated excerpts eₗ′ = Oₗ[−K:, :];
- Drop K random rows from θₗ and append eₗ′, reconstructing θₗ;
- Repeat for all layers and new context chunks.
Formally, the update can be expressed as θₗ ← U^{(l)}(θₗ, x_c) = [θₗ[S]; φₗ([eₗ; Hₗ(x_c)])[−K:, :]], where S is a uniformly random subset of N−K surviving slot indices; that is, U^{(l)} stochastically drops K slots and appends K new compressed tokens (Wang et al., 7 Feb 2024).
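A sketch of one such update step, under the same illustrative naming as above (φₗ is the frozen transformer layer at depth l; batching and dtype handling are omitted):

```python
# Sketch of a single self-update step U^{(l)} for one layer, following the
# algorithm above (illustrative; not the reference implementation).
import torch

@torch.no_grad()
def self_update(theta_l: torch.Tensor, h_l: torch.Tensor, phi_l, k: int = 256) -> torch.Tensor:
    # theta_l: (N, d) memory pool; h_l: (n_c, d) hidden states of the new chunk x_c;
    # phi_l: the frozen transformer layer at depth l, mapping (1, n, d) -> (1, n, d).
    n = theta_l.size(0)
    e_l = theta_l[n - k:]                             # last K memory slots
    i_l = torch.cat([e_l, h_l], dim=0)                # I_l = [e_l ; H_l(x_c)]
    o_l = phi_l(i_l.unsqueeze(0)).squeeze(0)          # O_l = phi_l(I_l)
    e_new = o_l[-k:]                                  # compressed excerpts e_l'
    keep = torch.ones(n, dtype=torch.bool)
    keep[torch.randperm(n)[:k]] = False               # drop K random old slots
    return torch.cat([theta_l[keep], e_new], dim=0)   # reconstructed theta_l, still (N, d)
```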
Knowledge retention follows an exponential forgetting curve: after t updates, the expected fraction of an injected chunk still resident in θₗ is (1 − K/N)^t ≈ exp(−tK/N), given K ≪ N. Larger N or smaller K slows forgetting. Importantly, this scheme enables frequent updates (e.g., after every batch) without affecting the frozen transformer backbone φ parameters.
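For the canonical configuration, a quick computation of this curve (our arithmetic) shows how quickly injected knowledge decays:

```python
# Worked example (our arithmetic) of the forgetting curve for the canonical
# configuration N = 7680 memory tokens per layer and K = 256 slots per update.
N, K = 7680, 256

def retention(t: int) -> float:
    # Expected fraction of an injected chunk's slots surviving after t updates.
    return (1 - K / N) ** t

for t in (1, 10, 30, 100):
    print(f"after {t:3d} updates: {retention(t):.3f}")
# -> 0.967, 0.713, 0.362, 0.034; retention halves after roughly 20 updates.
```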
3. Memory Addressing, Attention Patterns, and Generation
During generation, the memory and token states are fused in-place via layerwise self-attention:
- Queries: Q = HₗW_Q; keys and values span the concatenated sequence: K = [Hₗ; θₗ]W_K, V = [Hₗ; θₗ]W_V.
- Attention: Attn(Q, K, V) = softmax(QKᵀ/√d)·V, computed per layer and per head.
- Every input token attends to all N memory slots (and to the other tokens) without additional gating or address weighting.
No specialized index, gating, or learned addressing is performed: memory slots are treated identically to ordinary tokens under standard attention, yielding a fully differentiable mechanism, with no separate memory controller, that supports high-throughput, long-context operation with minimal architectural intervention (Wang et al., 7 Feb 2024).
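A single-head sketch of this fusion (our simplification; the actual backbone uses multi-head attention, rotary position embeddings, and causal masking over the token block, all omitted here for clarity):

```python
# Single-head sketch of memory-fused self-attention (illustrative only).
import math
import torch
import torch.nn as nn

class MemoryFusedAttention(nn.Module):
    def __init__(self, d: int = 4096):
        super().__init__()
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k = nn.Linear(d, d, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)
        self.d = d

    def forward(self, h_l: torch.Tensor, theta_l: torch.Tensor) -> torch.Tensor:
        # h_l: (n_x, d) current token states; theta_l: (N, d) memory pool.
        ctx = torch.cat([h_l, theta_l], dim=0)        # keys/values span [H_l ; theta_l]
        q = self.w_q(h_l)                             # queries come from tokens only
        k, v = self.w_k(ctx), self.w_v(ctx)
        scores = q @ k.T / math.sqrt(self.d)          # (n_x, n_x + N)
        return torch.softmax(scores, dim=-1) @ v      # fused token states, (n_x, d)
```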
4. Scalability, Hyperparameters, and Empirical Ablations
Key hyperparameters are:
- Memory tokens per layer (N): 7680 (canonical), with tested values 2560–7680.
- Update batch (K): typically 256 (sometimes 512).
- Depth (L): 32.
- Hidden dimension (d): 4096.
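For reference, these settings can be bundled as a configuration object (field names are ours, not the released code's); the derived count reproduces the ~1B-parameter memory footprint noted above.

```python
# Hypothetical configuration object bundling the hyperparameters listed above.
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryLLMConfig:
    num_layers: int = 32       # L
    hidden_size: int = 4096    # d
    mem_tokens: int = 7680     # N, memory tokens per layer
    update_slots: int = 256    # K, slots replaced per self-update

    @property
    def memory_pool_params(self) -> int:
        # L * N * d = 32 * 7680 * 4096 ~= 1.0e9 parameters.
        return self.num_layers * self.mem_tokens * self.hidden_size

print(f"{MemoryLLMConfig().memory_pool_params:,}")  # 1,006,632,960
```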
Scaling ablations reveal:
- Larger N (more slots): slower forgetting, larger memory pool, but higher GPU memory cost.
- Smaller K (more compression per update): increases memory retention at the expense of capacity for new knowledge.
- Only the memory-pool parameters θ change after pretraining (the transformer backbone φ stays frozen), which keeps updates efficient and stable.
- Empirical testing confirmed long-term retention persists for substantial numbers of memory injections, with no measurable degradation after nearly a million updates (Wang et al., 7 Feb 2024).
5. Extensions: Long-Term Memory, Retrieval, and the M+ Framework
MemoryLLM’s architecture is limited by the fixed memory pool capacity; retention degrades for sequences >16k–20k tokens. The M+ extension introduces scalable long-term memory (LTM) and dynamic retrieval:
- All tokens dropped from θₗ during update are stored in a growing Θₗ (max capacity M, e.g., 150,000). Each is tagged by age.
- At generation, a learned retriever (MLP key and query projectors) selects top-K₀ (e.g., 2560) relevant LTM tokens, which are appended to the per-layer memory pool for cross-attention.
- The retriever is co-trained with a discriminative (contrastive) loss that pulls relevant memory tokens θ₊ close to the query representation and pushes irrelevant tokens θ₋ away.
- Multi-LoRA adapters decouple memory adaptation for write and generation phases.
- Empirically, M+ extends the effective knowledge-retention horizon from roughly 20k to 160k tokens with less than 5% additional GPU memory, retaining performance on long-horizon tasks (Wang et al., 1 Feb 2025).
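A minimal sketch of the retrieval step described in this list, assuming a dot-product scorer over learned projections (the projection width `d_r`, the mean-pooled query, and the scoring rule are our assumptions, not details from the released code):

```python
# Minimal sketch of M+'s long-term-memory retrieval (our paraphrase).
import torch
import torch.nn as nn

class LTMRetriever(nn.Module):
    def __init__(self, d: int = 4096, d_r: int = 512):
        super().__init__()
        self.key_proj = nn.Linear(d, d_r, bias=False)    # projects stored LTM tokens
        self.query_proj = nn.Linear(d, d_r, bias=False)  # projects current hidden states

    def forward(self, query_states: torch.Tensor, ltm: torch.Tensor, k0: int = 2560) -> torch.Tensor:
        # query_states: (n_q, d) current states; ltm: (M, d) tokens dropped from theta_l.
        q = self.query_proj(query_states).mean(dim=0)    # pooled query vector, (d_r,)
        keys = self.key_proj(ltm)                        # (M, d_r)
        scores = keys @ q                                # relevance score per LTM token
        top = scores.topk(min(k0, ltm.size(0))).indices  # indices of the top-K0 tokens
        return ltm[top]                                  # tokens to append to the layer's pool
```

At training time, these projectors would be fit with the discriminative loss described above, pulling θ₊ toward the query and pushing θ₋ away.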
6. Memory Taxonomy, Evaluation, and Governance
MemoryLLM fits within a broad taxonomy composed of four memory substrates:
- Parametric: Static model weights.
- Contextual: KV-cache/in-context examples.
- External: Persistent stores queried at inference (retrieval-augmented models).
- Procedural/Episodic: Event logs maintained over sessions.
A formal memory descriptor tuple (L: location, P: persistence, W: write path, A: access path, C: controllability) specifies the persistence, updatability, and operational regime of each memory type.
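One hypothetical way to encode this descriptor programmatically, with an example classification of MemoryLLM's pool drawn from the properties described in this article (the field values are our paraphrase, not taken from the taxonomy paper):

```python
# Hypothetical encoding of the (L, P, W, A, C) descriptor.
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryDescriptor:
    location: str         # L: where the memory lives (weights, activations, external store)
    persistence: str      # P: lifetime (per request, per session, indefinite)
    write_path: str       # W: how entries are written (training, self-update, API insert)
    access_path: str      # A: how entries are read (attention, retrieval, lookup)
    controllability: str  # C: whether entries can be inspected, edited, or rolled back

memoryllm_pool = MemoryDescriptor(
    location="layer-internal memory pool theta",
    persistence="indefinite, with exponential decay across updates",
    write_path="self-update forward pass (no backpropagation)",
    access_path="standard self-attention over [H_l ; theta_l]",
    controllability="layer- and token-level granularity, auditable under DMM-Gov",
)
```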
MemoryLLM’s self-updatable pool and its M+ extension operationalize both contextual (short-term) and external (retrieval-augmented, LTM) memory, enabling dynamic, layer-internal, and externally retrievable knowledge to coexist and be evaluated in unified protocols. Evaluation metrics include closed-book recall, edit differential, length- and position-performance curves, and multi-layered recall/attribution for retrieval and episodic memory (Zhang et al., 23 Sep 2025).
Update and governance cycles are orchestrated through Dynamic Memory Management with Governance (DMM-Gov), integrating PEFT, DAPT/TAPT, targeted model editing, and retrieval-based augmentation within an auditable monitoring and rollback loop.
7. Significance, Hardware Implications, and Limitations
MemoryLLM advances the state of dynamic, persistent, and self-updating memory in LLMs:
- Enables efficient knowledge incorporation without retraining or backpropagation post-deployment.
- Provides controlled forgetting at a layerwise, token-level granularity.
- Maintains operational integrity across high update volumes, supporting long-term inference tasks and domain adaptation.
Hardware resources are dominated by the memory pool size. Scaling beyond tens of thousands of memory tokens per layer imposes an increasing GPU memory load; M+ mitigates this by offloading the long-term memory store and retrieval computation to CPU and by retrieving only a small, relevant subset of tokens per query.
A current limitation is one-way forgetting: once slots are dropped they cannot be recovered, and retained memory decays exponentially with the number of updates, constraining the effective history window of the base model to roughly 20k tokens. The M+ extension removes this bottleneck, but introduces retriever training and memory-management complexity.
MemoryLLM provides a concrete template for future LLM architectures integrating persistent, interpretable, auditable, and updatable memory at scale, supporting research in long-term knowledge retention, continual learning, and robust deployment governance (Wang et al., 7 Feb 2024, Wang et al., 1 Feb 2025, Zhang et al., 23 Sep 2025).