
UltraMemV2: Efficient Memory Layer for Transformers

Updated 29 August 2025
  • UltraMemV2 is a memory-layer architecture for Transformers that integrates memory layers into every block and simplifies value expansion using a single linear projection.
  • It employs FFN-based value processing and principled parameter initialization to ensure training stability and efficient scaling up to 120B parameters.
  • The design achieves MoE-level performance on memory-intensive benchmarks by reducing memory access costs and optimizing activation density.

UltraMemV2 is a redesigned memory-layer architecture for large-scale neural networks, specifically Transformers, that achieves parity with state-of-the-art Mixture of Experts (MoE) models in both efficiency and performance, with substantially lower memory access costs. Unlike previous memory-based approaches—including UltraMem—UltraMemV2 introduces five key innovations: integration of memory layers into every block, a simplified value expansion scheme, FFN-based value processing adapted from PEER, principled parameter initialization for stable training, and an optimized memory-to-FFN computation ratio. UltraMemV2 demonstrates clear empirical gains in memory-intensive benchmarks while maintaining scalable performance for models up to 120B parameters.

1. Architectural Innovations

UltraMemV2 introduces the following core architectural modifications:

  • Memory Layer Integration: Memory layers are inserted into every transformer block, ensuring each block merges standard feed-forward output with a distinct "memory-enhanced" signal. This architectural coupling increases expressive capacity without incurring high inference-time memory access.
  • Simplified Value Expansion: The Implicit Value Expansion (IVE) from UltraMem, which used multiple linear projections, is replaced by a single linear projector. The memory output computation is streamlined: where UltraMem computes $o = \sum_{i \in \mathcal{J}} s_i \cdot (\mathrm{SiLU}(x U_i) \times V_i^\top)$ over multiple expansion projections, UltraMemV2 retains a multi-head mechanism with shared row/column keys and a single mapping.
  • FFN-based Value Processing (PEER): Value embeddings are processed by a feed-forward network (FFN) with a single inner dimension, paralleling the FFN in standard Transformer blocks (SwiGLU-style). The output is expressed as $o = W^\top (V^\top \cdot ((P x) \otimes \hat{s}))$, which both matches the computational structure of FFNs and improves parameter efficiency (a minimal forward-pass sketch follows this list).
  • Principled Initialization: UltraMemV2 uses a derived initialization for memory-layer parameters so that the output activation variance remains matched to that of the FFN. The variance formula

$\sigma_{\text{mem}}^2 = (0.16 + 0.16\,\sigma_s^2) \cdot \sigma_V^4 \cdot d_{\text{pre}} \cdot k \cdot n_{\text{head}} \cdot d_v / h$

as detailed in the Appendix, is chosen to prevent divergence and ensure stability across layers.

  • Optimized Memory-to-FFN Ratio: UltraMemV2 increases the density of activated values ("TopM" count), demonstrating that activation density is more influential than total sparse parameter count. Adjustments to the memory and FFN computation ratio are made throughout the architecture to match or exceed MoE performance.
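
Taken together, these components form a compact forward pass: key scoring selects the top-M value slots, and the selected slots are processed FFN-style via $o = W^\top (V^\top ((P x) \otimes \hat{s}))$. The PyTorch sketch below is a minimal illustration under assumed dimensions; the class name, the additive row/column score combination (a simplification of the grid scoring shown in Section 3), the softmax normalization, and the default initialization are all assumptions standing in for details the paper specifies more carefully (e.g., the appendix-derived variance matching).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryLayerSketch(nn.Module):
    """Simplified UltraMemV2-style memory layer (illustrative only)."""

    def __init__(self, hidden=512, n_rows=128, n_cols=128, key_dim=64,
                 d_value=64, top_m=16):
        super().__init__()
        self.top_m = top_m
        # Row/column keys factorize a grid of n_rows * n_cols value slots.
        self.query_proj = nn.Linear(hidden, 2 * key_dim, bias=False)
        self.row_keys = nn.Parameter(torch.randn(n_rows, key_dim) / key_dim ** 0.5)
        self.col_keys = nn.Parameter(torch.randn(n_cols, key_dim) / key_dim ** 0.5)
        # PEER-style value processing: each slot owns one "pre-value" row (P)
        # and one value row (V); a single output projection W maps back to hidden.
        n_slots = n_rows * n_cols
        self.pre_values = nn.Embedding(n_slots, hidden)          # rows of P
        self.values = nn.Embedding(n_slots, d_value)             # rows of V
        self.out_proj = nn.Linear(d_value, hidden, bias=False)   # W

    def forward(self, x):                                   # x: (batch, hidden)
        q_row, q_col = self.query_proj(x).chunk(2, dim=-1)
        s_row = q_row @ self.row_keys.t()                   # (batch, n_rows)
        s_col = q_col @ self.col_keys.t()                   # (batch, n_cols)
        # Additive row/column combination stands in for the paper's grid score
        # S_grid = TopM(S_row^T C S_col); only the top-M slots are kept.
        s_grid = s_row.unsqueeze(-1) + s_col.unsqueeze(-2)  # (batch, rows, cols)
        scores, idx = s_grid.flatten(1).topk(self.top_m, dim=-1)
        s_hat = F.softmax(scores, dim=-1)                   # normalization assumed
        # FFN-based value processing: o = W^T (V^T ((P x) * s_hat))
        p = self.pre_values(idx)                            # (batch, top_m, hidden)
        act = torch.einsum("bmh,bh->bm", p, x)              # (P x): scalar per slot
        v = self.values(idx)                                # (batch, top_m, d_value)
        mixed = torch.einsum("bm,bmd->bd", act * s_hat, v)  # V^T ((P x) * s_hat)
        return self.out_proj(mixed)          # added to the block's FFN output


out = MemoryLayerSketch()(torch.randn(4, 512))
print(out.shape)   # torch.Size([4, 512])
```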

2. Performance Characteristics

Extensive evaluation of UltraMemV2 establishes parity with strong MoE baselines (activating 8 experts per token) under equivalent compute and parameter budgets. Key results include:

  • General Benchmarks: On OpenBench and HardBench, UltraMemV2 matches or slightly surpasses SeedMoE models after continued training on high-quality tokens.
  • Memory-Intensive Tasks: Performance gains are quantified, with +1.6 points in long-context memorization, +6.2 in multi-round memorization, and +7.9 in in-context learning benchmarks. These gains reflect improved retention, context utilization, and ability to reason over extended dialogue or few-shot training sequences.
  • Training Stability: Parameter initialization ensures deep architectures remain stable, with both "Pre-values" and "Values" variances carefully matched to FFN output statistics.

3. Handling Memory-Intensive Benchmarks

UltraMemV2 demonstrates clear superiority in memory-bound workloads:

  • Long-context Memorization: The architecture supports linear scaling of memory access with context length, up to 32K tokens, via efficient top-M retrieval. The relevant formula for memory selection in these tasks is:

$S_{grid} = \sigma_{\text{TopM}}(S_{row}^\top C S_{col})$

This mechanism enables robust retrieval of contextually relevant embeddings for high-fidelity output (a scoring sketch follows this list).

  • Multi-round Memorization & In-context Learning: By streamlining the IVE and leveraging FFN-based value transformations, UltraMemV2 achieves higher multi-turn retention and in-context adaptation scores compared to MoE and prior memory networks.
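
To make the selection formula concrete, the sketch below evaluates the grid score for a single token, under the assumption that $S_{row}$ and $S_{col}$ are low-rank score matrices of the query against the row and column keys and $C$ is a small core matrix coupling them; the rank, grid sizes, and index bookkeeping are illustrative choices rather than the paper's implementation.

```python
import torch

rank, n_rows, n_cols, top_m = 2, 128, 128, 16   # assumed sizes for illustration
S_row = torch.randn(rank, n_rows)   # query scores against row keys (rank-r)
S_col = torch.randn(rank, n_cols)   # query scores against column keys (rank-r)
C = torch.randn(rank, rank)         # small core matrix coupling the two score sets

# Full (n_rows, n_cols) grid: entry (i, j) scores value slot (i, j).
S_grid = S_row.t() @ C @ S_col

# sigma_TopM keeps only the M best slots; everything else is discarded.
scores, flat_idx = S_grid.flatten().topk(top_m)
rows, cols = flat_idx // n_cols, flat_idx % n_cols   # back to grid coordinates
print(scores.shape, rows.shape, cols.shape)          # torch.Size([16]) each
```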

4. Scalability and Activation Density

The UltraMemV2 architecture is validated at the highest reported scale:

  • Parameter Scaling: Models with up to 120B sparse parameters and 2.5B activated parameters have been deployed, demonstrating efficient usage of large parameter spaces.
  • Activation Density: Increasing activated values per layer (e.g., Top768 over Top256) yields consistent improvement, even with lower sparse parameter volumes. The retrieval operation for activated values:

$o = W^\top (V^\top \hat{s})$

describes how computation is efficiently distributed, with empirical results showing that memory access cost grows slowly with sequence length (a rough per-token estimate follows this list).
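
As a rough illustration of why activation density dominates the access cost, the snippet below estimates the value-parameter bytes read per token for two top-M settings. The dimensions and byte width are assumptions; the point is only that the estimate scales with M and is independent of the total number of slots in the table.

```python
# Back-of-envelope memory-access estimate per token for top-M value retrieval.
# All dimensions and byte widths below are assumed for illustration; the paper
# reports the qualitative trend (access scales with M, not table size).
def bytes_per_token(top_m, d_pre=512, d_value=64, bytes_per_param=2):
    # Each activated slot reads one pre-value row (d_pre) and one value row (d_value).
    return top_m * (d_pre + d_value) * bytes_per_param

for m in (256, 768):
    kb = bytes_per_token(m) / 1024
    print(f"Top{m}: ~{kb:.0f} KiB of value parameters read per token")
# Top256: ~288 KiB    Top768: ~864 KiB
# The estimate does not depend on the total slot count, which is why activation
# density, not sparse capacity, drives the memory-access cost.
```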

5. Comparative Efficiency

Key comparisons between UltraMemV2 and MoE highlight the architecture’s strengths:

  • Memory Access Cost: MoE requires costly token routing to multiple expert FFNs. UltraMemV2, in contrast, activates a predetermined set of embeddings from a sparse but high-capacity table, achieving lower latency and power consumption.
  • Parameter Activation vs. Sparse Capacity: UltraMemV2 directly demonstrates that increasing activation density, not just total parameter count, is the critical variable for performance optimization.
  • Computational Footprint: Through the combination of single linear value projection, FFN-based processing, and activation sparsity, UltraMemV2 matches dense FFN costs while achieving the accuracy of 8-expert MoE models.

Model        Memory Access Cost   Parameter Activation     Long-context Performance
MoE-8ex      High                 Token-routed experts     Baseline
UltraMemV2   Low                  Fixed TopM values        +1.6 (over baseline)

UltraMemV2 thus achieves favorable inference speed and accuracy, with clear advantages on memory-heavy tasks; a rough per-token comparison against an 8-expert MoE block is sketched below.
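
For intuition about the access gap in the table, the following back-of-envelope estimate contrasts the parameters a single token touches in an 8-expert SwiGLU MoE block with those touched by a fixed top-M value lookup. Every dimension here is an assumption chosen only for illustration; the paper reports the qualitative gap, not these specific figures, and shared attention and dense-FFN costs are excluded on both sides.

```python
# Rough per-token parameter-read comparison (assumed dimensions, illustration only).
def moe_params_per_token(n_experts_active=8, hidden=2048, d_ff=1408):
    # SwiGLU expert FFN: gate, up, and down projections per activated expert.
    return n_experts_active * 3 * hidden * d_ff

def memory_params_per_token(top_m=768, d_pre=2048, d_value=128):
    # Each activated value slot contributes one pre-value row and one value row.
    return top_m * (d_pre + d_value)

print(f"MoE (8 experts):     {moe_params_per_token() / 1e6:.1f}M params read")
print(f"UltraMemV2 (Top768): {memory_params_per_token() / 1e6:.1f}M params read")
# MoE (8 experts):     69.2M params read
# UltraMemV2 (Top768): 1.7M params read
```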

6. Significance and Forward Outlook

UltraMemV2 closes the performance gap between memory-layer architectures and advanced MoE models, supporting both efficient and scalable sparse computation. The architectural design supports up to 120B parameters with minimal memory access and demonstrates that activation density—defined as the number of values activated per token—is more consequential for real-world performance than the sheer total of sparse parameters.

A plausible implication is that future large-scale neural architectures may shift focus from maximizing global sparse capacity to optimizing per-token activation density, potentially informing new approaches in distributed model training and hardware design for memory-efficient inference. The methodological advances in initialization, blockwise integration, and linear-value processing suggest UltraMemV2 provides a template for next-generation sparse models tailored for sequence modeling, multi-turn tasks, and long-context retention.
