Gated DeltaNet-2: Advanced Memory Architecture
- The paper introduces separate channel-wise erase and write gates to decouple memory erasure and update, enhancing fine-grained fast-weight attention.
- It employs chunk-parallel computation with fused kernels and log-parameterized decay to maintain hardware efficiency and numerical stability in long-context settings.
- Empirical benchmarks demonstrate improved associative recall, language modeling, and retrieval performance, setting new standards over previous delta-based architectures.
Gated DeltaNet-2 is a second-generation linear memory architecture for sequence modeling, extending and generalizing the Gated DeltaNet and Kimi Delta Attention (KDA) families. By introducing separate channel-wise gates for erasure and writing within fast-weight attention, Gated DeltaNet-2 achieves fine-grained, axis-specific memory control. This separation leads to improved associative recall and long-context handling, while maintaining hardware efficiency via chunk-parallel computation and fused kernels. The model sets new benchmarks in language modeling, retrieval, and generalization under long-range interference.
1. Motivations and Fast-Weight Foundations
Traditional linear attention compresses sequence history into a fixed-size recurrent state, typically accumulating key-value outer products without explicit memory management. Delta-rule models, such as DeltaNet and Gated DeltaNet, improve over naive accumulation by first "reading" the old content at the current key, subtracting (erasing) a fraction, and subsequently writing the new value. In earlier models, a single scalar "delta" gate controlled both the amount of erasure (on the key axis) and the strength of write (on the value axis) for each step. KDA sharpened the decay control to channel-wise vectors, but retained a single scalar for write, limiting flexibility in state editing (Hatamizadeh et al., 21 May 2026).
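The read, erase, write sequence of the classic scalar-gated delta rule can be sketched in a few lines of NumPy. This is a minimal illustration rather than any paper's reference code: the state layout `(d_k, d_v)`, the function name `delta_rule_step`, and the variable names are assumptions made for the example.

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """One scalar-gated delta-rule update on a (d_k, d_v) fast-weight state.

    Assumed layout: rows index key channels, columns index value channels.
    """
    k = k / np.linalg.norm(k)          # unit-norm key
    old = k @ S                        # read: content currently stored at k, shape (d_v,)
    S = S - beta * np.outer(k, old)    # erase a fraction beta of that content
    S = S + beta * np.outer(k, v)      # write the new value with the same strength
    return S

rng = np.random.default_rng(0)
S = rng.normal(size=(4, 3))
k = rng.normal(size=4)
v = rng.normal(size=3)

S_new = delta_rule_step(S, k, v, beta=1.0)
k_unit = k / np.linalg.norm(k)
readout = k_unit @ S_new               # with beta = 1 the slot at k holds exactly v
```

Because a single `beta` scales both the erase and the write term, the two operations cannot be tuned independently, which is the coupling the text describes.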
Gated DeltaNet-2 (GDN-2) addresses this limitation by introducing distinct channel-wise erase and write gates, so that how much old content is removed along the key axis and how strongly new content is injected along the value axis are controlled independently. When both gates reduce to scalars, GDN-2 recovers KDA and Gated DeltaNet as special cases. This decoupling is central to its advantage in long-context, high-interference settings.
2. Mathematical Formalism
A. Gate Parameterizations
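At each timestep $t$, every gate is computed from the per-token feature $x_t$ through a learned projection:
- Erase gate: a channel-wise vector over the key axis, produced by an elementwise sigmoid.
- Write gate: a channel-wise vector over the value axis, produced by an elementwise sigmoid.
- Decay (channel-wise): computed with a log-parameterization for numerical stability, using head-wise learnable parameters.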
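A minimal sketch of how such gates might be computed for a single head is shown below. The erase and write gates follow the sigmoid-of-projection description in the text. The decay formula is an assumption: it uses a softplus scaled by `exp(a_log)`, a form common in related gated recurrent models, and is not confirmed as the paper's exact parameterization. All names (`compute_gates`, `W_erase`, `W_write`, `W_decay`, `a_log`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(z):
    # log(1 + exp(z)), computed stably
    return np.logaddexp(0.0, z)

def compute_gates(x, W_erase, W_write, W_decay, a_log):
    """Return (erase, write, log_decay) for one token and one head.

    x: (d_model,) token feature
    W_erase, W_decay: (d_model, d_k); W_write: (d_model, d_v)
    a_log: scalar head-wise parameter (assumed form, see lead-in)
    """
    erase = sigmoid(x @ W_erase)                         # (d_k,), each entry in (0, 1)
    write = sigmoid(x @ W_write)                         # (d_v,), each entry in (0, 1)
    log_decay = -np.exp(a_log) * softplus(x @ W_decay)   # (d_k,), always <= 0
    return erase, write, log_decay

rng = np.random.default_rng(0)
d_model, d_k, d_v = 8, 4, 3
x = rng.normal(size=d_model)
erase, write, log_decay = compute_gates(
    x,
    rng.normal(size=(d_model, d_k)),
    rng.normal(size=(d_model, d_v)),
    rng.normal(size=(d_model, d_k)),
    a_log=0.0,
)
decay = np.exp(log_decay)   # exponentiate only when the actual factor is needed
```

Keeping the decay as a non-positive log value means products of many decay factors become sums, which is what makes long-range accumulation numerically safe.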
B. Gated Delta Rule-2 Recurrence
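Let $S_{t-1}$ be the previous fast-weight state. Each step then:
- Forms the L2-normalized key $k_t$ and the value $v_t$.
- Forms the gated vectors by applying the erase gate elementwise along the key axis and the write gate elementwise along the value axis.
- Applies the channel-wise decay to $S_{t-1}$.
- Updates the memory (Gated Delta Rule-2, channel-wise asymmetric): the content stored at $k_t$ is read, the erase-gated part is subtracted, and the write-gated value is added.
- Reads the output by querying the updated state with $q_t$.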
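The sketch below gives one plausible reading of a recurrence with separate channel-wise erase and write gates. It is an assumption built from the prose description, not the paper's verified formula: the erase gate scales the key on the erase side, the write gate scales the value on the write side, and the state is stored as `(d_k, d_v)`. The function name `gdn2_like_step` is hypothetical. The useful property to check is the one the text states: when both gates collapse to the same scalar and the decay is switched off, the update reduces to the scalar delta rule.

```python
import numpy as np

def gdn2_like_step(S, q, k, v, erase, write, log_decay):
    """One step of an assumed decoupled-gate delta recurrence.

    S: (d_k, d_v) state; erase, log_decay: (d_k,); write: (d_v,).
    """
    k = k / np.linalg.norm(k)
    q = q / np.linalg.norm(q)
    S = np.exp(log_decay)[:, None] * S      # channel-wise decay along the key axis
    old = k @ S                             # read content stored at k, shape (d_v,)
    S = S - np.outer(erase * k, old)        # erase, gated per key channel
    S = S + np.outer(k, write * v)          # write, gated per value channel
    return S, q @ S                         # new state and output read

rng = np.random.default_rng(1)
d_k, d_v, beta = 4, 3, 0.6
S0 = rng.normal(size=(d_k, d_v))
q, k, v = rng.normal(size=d_k), rng.normal(size=d_k), rng.normal(size=d_v)

# Scalar special case: both gates equal beta everywhere, no decay.
S_vec, out = gdn2_like_step(
    S0, q, k, v,
    erase=np.full(d_k, beta),
    write=np.full(d_v, beta),
    log_decay=np.zeros(d_k),
)

# Reference scalar delta rule for comparison.
ku = k / np.linalg.norm(k)
S_ref = S0 - beta * np.outer(ku, ku @ S0) + beta * np.outer(ku, v)
```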
C. Efficient Chunkwise Parallelism
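Sequences are split into fixed-length chunks and computed with a generalized WY algorithm:
- Track the cumulative log-decays within each chunk, together with the corresponding cumulative decay products.
- Rescale the per-token vectors within each chunk by these cumulative decays.
- Update the state once at the end of each chunk and carry it into the next chunk.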
All forward, backward, and state updates are implemented via fused dense kernels or small vector-Jacobian products, preserving high throughput (Hatamizadeh et al., 21 May 2026).
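The role of the cumulative log-decays can be illustrated on a simplified model. The sketch below deliberately omits the delta-rule (WY) correction and keeps only channel-wise decay plus additive writes, so it is a stand-in for the chunking idea rather than the GDN-2 kernel. It checks that chunked computation with rescaled queries and keys matches the step-by-step recurrence. Function names, shapes, and the chunk length are assumptions for the example.

```python
import numpy as np

def sequential(q, k, v, log_a):
    """Reference: S_t = diag(a_t) S_{t-1} + k_t v_t^T, o_t = q_t^T S_t."""
    T, d_k = k.shape
    S = np.zeros((d_k, v.shape[1]))
    out = np.zeros_like(v)
    for t in range(T):
        S = np.exp(log_a[t])[:, None] * S + np.outer(k[t], v[t])
        out[t] = q[t] @ S
    return out

def chunked(q, k, v, log_a, chunk):
    """Same computation, one dense block per chunk plus a carried state."""
    T, d_k = k.shape
    S = np.zeros((d_k, v.shape[1]))
    out = np.zeros_like(v)
    for s in range(0, T, chunk):
        e = min(s + chunk, T)
        g = np.cumsum(log_a[s:e], axis=0)      # cumulative log-decay inside the chunk
        q_s = q[s:e] * np.exp(g)               # queries rescaled by decay since chunk start
        k_s = k[s:e] * np.exp(-g)              # keys rescaled by the inverse
        A = np.tril(q_s @ k_s.T)               # causal intra-chunk scores
        out[s:e] = A @ v[s:e] + q_s @ S        # intra-chunk part + contribution of carried state
        k_end = k[s:e] * np.exp(g[-1] - g)     # keys decayed to the chunk boundary
        S = np.exp(g[-1])[:, None] * S + k_end.T @ v[s:e]   # chunk-terminal state update
    return out

rng = np.random.default_rng(2)
T, d_k, d_v = 10, 4, 3
q = rng.normal(size=(T, d_k))
k = rng.normal(size=(T, d_k))
v = rng.normal(size=(T, d_v))
log_a = np.log(rng.uniform(0.9, 1.0, size=(T, d_k)))

out_seq = sequential(q, k, v, log_a)
out_chk = chunked(q, k, v, log_a, chunk=4)
```

The `exp(-g)` rescaling stays well behaved only because `g` is reset at every chunk boundary, which is one reason chunk lengths are kept short in practice.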
3. Comparison with Predecessor Architectures
| Variant | Decay Gate | Erase Gate | Write Gate |
|---|---|---|---|
| Gated DeltaNet | scalar | scalar | scalar |
| Kimi Delta Attention (KDA) | vector | scalar | scalar |
| FG0-GDN | vector | vector | vector |
| GDN-2 | vector | vector | vector |
This progression culminates in Gated DeltaNet-2, which is strictly more expressive by allowing channel-wise and axis-specific decay, erase, and write, addressing the previously imposed coupling between erasure and write strengths (Hatamizadeh et al., 21 May 2026, Sun et al., 21 Apr 2026).
4. Block Composition and Implementation
Each block in Gated DeltaNet-2 consists of the following sequence:
- Gated DeltaNet-2 token mixer as described above
- MLP
- Optionally, Sliding-Window Attention (for hybrid variants)
- MLP
- RMS-norm and SiLU gating at output
Keys and queries are obtained via causal convolution, SiLU, and L2 normalization; values come from a parallel convolution and linear stack. Chunkwise state updates and outputs are performed with fused kernels in fp32, using 16 heads per block, with the head dimension and chunk size fixed for kernel efficiency. All gates are computed via learned projections from the token input, followed by an elementwise sigmoid (or a log-exp parameterization for the decays).
The backward pass propagates gradients to the gate parameters through the fused WY triangular solve, keeping parallel training both efficient and correct (Hatamizadeh et al., 21 May 2026).
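The key and query path described above (causal convolution, SiLU, L2 normalization) can be sketched as follows. This is an illustrative NumPy version for a single head with a depthwise convolution. The kernel width, the absence of a preceding linear projection, and the names `causal_depthwise_conv` and `make_keys` are assumptions for the example.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def causal_depthwise_conv(x, w):
    """x: (T, d) inputs, w: (K, d) per-channel taps. Output at t sees x[t-K+1 .. t] only."""
    T, d = x.shape
    K = w.shape[0]
    x_pad = np.vstack([np.zeros((K - 1, d)), x])   # left-pad so no future token leaks in
    return np.stack([(x_pad[t:t + K] * w).sum(axis=0) for t in range(T)])

def make_keys(x, w, eps=1e-6):
    h = silu(causal_depthwise_conv(x, w))
    return h / (np.linalg.norm(h, axis=-1, keepdims=True) + eps)   # L2-normalize per token

rng = np.random.default_rng(3)
x = rng.normal(size=(8, 4))
w = rng.normal(size=(3, 4))
keys = make_keys(x, w)

# Perturb only future tokens: earlier keys must be unchanged if the path is causal.
x_future = x.copy()
x_future[5:] += 1.0
keys_future = make_keys(x_future, w)
```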
5. Empirical Performance and Benchmarks
1.3B parameter GDN-2 models, trained on 100B tokens (FineWeb-Edu, max train length 4K), achieve:
- Language modeling (WikiText/LAMBADA perp.):
- Recurrent GDN-2: 15.90 / 11.41
- Hybrid GDN-2: 15.62 / 10.43
- Stronger than both Gated DeltaNet (16.40 / 11.89) and KDA (16.81 / 11.68)
- Commonsense reasoning (average accuracy over the PIQA to BoolQ task suite): 53.11% (recurrent), 53.97% (hybrid), the highest among all considered variants
- Needle-in-a-haystack (RULER): Best retrieval performance, especially in multi-key and interference-prone settings; maintains high accuracy as context length increases
- Real-world retrieval (SWDE, SQuAD, TriviaQA, FDA, NQ, DROP): 29.88% recall (recurrent), 42.28% (hybrid)—both outperforming previous linear attention variants
- Throughput: sustains 38K to 36K tokens/s as context grows from 2K to 16K tokens on an H100, slower than KDA, with all gating overhead handled via elementwise operations (Hatamizadeh et al., 21 May 2026, Sun et al., 21 Apr 2026)
Ablations confirm that the channel-wise erase gate yields the majority of the gain, while the distinct write gate further boosts performance, especially in retrieval and few-shot tasks. This suggests that axis-specific, learnable update rates are critical for robust associative memory under interference.
6. Geometric Perspective and Deep Residual Generalization
Gated DeltaNet-2 admits a geometric interpretation as a depthwise rank-1 operator, similar to Deep Delta Learning (Zhang et al., 1 Jan 2026). For a unit-norm direction $k$ and gate $\beta$, the delta operator $A(\beta) = I - \beta k k^\top$ interpolates between identity, projection, and reflection, with the gate controlling spectral behavior:
- $\beta = 0$: identity mapping
- $\beta = 1$: projection onto the hyperplane $k^\perp$ orthogonal to $k$
- $\beta = 2$: Householder reflection across $k^\perp$
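In DDL-style architectures, the residual is modulated by a synchronous, gated delta: with layer index $l$, unit direction $k_l$, value $v_l$, and gate $\beta_l$, the update is $X_{l+1} = (I - \beta_l k_l k_l^\top) X_l + \beta_l k_l v_l^\top$. This admits enhanced gradient stability, fast convergence, and improved calibration, and GDN-2 can be interpreted as the sequential (recurrent) version of this geometric update (Zhang et al., 1 Jan 2026).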
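The three regimes of the rank-1 operator are easy to verify numerically. The sketch below assumes the operator has the form `I - beta * k k^T` with a unit-norm `k`, and the helper name `delta_operator` is hypothetical. Its eigenvalue along `k` is `1 - beta`, which gives identity, projection, and reflection at `beta = 0, 1, 2`.

```python
import numpy as np

def delta_operator(k, beta):
    """Rank-1 gated operator I - beta * k k^T for a unit-normalized direction k."""
    k = k / np.linalg.norm(k)
    return np.eye(k.size) - beta * np.outer(k, k)

k = np.array([3.0, 4.0, 0.0])
k_unit = k / np.linalg.norm(k)

A0 = delta_operator(k, 0.0)   # identity
A1 = delta_operator(k, 1.0)   # orthogonal projection onto the hyperplane normal to k
A2 = delta_operator(k, 2.0)   # Householder reflection across that hyperplane

along_k = [A @ k_unit for A in (A0, A1, A2)]   # k scaled by 1, 0, and -1 respectively
```

Intermediate values of `beta` shrink or flip only the component along `k` and leave every orthogonal direction untouched, which is why the gate is said to control the spectrum.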
A plausible implication is that the deep connection between projection-based memory control and fast-weight indexing facilitates better matching of the memory update structure to the algebraic needs of sequence modeling.
7. Best Practices and Practical Considerations
- Both erase and write gates must be channel-wise for optimal associative recall and stability; scalar approximations degrade performance by 0.3–0.7 perplexity and multiple points of retrieval accuracy.
- Decay should be implemented via log-parameterization in fp32 to avoid roundoff in long-range contexts.
- Kernel fusion should operate at a fixed chunk size tuned for maximal throughput.
- L2 normalization of keys/queries per head improves numerical stability.
- Chunkwise WY kernel and vector-Jacobian backward are essential to preserve both efficiency and differentiability in training.
Gated DeltaNet-2 demonstrates that independent, vectorized erasing and writing within fast-weight memory models are essential for overcoming interference and saturation in long-context settings, establishing a new standard among linear attention mechanisms (Hatamizadeh et al., 21 May 2026, Sun et al., 21 Apr 2026, Zhang et al., 1 Jan 2026).