Gated DeltaNet-2: Efficient Linear Attention
- The paper introduces independent channel-wise erase and write gates that enable flexible memory updates, overcoming limitations of previous single-gate models.
- It demonstrates state-of-the-art performance with improved metrics such as a WikiText PPL of 15.90 and superior retrieval accuracy compared to Gated DeltaNet and KDA.
- Empirical evaluations show that the model scales efficiently for long sequences, employing chunkwise WY updates to maintain high throughput on retrieval-heavy tasks.
Gated DeltaNet-2 is a linear attention architecture that extends the delta-rule fast-weight approach by introducing channel-wise decoupled erase and write mechanisms in the memory update, achieving state-of-the-art performance among linear recurrent models, particularly on long-context and retrieval-heavy tasks. It generalizes previous models such as Gated DeltaNet and Kimi Delta Attention (KDA) by inheriting their adaptive, channel-wise forgetting but addresses the key deficiency of those approaches: the use of a single scalar gate to jointly control memory erasure and writing. Instead, Gated DeltaNet-2 deploys independent erase gates for each key channel and write gates for each value channel, thus enabling more flexible and expressive memory dynamics (Hatamizadeh et al., 21 May 2026).
1. Motivation and Relation to Previous Methods
Linear attention models replace traditional softmax attention's unbounded state with a fixed-size, recurrent associative memory (a state matrix $S_t$), leading to linear computation time and fixed memory during inference. In standard delta-rule models such as DeltaNet and Gated DeltaNet, each memory update actively overwrites the association addressed by the current key, while KDA sharpened adaptive forgetting by introducing a channel-wise decay vector $\alpha_t$.
However, these methods tie two distinct operations, erasure on the key axis and writing on the value axis, to a single scalar gate $\beta_t$. This design limits control, particularly as the context grows long and memory interference increases. Gated DeltaNet-2 disentangles these processes by introducing separate, channel-wise gates:
- Erase gate for key-wise erasure
- Write gate for value-wise writing
- Retaining channel-wise decay for adaptive forgetting.
Special cases of this formulation recover KDA when the erase and write gates collapse to a single shared scalar (with the decay kept channel-wise), and Gated DeltaNet when the decay is also scalar (Hatamizadeh et al., 21 May 2026).
2. Mathematical Formulation
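At time step $t$, let $k_t$ denote the key and $v_t$ the value, with input features $x_t$. Three gates are computed from $x_t$:
- a channel-wise decay gate
- an erase gate with one entry per key channel
- a write gate with one entry per value channel
The recurrent memory state is first decayed channel-wise by the decay gate. The read and write vectors are then gated channel-wise: the erase gate acts on the key side and the write gate on the value side.
The full Gated Delta Rule-2 update combines these steps and can equivalently be written as an explicit "erase + write" decomposition. When both gates reduce to the same scalar, the update recovers KDA; further restricting decay to a scalar recovers Gated DeltaNet (Hatamizadeh et al., 21 May 2026).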
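The decay, gated read, and erase-plus-write steps described above can be sketched for a single time step. This is a minimal illustration under stated assumptions, not the paper's reference implementation: the memory is taken to be a $d_k \times d_v$ matrix, the decay and erase gate act per key channel, the write gate acts per value channel, and the function and argument names (`gdn2_step`, `decay`, `erase`, `write`) are hypothetical. The assumed form is chosen so that a shared scalar gate gives back the KDA-style update $(I - \beta k k^\top)\,\mathrm{Diag}(\alpha)\,S + \beta k v^\top$.

```python
def gdn2_step(S, k, v, decay, erase, write):
    """One recurrent step of a decoupled erase/write delta rule (assumed form).

    S      : d_k x d_v memory, stored as a list of rows
    k, v   : key (length d_k) and value (length d_v)
    decay  : per-key-channel decay, each entry in (0, 1]
    erase  : per-key-channel erase gate
    write  : per-value-channel write gate
    """
    d_k, d_v = len(k), len(v)
    # 1) Channel-wise decay of the previous state along the key axis.
    S_dec = [[decay[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Read the decayed memory along the erase-gated key.
    read = [sum(erase[i] * k[i] * S_dec[i][j] for i in range(d_k))
            for j in range(d_v)]
    # 3) Erase the read content at key k, then write the gated value there.
    return [[S_dec[i][j] - k[i] * read[j] + k[i] * write[j] * v[j]
             for j in range(d_v)] for i in range(d_k)]


def read_memory(S, q):
    """Query the memory with q: returns q^T S as a list of length d_v."""
    return [sum(q[i] * S[i][j] for i in range(len(q)))
            for j in range(len(S[0]))]
```

With a unit-norm key, no decay, and both gates set to one, querying the updated memory with the same key returns the written value exactly, which is the overwrite behavior the delta rule is meant to provide.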
3. Fast-Weight Perspective and Chunkwise WY Algorithm
Gated DeltaNet-2 can be read as a fast-weight memory: at each step, the update is the solution of a local quadratic objective, and the minimizer of that objective is the Gated Delta Rule-2 update.
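For orientation, the classical ungated delta rule, which this construction generalizes, already has this fast-weight reading: it is one gradient step on a quadratic reconstruction loss. The derivation below is a sketch of that well-known special case only. The symbols ($S \in \mathbb{R}^{d_k \times d_v}$ for the memory, $\beta_t$ for the step size) follow the usual DeltaNet convention and are assumptions here, not the gated objective of Gated DeltaNet-2.

$$
\mathcal{L}_t(S) = \tfrac{1}{2}\,\lVert S^\top k_t - v_t \rVert^2,
\qquad
\nabla_S \mathcal{L}_t(S) = k_t\,\bigl(S^\top k_t - v_t\bigr)^\top
$$

$$
S_t = S_{t-1} - \beta_t\,\nabla_S \mathcal{L}_t(S_{t-1})
    = \bigl(I - \beta_t\,k_t k_t^\top\bigr)\,S_{t-1} + \beta_t\,k_t v_t^\top
$$

The first term erases what the memory currently returns for $k_t$ and the second writes the new value, both scaled by the same $\beta_t$. This shared scaling is the coupling that the separate erase and write gates remove.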
For long sequences, chunkwise updates based on the WY representation (a compact form for products of rank-one-modified identity matrices) allow for parallel computation by splitting a sequence of length $L$ into chunks of size $C$:
- Decay is applied cumulatively and component-wise within each chunk.
- Updates combine as a series of rank-one corrections, enabling efficient factorization and chunk outputs via triangular forms, which are hardware-friendly (Hatamizadeh et al., 21 May 2026).
The memory updates remain linear in the sequence length, the chunk solves add a cost that depends on the chunk size, and the recurrent state stays fixed in size, with additional working memory per chunk.
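The cumulative, component-wise decay used by the chunkwise scheme can be illustrated with a short sketch. It assumes that decays are stored as per-step log values and that the running product restarts at each chunk boundary, a common choice in chunkwise linear-attention kernels but an assumption here. The names `chunk_cumulative_decay` and `chunk_size` are hypothetical.

```python
import math
from itertools import accumulate


def chunk_cumulative_decay(log_decay, chunk_size):
    """Within-chunk cumulative decay, computed per channel in log space.

    log_decay  : list of L per-step vectors, each a list of d log-decays (<= 0)
    chunk_size : number of steps per chunk
    Returns a list of chunks; entry [c][r][j] is the product of channel j's
    decay over steps 0..r of chunk c.
    """
    chunks = []
    for start in range(0, len(log_decay), chunk_size):
        block = log_decay[start:start + chunk_size]
        # Transpose to channels, take a prefix sum per channel, transpose back.
        per_channel = [list(accumulate(col)) for col in zip(*block)]
        chunks.append([[math.exp(x) for x in row] for row in zip(*per_channel)])
    return chunks
```

Working in log space turns the running product into a prefix sum, which is numerically safer over long chunks and maps directly onto scan primitives.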
4. Gate-Aware Backward Pass
Gated DeltaNet-2 requires a backward pass that explicitly accumulates gradients with respect to the separate erase and write gates inside the WY factors. For memory- and throughput-efficient parallel training, the gradient computation is adapted so that each gate receives its own accumulation, with analogous operations for the decays. Gradients with respect to logarithmic decay factors are computed via a reverse cumulative sum. The triangular solves and output computations share the efficient vector-Jacobian structure of KDA (Hatamizadeh et al., 21 May 2026).
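The reverse cumulative sum follows from how log-decays enter the forward pass. If the forward pass uses cumulative log-decays $c_r = \sum_{s \le r} g_s$, then each per-step log-decay $g_s$ influences every $c_r$ with $r \ge s$, so its gradient is the suffix sum of the upstream gradients. The sketch below shows this for a single channel; the helper names are hypothetical and the fused kernel of the paper is not reproduced.

```python
from itertools import accumulate


def cumsum(xs):
    """Prefix sums: out[r] = xs[0] + ... + xs[r]."""
    return list(accumulate(xs))


def reverse_cumsum(xs):
    """Suffix sums: out[s] = xs[s] + xs[s+1] + ... + xs[-1]."""
    return list(accumulate(reversed(xs)))[::-1]


def grad_log_decay(grad_cumulative):
    """Backpropagate through c = cumsum(g): dL/dg[s] = sum over r >= s of dL/dc[r]."""
    return reverse_cumsum(grad_cumulative)
```

For example, upstream gradients `[1, 2, 3]` on the cumulative values give `[6, 5, 3]` on the per-step log-decays.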
5. Empirical Performance
Gated DeltaNet-2 was evaluated with 1.3B-parameter models trained on 100B FineWeb-Edu tokens. The reported results show it outperforming Gated DeltaNet, KDA, Mamba-2, and Mamba-3 variants in both recurrent and hybrid (2K sliding-window attention) configurations:
| Model | WikiText PPL | LAMBADA PPL / Acc | Commonsense Acc | RULER S-NIAH-2@4K | RULER MK-NIAH-1@4K | Retrieval Avg |
|---|---|---|---|---|---|---|
| Gated DeltaNet-2 | 15.90 | 11.41 / 48.09% | 53.11% | 93.0% | 37.8% | 29.88% |
| Gated DeltaNet | 16.40 | 11.88 / 47.13% | 52.85% | 87.2% | 27.8% | 28.09% |
| KDA | 16.81 | 12.22 / 47.27% | 51.98% | 89.0% | 28.0% | 28.67% |
On synthetic multi-key retrieval benchmarks (RULER MK-NIAH-1 @ 4K), Gated DeltaNet-2 achieves 37.8% (vs. 27.8% for Gated DeltaNet and 28.0% for KDA), highlighting improved resistance to interference in long contexts. Throughput (H100, hybrid model) reaches 38.0 Kt/s for 2K×8 batches, with near-flat scaling at long sequence lengths (dropping only to 36.1 Kt/s at 16K×1), approximately 7% below KDA but vastly outperforming full softmax attention under long sequence loads (Hatamizadeh et al., 21 May 2026).
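The size of these differences is easier to judge as relative changes. The snippet below recomputes them from the figures quoted above; it adds no new measurements, and the variable names are illustrative only.

```python
def relative_change(new, old):
    """Signed relative change of `new` with respect to `old`."""
    return (new - old) / old


# Figures quoted in the text above.
throughput_2k, throughput_16k = 38.0, 36.1  # Kt/s, hybrid model on H100
mk_niah = {"Gated DeltaNet-2": 37.8, "Gated DeltaNet": 27.8, "KDA": 28.0}

throughput_drop = -relative_change(throughput_16k, throughput_2k)
gain_over_gdn = mk_niah["Gated DeltaNet-2"] - mk_niah["Gated DeltaNet"]
gain_over_kda = mk_niah["Gated DeltaNet-2"] - mk_niah["KDA"]

print(f"throughput drop from 2K x 8 to 16K x 1: {throughput_drop:.1%}")
print(f"MK-NIAH-1 gain over Gated DeltaNet: {gain_over_gdn:.1f} points")
print(f"MK-NIAH-1 gain over KDA: {gain_over_kda:.1f} points")
```

Throughput falls by 5.0% when moving from 2K×8 to 16K×1, while the multi-key retrieval gain is 10.0 points over Gated DeltaNet and 9.8 points over KDA.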
6. Ablation Studies and Analysis
Ablation results support the importance of full gate decoupling:
- Using a scalar erase gate and a channel-wise write gate yields degraded performance (Wiki ppl 16.55, S-NIAH-2@4K=90.6%).
- A channel-wise erase gate with a scalar write gate recovers most gains (Wiki ppl 16.12, S-NIAH-2@4K=92.1%).
- Full channel-wise decoupling is optimal (Wiki ppl 15.90, S-NIAH-2@4K=93.0%).
The memory edit is therefore primarily mediated by key-side (erase) gating, while value-side (write) gating contributes additional improvement. Allowing 9 produced no consistent gains at this model size. This suggests the principal expressivity arises from the key-gated erasure pathway (Hatamizadeh et al., 21 May 2026).
7. Architectural Impact and Connections
Gated DeltaNet-2 generalizes and subsumes both KDA and Gated DeltaNet, reduces to FG0-GDN1 in the limit where erasure and write gates control keys and values independently, and fits within the broader trend of per-channel control in sequence models. Whereas FG2-GDN and FG3-GDN4 focus on per-channel step sizes (drawing analogies to AdaGrad/Adam in adaptive optimization), Gated DeltaNet-2 implements direct channel-wise gating, yielding similar fine-grained adaptation in memory dynamics at a modest incremental runtime cost (Sun et al., 21 Apr 2026).
Hybrid variants incorporating sliding-window attention retain the benefits of Gated DeltaNet-2 on long-range tasks and further boost local context modeling. This architectural flexibility, combined with an efficient chunkwise implementation and state-of-the-art accuracy on retrieval and long-context understanding, positions Gated DeltaNet-2 as a leading approach for scalable linear recurrent attention models (Hatamizadeh et al., 21 May 2026, Yang et al., 2024, Sun et al., 21 Apr 2026).