
Gated DeltaNet-2: Efficient Linear Attention

Updated 2 July 2026
  • The paper introduces independent channel-wise erase and write gates that enable flexible memory updates, overcoming limitations of previous single-gate models.
  • It demonstrates state-of-the-art performance with improved metrics such as a WikiText PPL of 15.90 and superior retrieval accuracy compared to Gated DeltaNet and KDA.
  • Empirical evaluations show that the model scales efficiently for long sequences, employing chunkwise WY updates to maintain high throughput on retrieval-heavy tasks.

Gated DeltaNet-2 is a linear attention architecture that extends the delta-rule fast-weight approach by introducing channel-wise decoupled erase and write mechanisms in the memory update. It achieves state-of-the-art performance among linear recurrent models, particularly on long-context and retrieval-heavy tasks. It generalizes previous models such as Gated DeltaNet and Kimi Delta Attention (KDA): it inherits adaptive forgetting (channel-wise, as in KDA) but addresses a key deficiency of those approaches, namely the use of a single scalar gate to jointly control memory erasure and writing. Instead, Gated DeltaNet-2 deploys independent erase gates for each key channel and write gates for each value channel, enabling more flexible and expressive memory dynamics (Hatamizadeh et al., 21 May 2026).

1. Motivation and Relation to Previous Methods

Linear attention models replace traditional softmax attention's unbounded state with a fixed-size, recurrent associative memory ($M_t \in \mathbb{R}^{d_k \times d_v}$), leading to linear computation time and fixed memory during inference. In standard delta-rule models such as DeltaNet and Gated DeltaNet, memory updates actively overwrite the association addressed by the current key, while KDA sharpened adaptive forgetting by introducing a channel-wise decay vector $\alpha_t \in (0,1]^{d_k}$.

However, these methods tie two distinct operations (erasure on the key axis and writing on the value axis) to a single scalar gate $\beta_t$. This design limits control, particularly as the context grows long and memory interference increases. Gated DeltaNet-2 disentangles these processes by introducing separate, channel-wise gates:

  • Erase gate $b_t \in [0,1]^{d_k}$ for key-wise erasure
  • Write gate $w_t \in [0,1]^{d_v}$ for value-wise writing
  • Channel-wise decay $\alpha_t$, retained for adaptive forgetting.

Special cases of this formulation recover KDA when $b_t = w_t = \beta_t\mathbf{1}$ (with $\alpha_t$ channel-wise) and Gated DeltaNet when $\alpha_t = \alpha_t\mathbf{1}$ is also scalar (Hatamizadeh et al., 21 May 2026).

2. Mathematical Formulation

Let time step $t$ have key $k_t \in \mathbb{R}^{d_k}$ and value $v_t \in \mathbb{R}^{d_v}$, with input features $x_t$. The gates are computed as:

  • [equation missing in source]
  • [equation missing in source]
  • [equation missing in source]

The recurrent memory state is decayed as:

  • [equation missing in source]

Read and write vectors, gated channel-wise:

  • [equation missing in source]
  • [equation missing in source]

The full Gated Delta Rule-2 update is: [equation missing in source] This can be equivalently written as an explicit "erase + write" decomposition: [equation missing in source] When both gates reduce to the same scalar, the update recovers KDA; further restricting decay to a scalar recovers Gated DeltaNet (Hatamizadeh et al., 21 May 2026).
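The special cases stated above can be checked numerically with a small sketch. The NumPy code below is an illustration under stated assumptions, not the paper's update: the function names `gdn2_step` and `kda_step` are hypothetical, and the placement of the erase gate (applied to the key used to read the decayed memory) is one plausible choice that is consistent with the KDA and Gated DeltaNet reductions. The KDA-style step uses the commonly cited form $(I - \beta_t k_t k_t^\top)\,\mathrm{Diag}(\alpha_t)\,M_{t-1} + \beta_t k_t v_t^\top$.

```python
import numpy as np

def gdn2_step(M_prev, k, v, alpha, b, w):
    """One recurrent step of an ASSUMED decoupled delta-rule update.

    Shapes: M_prev (d_k, d_v); k, alpha, b (d_k,); v, w (d_v,).
    Assumed form (an illustration, not taken from the paper):
        M_dec = diag(alpha) @ M_prev
        M_new = M_dec - k (b * k)^T M_dec + k (w * v)^T
    """
    M_dec = alpha[:, None] * M_prev      # channel-wise decay along the key axis
    read = (b * k) @ M_dec               # (d_v,) readout through the erase-gated key
    erase = np.outer(k, read)            # rank-one erase term
    write = np.outer(k, w * v)           # rank-one write term, gated per value channel
    return M_dec - erase + write

def kda_step(M_prev, k, v, alpha, beta):
    """KDA-style step: one scalar beta gates both the erase and the write."""
    M_dec = alpha[:, None] * M_prev
    return M_dec - beta * np.outer(k, k @ M_dec) + beta * np.outer(k, v)

# Usage: with b = w = beta * 1 the decoupled step matches the single-gate step.
rng = np.random.default_rng(0)
d_k, d_v, beta = 4, 3, 0.7
M = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
v = rng.normal(size=d_v)
alpha = rng.uniform(0.5, 1.0, size=d_k)
same_as_kda = np.allclose(
    gdn2_step(M, k, v, alpha, np.full(d_k, beta), np.full(d_v, beta)),
    kda_step(M, k, v, alpha, beta),
)
```

Because the reduction only requires that both gates collapse to the same scalar, the check holds for other placements of the erase gate as well; it does not by itself confirm which placement the paper uses.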

3. Fast-Weight Perspective and Chunkwise WY Algorithm

Gated DeltaNet-2 implements a fast-weight memory update as the solution to a local quadratic objective at each step: [equation missing in source] The minimizer is: [equation missing in source]

For long sequences, chunkwise WY (compact WY representation) updates allow parallel computation by splitting a sequence of length $L$ into chunks of size $C$ and applying cumulative, component-wise decay:

  • [expression missing in source], [expression missing in source]
  • Updates combine as a series of rank-one corrections enabling efficient factorization and chunk outputs via triangular forms, which are hardware-friendly (Hatamizadeh et al., 21 May 2026).

Computational complexity remains [expression missing in source] for memory updates and [expression missing in source] for chunk solves, with $O(d_k d_v)$ recurrent state memory and [expression missing in source] per chunk.
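To make the role of the WY form and the triangular solve concrete, here is a minimal NumPy sketch for the plain, ungated delta rule with a scalar gate per step. It is a simplification chosen for illustration: the gated, decayed factors used by Gated DeltaNet-2 are not reproduced, and the helper names `wy_factors` and `sequential_product` are hypothetical. The sketch shows that a product of rank-one corrections over one chunk collapses to $I - K^\top W$, where $W$ solves a unit lower-triangular system.

```python
import numpy as np

def wy_factors(K, beta):
    """Return W with (I - b_C k_C k_C^T) ... (I - b_1 k_1 k_1^T) = I - K^T W.

    K: (C, d_k) keys of one chunk (row t is k_t); beta: (C,) scalar gates.
    Rows of W satisfy w_t = beta_t * (k_t - sum_{i<t} (k_t . k_i) w_i),
    which is the unit lower-triangular system (I + A) W = diag(beta) K.
    """
    C = K.shape[0]
    A = np.tril(beta[:, None] * (K @ K.T), k=-1)   # strictly lower-triangular part
    # A general solver is used for brevity; a dedicated triangular solver
    # would exploit the structure of (I + A).
    return np.linalg.solve(np.eye(C) + A, beta[:, None] * K)

def sequential_product(K, beta):
    """Reference: multiply the rank-one factors one step at a time."""
    d_k = K.shape[1]
    P = np.eye(d_k)
    for k_t, b_t in zip(K, beta):
        P = (np.eye(d_k) - b_t * np.outer(k_t, k_t)) @ P   # newest factor on the left
    return P

# Usage: both routes give the same chunk-level transition matrix.
rng = np.random.default_rng(0)
K = rng.normal(size=(6, 4))
beta = rng.uniform(0.0, 1.0, size=6)
P_wy = np.eye(4) - K.T @ wy_factors(K, beta)
matches = np.allclose(P_wy, sequential_product(K, beta))
```

The sequential loop needs one matrix product per step, while the WY route replaces it with matrix multiplications and a single triangular solve per chunk, which is the property that makes the chunkwise form hardware-friendly.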

4. Gate-Aware Backward Pass

Gated DeltaNet-2 requires a backward pass that explicitly accumulates gradients with respect to the separate gates $b_t$ and $w_t$ inside the WY factors. For memory- and throughput-efficient parallel training, the gradient computation is adapted so that

[equation missing in source]

and similar operations for […] and decays. Gradients with respect to logarithmic decay factors are computed via a reverse cumulative sum. The triangular solves and output computations share the efficient vector-Jacobian structure of KDA (Hatamizadeh et al., 21 May 2026).
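The reverse cumulative sum mentioned above follows from the chain rule: if the cumulative log-decay is $G_t = \sum_{s \le t} g_s$, then each per-step log-decay $g_s$ influences every $G_t$ with $t \ge s$, so its gradient is the sum of the upstream gradients from step $s$ onward. The sketch below illustrates only this generic identity; the function names and the toy loss are assumptions, and it is not the paper's kernel.

```python
import numpy as np

def cumulative_log_decay(g):
    """G[t] = sum_{s <= t} g[s], per channel. g has shape (C, d_k)."""
    return np.cumsum(g, axis=0)

def grad_log_decay(dG):
    """Backward of the cumulative sum: dg[s] = sum_{t >= s} dG[t]."""
    return np.cumsum(dG[::-1], axis=0)[::-1]

# Usage with a toy loss L = sum(c * exp(G)), so that dL/dG = c * exp(G).
rng = np.random.default_rng(0)
g = -rng.uniform(0.0, 0.5, size=(5, 3))   # log-decays are non-positive
c = rng.normal(size=(5, 3))
dG = c * np.exp(cumulative_log_decay(g))
dg = grad_log_decay(dG)                   # gradient w.r.t. per-step log-decays
```

A finite-difference perturbation of a single entry of `g` can be used to confirm that `dg` matches the numerical gradient of the toy loss.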

5. Empirical Performance

Gated DeltaNet-2 was evaluated at 1.3B parameters, trained on 100B FineWeb-Edu tokens. It outperforms Gated DeltaNet, KDA, Mamba-2, and Mamba-3 variants in both recurrent and hybrid (2K sliding-window attention) configurations:

| Model | WikiText PPL | LAMBADA PPL / Acc | Commonsense Acc | RULER S-NIAH-2 @ 4K | RULER MK-NIAH-1 @ 4K | Retrieval Avg |
|---|---|---|---|---|---|---|
| Gated DeltaNet-2 | 15.90 | 11.41 / 48.09% | 53.11% | 93.0% | 37.8% | 29.88% |
| Gated DeltaNet | 16.40 | 11.88 / 47.13% | 52.85% | 87.2% | 27.8% | 28.09% |
| KDA | 16.81 | 12.22 / 47.27% | 51.98% | 89.0% | 28.0% | 28.67% |

On synthetic multi-key retrieval benchmarks (RULER MK-NIAH-1 @ 4K), Gated DeltaNet-2 achieves 37.8% (vs. 27.8% for Gated DeltaNet and 28.0% for KDA), highlighting improved resistance to interference in long contexts. Throughput (H100, hybrid model) reaches 38.0 Kt/s for 2K×8 batches, with near-flat scaling at long sequence lengths (dropping only to 36.1 Kt/s at 16K×1), approximately 7% below KDA but vastly outperforming full softmax attention under long sequence loads (Hatamizadeh et al., 21 May 2026).
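The size of these differences can be made explicit with a few lines of arithmetic on the reported figures. The snippet below uses only numbers quoted above; the helper name `relative_change` is illustrative.

```python
def relative_change(new, old):
    """Fractional change from old to new."""
    return (new - old) / old

# Throughput of the hybrid model in Kt/s at 2K x 8 and at 16K x 1.
throughput_drop = -relative_change(36.1, 38.0)

# RULER MK-NIAH-1 @ 4K accuracy gaps, in percentage points.
mk_gain_vs_gdn = 37.8 - 27.8
mk_gain_vs_kda = 37.8 - 28.0

print(f"throughput drop: {throughput_drop:.1%}")
print(f"MK-NIAH-1 gain vs Gated DeltaNet: {mk_gain_vs_gdn:.1f} points")
print(f"MK-NIAH-1 gain vs KDA: {mk_gain_vs_kda:.1f} points")
```

This works out to a throughput drop of about 5% from the shortest to the longest setting, and multi-key retrieval gains of about 10.0 and 9.8 points over Gated DeltaNet and KDA respectively.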

6. Ablation Studies and Analysis

Ablation results support the importance of full gate decoupling:

  • Using a scalar $b_t$ (erase gate) and channel-wise $w_t$ (write gate) yields degraded performance (WikiText PPL 16.55, S-NIAH-2@4K = 90.6%).
  • Channel-wise $b_t$ with scalar $w_t$ recovers most gains (WikiText PPL 16.12, S-NIAH-2@4K = 92.1%).
  • Full channel-wise decoupling is optimal (WikiText PPL 15.90, S-NIAH-2@4K = 93.0%).

The memory edit is therefore primarily mediated by key-side (erase) gating, while value-side (write) gating contributes additional improvement. Allowing [expression missing in source] produced no consistent gains at this model size. This suggests the principal expressivity arises from the key-gated erasure pathway (Hatamizadeh et al., 21 May 2026).

7. Architectural Impact and Connections

Gated DeltaNet-2 generalizes and subsumes both KDA and Gated DeltaNet, reduces to FG[…]-GDN[…] in the limit where erasure and write gates control keys and values independently, and fits within the broader trend of per-channel control in sequence models. Whereas FG[…]-GDN and FG[…]-GDN[…] focus on per-channel step sizes (drawing analogies to AdaGrad/Adam in adaptive optimization), Gated DeltaNet-2 implements direct channel-wise gating, yielding similar fine-grained adaptation in memory dynamics at a modest incremental runtime cost (Sun et al., 21 Apr 2026).

Hybrid variants incorporating sliding-window attention retain the benefits of Gated DeltaNet-2 on long-range tasks and further boost local context modeling. This architectural flexibility, combined with an efficient chunkwise implementation and state-of-the-art accuracy on retrieval and long-context understanding, positions Gated DeltaNet-2 as a leading approach for scalable linear recurrent attention models (Hatamizadeh et al., 21 May 2026, Yang et al., 2024, Sun et al., 21 Apr 2026).
