Gated DeltaNet-2: Efficient Linear Attention

Updated 2 July 2026

The paper introduces independent channel-wise erase and write gates that enable flexible memory updates, overcoming limitations of previous single-gate models.
It reports state-of-the-art results among the compared linear models, including a WikiText PPL of 15.90 and higher retrieval accuracy than Gated DeltaNet and KDA.
Empirical evaluations show that the model scales efficiently for long sequences, employing chunkwise WY updates to maintain high throughput on retrieval-heavy tasks.

Gated DeltaNet-2 is a linear attention architecture that extends the delta-rule fast-weight approach with channel-wise, decoupled erase and write mechanisms in the memory update. It reports state-of-the-art performance among linear recurrent models, particularly on long-context and retrieval-heavy tasks. It generalizes Gated DeltaNet and Kimi Delta Attention (KDA): it keeps KDA's adaptive, channel-wise forgetting but addresses a key deficiency shared by both, namely the use of a single scalar gate to jointly control memory erasure and writing. Instead, Gated DeltaNet-2 uses an independent erase gate for each key channel and an independent write gate for each value channel, enabling more flexible and expressive memory dynamics (Hatamizadeh et al., 21 May 2026).

1. Motivation and Relation to Previous Methods

Linear attention models replace the key-value cache of softmax attention, which grows with sequence length, with a fixed-size recurrent associative memory ($M_t \in \mathbb{R}^{d_k \times d_v}$), giving computation linear in sequence length and constant memory during inference. In delta-rule models such as DeltaNet and Gated DeltaNet, each memory update actively overwrites the association addressed by the current key. KDA refined adaptive forgetting by introducing a channel-wise decay vector $\alpha_t \in (0,1]^{d_k}$.
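As a minimal sketch of the fixed-size associative memory described above, the snippet below applies the classic ungated delta rule, $M_t = M_{t-1} - \beta_t k_t (k_t^\top M_{t-1} - v_t^\top)$, to a memory of shape $d_k \times d_v$. The function names, dimensions, and random inputs are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def delta_rule_step(M, k, v, beta):
    """One ungated delta-rule update of a (d_k, d_v) memory."""
    pred = k @ M                      # value currently stored under key k, shape (d_v,)
    return M - beta * np.outer(k, pred - v)

def read(M, q):
    """Query the memory with q; returns a (d_v,) vector."""
    return q @ M

rng = np.random.default_rng(0)
d_k, d_v, steps = 8, 4, 1000          # illustrative sizes (assumptions)
M = np.zeros((d_k, d_v))
for _ in range(steps):
    k = rng.normal(size=d_k)
    k /= np.linalg.norm(k)            # unit-norm key
    v = rng.normal(size=d_v)
    M = delta_rule_step(M, k, v, beta=1.0)

state_shape = M.shape                 # independent of the number of steps
recalled = read(M, k)                 # read back the most recent key
```

Because `beta=1.0` and the key has unit norm, the most recently written pair is recalled exactly, while the state never grows beyond `(d_k, d_v)` no matter how many tokens are processed.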

However, these methods tie two distinct operations, erasure on the key axis and writing on the value axis, to a single scalar gate $\beta_t$. This design limits control, particularly as the context grows long and memory interference increases. Gated DeltaNet-2 disentangles these processes by introducing separate, channel-wise gates:

Erase gate $b_t \in [0,1]^{d_k}$ for key-wise erasure
Write gate $w_t \in [0,1]^{d_v}$ for value-wise writing
Channel-wise decay $\alpha_t \in (0,1]^{d_k}$, retained for adaptive forgetting

Special cases of this formulation recover KDA when $b_t = w_t = \beta_t\mathbf{1}$ (with $\mathbf{1}$ of the appropriate dimension and $\alpha_t$ kept channel-wise) and Gated DeltaNet when $\alpha_t$ is additionally restricted to a scalar multiple of $\mathbf{1}$ (Hatamizadeh et al., 21 May 2026).

2. Mathematical Formulation

Let time step $t$ have key $k_t \in \mathbb{R}^{d_k}$ and value $v_t \in \mathbb{R}^{d_v}$, with input features $x_t$. The gates are computed as:

[gate equations for $\alpha_t$, $b_t$, and $w_t$ not recovered]

The recurrent memory state is decayed as:

$\tilde{M}_t = \mathrm{Diag}(\alpha_t)\, M_{t-1}$

Read and write vectors, gated channel-wise:

[equations not recovered]

The full Gated Delta Rule-2 update is: [equation not recovered] This can be equivalently written as an explicit "erase + write" decomposition: [equation not recovered] When both gates reduce to the same scalar, the update recovers KDA; further restricting decay to a scalar recovers Gated DeltaNet (Hatamizadeh et al., 21 May 2026).
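The sketch below illustrates what a decoupled erase and write step could look like. It uses one assumed form that is consistent with the special cases stated in this article, $M_t = \tilde{M}_t - (b_t \odot k_t)(k_t^\top \tilde{M}_t) + k_t (w_t \odot v_t)^\top$ with $\tilde{M}_t = \mathrm{Diag}(\alpha_t) M_{t-1}$. The placement of the erase gate inside the outer product, the single-scalar-gate reference form $(I - \beta_t k_t k_t^\top)\mathrm{Diag}(\alpha_t) M_{t-1} + \beta_t k_t v_t^\top$, and all function names are assumptions; the paper may place the gates differently.

```python
import numpy as np

def decoupled_step(M, k, v, alpha, b, w):
    """Assumed decoupled update of a (d_k, d_v) memory (not the paper's exact form).

    alpha, b: (d_k,) decay and erase gates; w: (d_v,) write gate.
    """
    M_decayed = alpha[:, None] * M        # channel-wise decay along the key axis
    read_val = k @ M_decayed              # value currently stored under k, shape (d_v,)
    erase = np.outer(b * k, read_val)     # key-gated erase term
    write = np.outer(k, w * v)            # value-gated write term
    return M_decayed - erase + write

def tied_gate_step(M, k, v, alpha, beta):
    """Single scalar gate: (I - beta k k^T) Diag(alpha) M + beta k v^T."""
    M_decayed = alpha[:, None] * M
    return M_decayed - beta * np.outer(k, k @ M_decayed) + beta * np.outer(k, v)

rng = np.random.default_rng(0)
d_k, d_v = 6, 3                           # illustrative sizes (assumptions)
M = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)
v = rng.normal(size=d_v)
alpha = rng.uniform(0.5, 1.0, size=d_k)
beta = 0.7

# Tie both gates to the same scalar: b = beta * 1, w = beta * 1
tied = decoupled_step(M, k, v, alpha, np.full(d_k, beta), np.full(d_v, beta))
reference = tied_gate_step(M, k, v, alpha, beta)
matches_tied_gate = np.allclose(tied, reference)
```

Under this assumed form, setting both gates to the same scalar reproduces the single-gate update, which mirrors the reduction to KDA described above. Any other gate placement that satisfies the same reduction would pass this check equally well.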

3. Fast-Weight Perspective and Chunkwise WY Algorithm

Gated DeltaNet-2 implements a fast-weight memory update as the solution to a local quadratic objective at each step: [equation not recovered] The minimizer is: [equation not recovered]

For long sequences, chunkwise WY updates (the WY representation of a product of rank-one-modified identity matrices) allow parallel computation by splitting the sequence into fixed-size chunks and applying cumulative, component-wise decay within each chunk:

[equations not recovered]

Within a chunk, the updates combine as a series of rank-one corrections, which permits an efficient factorization and lets chunk outputs be computed through hardware-friendly triangular forms (Hatamizadeh et al., 21 May 2026).
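To illustrate why a chain of rank-one corrections admits a compact, triangular factorization, the sketch below numerically checks the WY identity $\prod_{i=1}^{n}(I - u_i k_i^\top) = I - \sum_{i=1}^{n} w_i k_i^\top$. Decay is omitted and the vectors $u_i$ are generic (they could stand for gated keys); both simplifications, and all names, are assumptions made for illustration rather than the paper's kernel.

```python
import numpy as np

def wy_factors(U, K):
    """Return W such that prod_i (I - u_i k_i^T) = I - W^T K.

    U, K: (n, d) arrays whose rows are u_i and k_i.
    Recurrence: w_t = u_t - sum_{i<t} (k_i . u_t) w_i,
    i.e. (I + A) W = U with A strictly lower triangular.
    """
    n = U.shape[0]
    A = np.tril(U @ K.T, k=-1)            # A[t, i] = u_t . k_i for i < t
    # (I + A) is unit lower triangular; a dedicated triangular solver would
    # be used in practice, np.linalg.solve keeps this sketch short.
    return np.linalg.solve(np.eye(n) + A, U)

rng = np.random.default_rng(0)
n, d = 5, 7                               # chunk length and key dimension (assumptions)
K = rng.normal(size=(n, d))
U = 0.5 * rng.normal(size=(n, d))

# Sequential product P = (I - u_1 k_1^T)(I - u_2 k_2^T)...(I - u_n k_n^T)
P = np.eye(d)
for u, k in zip(U, K):
    P = P @ (np.eye(d) - np.outer(u, k))

W = wy_factors(U, K)
P_wy = np.eye(d) - W.T @ K                # same matrix from the WY factors
wy_matches = np.allclose(P, P_wy)
```

The sequential loop needs `n` dependent matrix products, whereas the WY form needs one triangular solve of size `n` and two dense matrix products, which is the structure that maps well onto matrix-multiply hardware.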

Computational complexity remains [expression not recovered] for memory updates and [expression not recovered] for chunk solves, with $O(d_k d_v)$ recurrent state memory and [expression not recovered] per chunk.

4. Gate-Aware Backward Pass

Gated DeltaNet-2 requires a backward pass that explicitly accumulates gradients with respect to the separate gates $b_t$ and $w_t$ inside the WY factors. For memory- and throughput-efficient parallel training, the gradient computation is adapted so that

[equation not recovered]

with analogous operations for the remaining gate and decay terms. Gradients with respect to logarithmic decay factors are computed via a reverse cumulative sum. The triangular solves and output computations share the efficient vector-Jacobian structure of KDA (Hatamizadeh et al., 21 May 2026).
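The reverse cumulative sum can be seen in a small case. Assuming the kernel consumes cumulative log-decays $G_t = \sum_{s \le t} g_s$ within a chunk, where $g_s = \log \alpha_s$, the chain rule gives $\partial \mathcal{L} / \partial g_s = \sum_{t \ge s} \partial \mathcal{L} / \partial G_t$. The toy loss and all names below are illustrative assumptions, not the paper's kernel.

```python
import numpy as np

def grad_log_decay(dL_dG):
    """Map gradients w.r.t. cumulative log-decays G to per-step log-decays g.

    dL_dG: (T, d_k) array; returns a (T, d_k) array via a reverse cumsum over time.
    """
    return np.flip(np.cumsum(np.flip(dL_dG, axis=0), axis=0), axis=0)

def toy_loss(g, weights):
    """Toy scalar loss that depends on g only through G = cumsum(g)."""
    G = np.cumsum(g, axis=0)
    return np.sum(weights * np.exp(G))

rng = np.random.default_rng(0)
T, d_k = 4, 3                                        # illustrative sizes (assumptions)
g = np.log(rng.uniform(0.5, 1.0, size=(T, d_k)))     # per-step log-decays
weights = rng.normal(size=(T, d_k))

G = np.cumsum(g, axis=0)
dL_dG = weights * np.exp(G)                          # analytic gradient w.r.t. G
analytic = grad_log_decay(dL_dG)

# Central finite differences as an independent check
numeric = np.zeros_like(g)
eps = 1e-6
for t in range(T):
    for c in range(d_k):
        step = np.zeros_like(g)
        step[t, c] = eps
        numeric[t, c] = (toy_loss(g + step, weights)
                         - toy_loss(g - step, weights)) / (2 * eps)

grads_agree = np.allclose(analytic, numeric, atol=1e-6)
```

Each per-step log-decay influences every later cumulative value in the chunk, which is why its gradient is a suffix sum rather than a single term.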

5. Empirical Performance

Gated DeltaNet-2 was evaluated at 1.3B parameters, trained on 100B FineWeb-Edu tokens. It is reported to outperform Gated DeltaNet, KDA, Mamba-2, and Mamba-3 variants in both recurrent and hybrid (2K sliding-window attention) configurations. The table below compares it with Gated DeltaNet and KDA:

Model	WikiText PPL	LAMBADA PPL / Acc	Commonsense Acc	RULER S-NIAH-2@4K	RULER MK-NIAH-1@4K	Retrieval Avg
Gated DeltaNet-2	15.90	11.41 / 48.09%	53.11%	93.0%	37.8%	29.88%
Gated DeltaNet	16.40	11.88 / 47.13%	52.85%	87.2%	27.8%	28.09%
KDA	16.81	12.22 / 47.27%	51.98%	89.0%	28.0%	28.67%

On the synthetic multi-key retrieval benchmark (RULER MK-NIAH-1 @ 4K), Gated DeltaNet-2 achieves 37.8%, versus 27.8% for Gated DeltaNet and 28.0% for KDA, indicating improved resistance to interference in long contexts. Throughput of the hybrid model on an H100 reaches 38.0 Kt/s (thousand tokens per second) for 2K×8 batches and stays nearly flat at long sequence lengths, dropping only to 36.1 Kt/s at 16K×1. This is approximately 7% below KDA but far above full softmax attention under long-sequence loads (Hatamizadeh et al., 21 May 2026).
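The figures quoted above can be turned into relative changes with a few lines of arithmetic. The variable names are illustrative; the numbers are those reported in this section.

```python
# Throughput of the hybrid model on H100, in thousand tokens per second
kt_per_s_2k_x8 = 38.0
kt_per_s_16k_x1 = 36.1

# Relative throughput drop when moving from 2K x 8 to 16K x 1
throughput_drop_pct = 100 * (kt_per_s_2k_x8 - kt_per_s_16k_x1) / kt_per_s_2k_x8

# RULER MK-NIAH-1 @ 4K accuracy, in percent
mk_niah = {"Gated DeltaNet-2": 37.8, "Gated DeltaNet": 27.8, "KDA": 28.0}
gain_over_gdn = mk_niah["Gated DeltaNet-2"] - mk_niah["Gated DeltaNet"]
gain_over_kda = mk_niah["Gated DeltaNet-2"] - mk_niah["KDA"]

print(f"throughput drop: {throughput_drop_pct:.1f}%")
print(f"MK-NIAH-1 gain: +{gain_over_gdn:.1f} pts vs Gated DeltaNet, "
      f"+{gain_over_kda:.1f} pts vs KDA")
```

With these numbers, an eightfold increase in sequence length costs about 5.0% of throughput, and the multi-key retrieval gains are 10.0 and 9.8 percentage points.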

6. Ablation Studies and Analysis

Ablation results support the importance of full gate decoupling:

Using a scalar $b_t$ (erase gate) and channel-wise $w_t$ (write gate) yields degraded performance (Wiki ppl 16.55, S-NIAH-2@4K=90.6%).
Channel-wise $b_t$ with scalar $w_t$ recovers most gains (Wiki ppl 16.12, S-NIAH-2@4K=92.1%).
Full channel-wise decoupling is optimal (Wiki ppl 15.90, S-NIAH-2@4K=93.0%).

These results indicate that the memory edit is mediated primarily by key-side (erase) gating, with value-side (write) gating contributing a smaller additional improvement. Allowing [setting not recovered] produced no consistent gains at this model size (Hatamizadeh et al., 21 May 2026).

7. Architectural Impact and Connections

Gated DeltaNet-2 generalizes both KDA and Gated DeltaNet, reduces to a variant of FG$^2$-GDN in the limit where erasure and write gates control keys and values independently, and fits within the broader trend of per-channel control in sequence models. Whereas FG$^2$-GDN and its variant focus on per-channel step sizes (drawing analogies to AdaGrad/Adam in adaptive optimization), Gated DeltaNet-2 implements direct channel-wise gating, yielding similar fine-grained adaptation in memory dynamics at a modest incremental runtime cost (Sun et al., 21 Apr 2026).

Hybrid variants incorporating sliding-window attention retain the benefits of Gated DeltaNet-2 on long-range tasks and further boost local context modeling. This architectural flexibility, combined with an efficient chunkwise implementation and state-of-the-art accuracy on retrieval and long-context understanding, positions Gated DeltaNet-2 as a leading approach for scalable linear recurrent attention models (Hatamizadeh et al., 21 May 2026; Yang et al., 2024; Sun et al., 21 Apr 2026).


References (3)

Hatamizadeh et al. Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention (2026)

Sun et al. FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control (2026)

Yang et al. Gated Delta Networks: Improving Mamba2 with Delta Rule (2024)
