Gated DeltaNet-2: Advanced Memory Architecture

Updated 22 May 2026

The paper introduces separate channel-wise erase and write gates to decouple memory erasure and update, enhancing fine-grained fast-weight attention.
It employs chunk-parallel computation with fused kernels and log-parameterized decay to maintain hardware efficiency and numerical stability in long-context settings.
Empirical benchmarks demonstrate improved associative recall, language modeling, and retrieval performance, setting new standards over previous delta-based architectures.

Gated DeltaNet-2 is a second-generation linear memory architecture for sequence modeling, extending and generalizing the Gated DeltaNet and Kimi Delta Attention (KDA) families. By introducing separate channel-wise gates for erasure and writing within fast-weight attention, Gated DeltaNet-2 achieves fine-grained, axis-specific memory control. This separation leads to improved associative recall and long-context handling, while maintaining hardware efficiency via chunk-parallel computation and fused kernels. The model sets new benchmarks in language modeling, retrieval, and generalization under long-range interference.

1. Motivations and Fast-Weight Foundations

Traditional linear attention compresses sequence history into a fixed-size recurrent state, typically accumulating key-value outer products without explicit memory management. Delta-rule models, such as DeltaNet and Gated DeltaNet, improve over naive accumulation by first "reading" the old content at the current key, subtracting (erasing) a fraction of it, and then writing the new value. In earlier models, a single scalar "delta" gate controlled both the amount of erasure (on the key axis) and the strength of the write (on the value axis) at each step. KDA refined the decay to channel-wise vectors but kept one scalar shared by erase and write, limiting flexibility in state editing (Hatamizadeh et al., 21 May 2026).

Gated DeltaNet-2 addresses this limitation by introducing distinct channel-wise erase and write gates, so that the amount removed from memory and the amount of new information injected can differ per channel and per axis. When both gates reduce to scalars, GDN-2 recovers KDA and, with a scalar decay as well, Gated DeltaNet as special cases. This decoupling is central to its advantage in long-context, high-interference settings.

2. Mathematical Formalism

A. Gate Parameterizations

At each timestep $t$, given the per-token feature $x_t\in\mathbb{R}^d$:

- Erase gate: $b_t = \sigma(W_b x_t)\in[0,1]^{d_k}$
- Write gate: $w_t = \sigma(W_w x_t)\in[0,1]^{d_v}$
- Decay (channel-wise), log-parameterized for numerical stability:

$g_t = -\exp(a)\odot\mathrm{softplus}(W_f x_t+\delta),\qquad \alpha_t = \exp(g_t)\in(0,1]^{d_k}$

where $a,\delta$ are head-wise learnable parameters.
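The gate parameterization above can be sketched in NumPy for a single head. This is a minimal illustration, not the paper's implementation: the function name `compute_gates`, the random weights, and treating $a$ and $\delta$ as scalars for one head are assumptions made here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # log(1 + exp(x)), computed without overflow for large x
    return np.logaddexp(0.0, x)

def compute_gates(x, W_b, W_w, W_f, a, delta):
    """Erase gate b, write gate w, log-decay g and decay alpha for one token."""
    b = sigmoid(W_b @ x)                          # erase gate, shape (d_k,)
    w = sigmoid(W_w @ x)                          # write gate, shape (d_v,)
    g = -np.exp(a) * softplus(W_f @ x + delta)    # log-decay, always negative
    alpha = np.exp(g)                             # decay, strictly inside (0, 1)
    return b, w, g, alpha

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
d, d_k, d_v = 8, 4, 6
x = rng.standard_normal(d)
W_b = rng.standard_normal((d_k, d))
W_w = rng.standard_normal((d_v, d))
W_f = rng.standard_normal((d_k, d))
a, delta = 0.0, 0.0                               # head-wise parameters (scalars here)

b, w, g, alpha = compute_gates(x, W_b, W_w, W_f, a, delta)
```

Keeping `g` alongside `alpha` is deliberate: sums of log-decays are what a chunkwise implementation accumulates, and they remain finite where products of `alpha` would underflow.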

B. Gated Delta Rule-2 Recurrence

Let $A_{t-1}\in\mathbb{R}^{d_k\times d_v}$ be the previous fast-weight state. Compute

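- Normalized key $k_t\in\mathbb{R}^{d_k}$, value $v_t\in\mathbb{R}^{d_v}$, and query $q_t\in\mathbb{R}^{d_k}$
- Gated vectors: $u_t = b_t\odot k_t$, $z_t = w_t\odot v_t$

Apply decay:

$\tilde A_t = \mathrm{Diag}(\alpha_t)\,A_{t-1}$

Memory update (Gated Delta Rule-2, channel-wise asymmetric). One form consistent with the gates above and with the scalar-gate reduction to KDA is:

$A_t = \tilde A_t - u_t\,(k_t^\top \tilde A_t) + k_t\,z_t^\top$

Output read: $o_t = A_t^\top q_t$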
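A single recurrent step can be sketched as follows. The sketch assumes the update form $A_t = \tilde A_t - (b_t\odot k_t)(k_t^\top \tilde A_t) + k_t (w_t\odot v_t)^\top$ with $\tilde A_t = \mathrm{Diag}(\alpha_t) A_{t-1}$ and read-out $A_t^\top q_t$; this form is an assumption chosen to match the gate shapes and the scalar reduction to KDA, and the name `gdn2_step` is hypothetical.

```python
import numpy as np

def gdn2_step(A_prev, k, v, q, b, w, alpha):
    """One step of the assumed Gated Delta Rule-2 recurrence.

    A_prev: (d_k, d_v) state; k, q, b, alpha: (d_k,); v, w: (d_v,).
    """
    u = b * k                              # erase-gated key
    z = w * v                              # write-gated value
    A_dec = alpha[:, None] * A_prev        # channel-wise decay along the key axis
    read = k @ A_dec                       # old content stored at key k, shape (d_v,)
    A = A_dec - np.outer(u, read) + np.outer(k, z)
    o = A.T @ q                            # output read
    return A, o

# With all gates fully open and no decay, writing twice to the same
# unit-norm key overwrites the stored value instead of accumulating it.
d_k, d_v = 4, 3
k = np.ones(d_k) / 2.0                     # unit norm: sqrt(4 * 0.25) = 1
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([-1.0, 0.0, 5.0])
ones_k, ones_v = np.ones(d_k), np.ones(d_v)

A0 = np.zeros((d_k, d_v))
A1, o1 = gdn2_step(A0, k, v1, k, ones_k, ones_v, ones_k)
A2, o2 = gdn2_step(A1, k, v2, k, ones_k, ones_v, ones_k)
```

Here `o1` equals `v1` and `o2` equals `v2`, which is the delta-rule behavior the erase term is meant to provide; a purely additive linear-attention state would return `v1 + v2` on the second read.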

C. Efficient Chunkwise Parallelism

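Sequences are split into chunks of length $C$ and computed with a generalized WY algorithm. Writing the per-step update as $A_t = (I - u_t k_t^\top)\,\mathrm{Diag}(\alpha_t)\,A_{t-1} + k_t\,(w_t\odot v_t)^\top$ and indexing positions within a chunk by $t=1,\dots,C$, one consistent formulation is:

- Track cumulative log-decays $G_t = \sum_{s\le t} g_s$ and products $\Gamma_t = \exp(G_t)$
- Rescale $k_t$ and $u_t$ within each chunk: $\bar k_t = \Gamma_t\odot k_t$, $\hat k_t = k_t\oslash\Gamma_t$, $\bar u_t = u_t\oslash\Gamma_t$
- State update for chunk terminal:

$A_{[n+1]} = \mathrm{Diag}(\Gamma_C)\left(A_{[n]} + \hat K^\top Z - \bar U^\top R\right)$

where $R = \left(I + \mathrm{tril}(\bar K\bar U^\top,-1)\right)^{-1}\left(\bar K A_{[n]} + \mathrm{tril}(\bar K\hat K^\top,-1)\,Z\right)$, and $\bar K,\hat K,\bar U\in\mathbb{R}^{C\times d_k}$, $Z\in\mathbb{R}^{C\times d_v}$ stack $\bar k_t$, $\hat k_t$, $\bar u_t$, and $w_t\odot v_t$ as rows.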

All forward, backward, and state updates are implemented via fused dense kernels or small vector-Jacobian products, preserving high throughput (Hatamizadeh et al., 21 May 2026).
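The chunkwise idea can be checked numerically on one small chunk. The sketch below assumes the per-step form $A_t = (I - u_t k_t^\top)\,\mathrm{Diag}(\alpha_t)\,A_{t-1} + k_t (w_t\odot v_t)^\top$ with $u_t = b_t\odot k_t$, and derives a WY-style triangular system from it. Both function names are hypothetical, a dense `np.linalg.solve` stands in for the fused triangular solve, and nothing here reproduces the paper's kernels.

```python
import numpy as np

def recurrent_state(A0, K, V, B, W, G):
    """Reference: apply the assumed recurrence token by token."""
    A = A0.copy()
    for k, v, b, w, g in zip(K, V, B, W, G):
        A_dec = np.exp(g)[:, None] * A
        A = A_dec - np.outer(b * k, k @ A_dec) + np.outer(k, w * v)
    return A

def chunk_state(A0, K, V, B, W, G):
    """Chunk-terminal state from matrix products plus one triangular solve."""
    C = K.shape[0]
    Gam = np.exp(np.cumsum(G, axis=0))       # cumulative decay products, (C, d_k)
    K_bar = K * Gam                          # keys scaled up by cumulative decay
    K_hat = K / Gam                          # keys scaled down (write direction)
    U_bar = (B * K) / Gam                    # erase-gated keys scaled down
    Z = W * V                                # write-gated values
    L_u = np.tril(K_bar @ U_bar.T, k=-1)     # strictly causal erase interactions
    L_k = np.tril(K_bar @ K_hat.T, k=-1)     # strictly causal write interactions
    # (I + L_u) is unit lower triangular; a triangular solver would be used in practice.
    R = np.linalg.solve(np.eye(C) + L_u, K_bar @ A0 + L_k @ Z)
    return Gam[-1][:, None] * (A0 + K_hat.T @ Z - U_bar.T @ R)

rng = np.random.default_rng(0)
C, d_k, d_v = 8, 4, 3
K = rng.standard_normal((C, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)   # L2-normalized keys
V = rng.standard_normal((C, d_v))
B = rng.uniform(size=(C, d_k))                  # erase gates
W = rng.uniform(size=(C, d_v))                  # write gates
G = -rng.uniform(0.01, 0.3, size=(C, d_k))      # log-decays
A0 = rng.standard_normal((d_k, d_v))

A_rec = recurrent_state(A0, K, V, B, W, G)
A_chunk = chunk_state(A0, K, V, B, W, G)
```

The division by the cumulative decay is only safe because it is confined to a short chunk; over long spans the same quantities would be formed from differences of log-decays instead.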

3. Comparison with Predecessor Architectures

Variant	Decay Gate	Erase Gate	Write Gate	State Update
Gated DeltaNet	scalar $\alpha_t$	scalar $\beta_t$	scalar $\beta_t$	$A_t = \alpha_t (I-\beta_t k_t k_t^\top) A_{t-1} + \beta_t k_t v_t^\top$
Kimi Delta Attn	vector $\alpha_t$	scalar $\beta_t$	scalar $\beta_t$	$A_t = (I-\beta_t k_t k_t^\top)\,\mathrm{Diag}(\alpha_t)\,A_{t-1} + \beta_t k_t v_t^\top$
FG$^2$-GDN	vector	vector	vector	see Sun et al., 21 Apr 2026
GDN-2	vector $\alpha_t$	vector $b_t$	vector $w_t$	Gated Delta Rule-2 (Section 2B)

This progression culminates in Gated DeltaNet-2, which is strictly more expressive by allowing channel-wise and axis-specific decay, erase, and write, addressing the previously imposed coupling between erasure and write strengths (Hatamizadeh et al., 21 May 2026, Sun et al., 21 Apr 2026).
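The special-case claim can be checked with a short script. It assumes a GDN-2 update of the form $A_t = \tilde A_t - (b_t\odot k_t)(k_t^\top\tilde A_t) + k_t(w_t\odot v_t)^\top$ with $\tilde A_t = \mathrm{Diag}(\alpha_t)A_{t-1}$, together with the published KDA and Gated DeltaNet updates written in the same $d_k\times d_v$ state layout. All function names are illustrative.

```python
import numpy as np

def gdn2_update(A, k, v, b, w, alpha):
    # Assumed GDN-2 form: vector decay, vector erase gate b, vector write gate w.
    A_dec = alpha[:, None] * A
    return A_dec - np.outer(b * k, k @ A_dec) + np.outer(k, w * v)

def kda_update(A, k, v, beta, alpha):
    # KDA: vector decay, one scalar beta shared by erase and write.
    I = np.eye(k.shape[0])
    return (I - beta * np.outer(k, k)) @ (alpha[:, None] * A) + beta * np.outer(k, v)

def gated_deltanet_update(A, k, v, beta, alpha_scalar):
    # Gated DeltaNet: scalar decay and scalar beta.
    I = np.eye(k.shape[0])
    return alpha_scalar * (I - beta * np.outer(k, k)) @ A + beta * np.outer(k, v)

rng = np.random.default_rng(1)
d_k, d_v = 4, 3
A = rng.standard_normal((d_k, d_v))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)
v = rng.standard_normal(d_v)
beta = 0.7
alpha_vec = rng.uniform(0.5, 1.0, size=d_k)
alpha_scalar = 0.9

# Constant gate vectors collapse the channel-wise gates to a shared scalar.
b_const, w_const = np.full(d_k, beta), np.full(d_v, beta)
as_kda = gdn2_update(A, k, v, b_const, w_const, alpha_vec)
as_gdn = gdn2_update(A, k, v, b_const, w_const, np.full(d_k, alpha_scalar))
```

Under these assumptions `as_kda` coincides with `kda_update(A, k, v, beta, alpha_vec)` and `as_gdn` with `gated_deltanet_update(A, k, v, beta, alpha_scalar)`.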

4. Block Composition and Implementation

Each block in Gated DeltaNet-2 consists of the following sequence:

- Gated DeltaNet-2 token mixer as described above
- MLP
- Optionally, Sliding-Window Attention (for hybrid variants)
- MLP
- RMS-norm and SiLU gating at output

Keys and queries are obtained via causal convolution, SiLU, and L2 normalization; values come from a parallel convolution and linear stack. Chunkwise state updates and outputs are computed with fused kernels in fp32, using 16 heads per block and a fixed head dimension and chunk size chosen for kernel efficiency. All gates are computed via learned projections from the token input, followed by an elementwise sigmoid (or the log-exp parameterization for decays).

The backward pass propagates gradients to the gate parameters through the fused WY triangular solve, keeping parallel training efficient (Hatamizadeh et al., 21 May 2026).
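The key and query path (causal convolution, SiLU, L2 normalization) can be sketched in NumPy. Several details are assumptions made for illustration: the linear projection applied before the convolution, the depthwise form of the convolution, the kernel width of 4, and all function names.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))            # x * sigmoid(x)

def l2_normalize(x, eps=1e-6):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def causal_depthwise_conv(x, kernel):
    """x: (T, d), kernel: (width, d). Output at t uses only x[t-width+1 .. t]."""
    T, d = x.shape
    width = kernel.shape[0]
    x_pad = np.concatenate([np.zeros((width - 1, d)), x], axis=0)  # left padding only
    out = np.zeros_like(x)
    for j in range(width):
        out += kernel[j] * x_pad[j:j + T]
    return out

def make_keys(x, W_k, kernel):
    # projection -> causal conv -> SiLU -> per-token L2 normalization
    return l2_normalize(silu(causal_depthwise_conv(x @ W_k, kernel)))

rng = np.random.default_rng(0)
T, d, d_k, width = 6, 8, 4, 4
x = rng.standard_normal((T, d))
W_k = rng.standard_normal((d, d_k))
kernel = rng.standard_normal((width, d_k))
keys = make_keys(x, W_k, kernel)

# Perturbing the last token must leave all earlier keys unchanged.
x_future = x.copy()
x_future[-1] += 1.0
keys_future = make_keys(x_future, W_k, kernel)
```

Left-only padding is what makes the convolution causal, and the final normalization gives every key unit norm, which keeps the erase term $k_t k_t^\top$ well scaled.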

5. Empirical Performance and Benchmarks

1.3B-parameter GDN-2 models, trained on 100B tokens (FineWeb-Edu, maximum training length 4K), achieve:

- Language modeling (WikiText / LAMBADA perplexity, lower is better):
  - Recurrent GDN-2: 15.90 / 11.41
  - Hybrid GDN-2: 15.62 / 10.43
  - Both lower than Gated DeltaNet (16.40 / 11.89) and KDA (16.81 / 11.68)
- Commonsense reasoning (average accuracy over PIQA through BoolQ): 53.11% (recurrent), 53.97% (hybrid), the highest among all considered variants
- Needle-in-a-haystack (RULER): best retrieval performance, especially in multi-key and interference-prone settings; accuracy remains high as context length increases
- Real-world retrieval (SWDE, SQuAD, TriviaQA, FDA, NQ, DROP): 29.88% recall (recurrent), 42.28% (hybrid), both outperforming previous linear attention variants
- Throughput: 38K to 36K tokens/s over 2K to 16K tokens on H100, slower than KDA, with all gating overhead handled via elementwise operations (Hatamizadeh et al., 21 May 2026, Sun et al., 21 Apr 2026)

Ablations confirm that the channel-wise erase gate ($b_t$) yields the majority of the gain, while a distinct write gate ($w_t$) further boosts performance, especially in retrieval and few-shot tasks. This suggests that axis-specific, learnable update rates are critical for robust associative memory under interference.

6. Geometric Perspective and Deep Residual Generalization

Gated DeltaNet-2 admits a geometric interpretation as a depthwise rank-1 operator, similar to Deep Delta Learning (Zhang et al., 1 Jan 2026). For a unit-norm direction $k$, the delta operator $I - \beta\,k k^\top$ interpolates between identity, projection, and reflection, with the gate $\beta$ controlling spectral behavior:

- $\beta = 0$: identity mapping
- $\beta = 1$: projection onto $k^\perp$
- $\beta = 2$: Householder reflection across $k^\perp$

In DDL-style architectures, the residual is modulated by a synchronous, gated delta: $X_{l+1} = X_l + \beta_l\,k_l\,(v_l^\top - k_l^\top X_l)$. This admits enhanced gradient stability, fast convergence, and improved calibration, and GDN-2 can be interpreted as the sequential (recurrent) version of this geometric update (Zhang et al., 1 Jan 2026).
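The three regimes can be verified numerically, assuming the delta operator takes the standard rank-1 form $I - \beta\,k k^\top$ with unit-norm $k$. Its eigenvalues are $1-\beta$ along $k$ and $1$ on the orthogonal complement. The helper name `delta_operator` is illustrative.

```python
import numpy as np

def delta_operator(k, beta):
    """Return I - beta * k k^T after normalizing k to unit length."""
    k = k / np.linalg.norm(k)
    return np.eye(k.shape[0]) - beta * np.outer(k, k)

k = np.array([3.0, 4.0, 0.0])
k_unit = k / np.linalg.norm(k)

identity_op = delta_operator(k, 0.0)     # beta = 0
projection_op = delta_operator(k, 1.0)   # beta = 1
reflection_op = delta_operator(k, 2.0)   # beta = 2

# Symmetric matrices, so eigvalsh applies; eigenvalues come back in ascending order.
spectra = {beta: np.linalg.eigvalsh(delta_operator(k, beta)) for beta in (0.0, 1.0, 2.0)}
```

The projection removes the component along $k$ and is idempotent, while the reflection flips that component and is orthogonal, so only $\beta = 1$ loses information about the state.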

A plausible implication is that the deep connection between projection-based memory control and fast-weight indexing facilitates better matching of the memory update structure to the algebraic needs of sequence modeling.

7. Best Practices and Practical Considerations

- Both erase and write gates should be channel-wise for best associative recall and stability; scalar approximations degrade performance by 0.3–0.7 perplexity and by multiple points of retrieval accuracy.
- Decay should be implemented via log-parameterization in fp32 to avoid roundoff in long-range contexts.
- Preferred kernel fusion strategies operate at a fixed chunk size chosen for maximal throughput.
- L2 normalization of keys and queries per head improves numerical stability.
- The chunkwise WY kernel and vector-Jacobian backward pass are essential to preserve both efficiency and differentiability in training.
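The recommendation to keep decays in log space can be illustrated with a small fp32 experiment. The constant decay of 0.5 and the sequence length are arbitrary values chosen so that the effect is easy to see; they are not taken from the paper.

```python
import numpy as np

T = 400
alpha = np.full(T, 0.5, dtype=np.float32)   # illustrative constant decay

# Direct cumulative product: underflows to exactly zero in fp32.
prod = np.cumprod(alpha)

# Cumulative sum of log-decays: stays finite and well scaled.
G = np.cumsum(np.log(alpha))

# Relative decay between two late positions s < t.
s, t = 389, 399
products_are_zero = bool(prod[s] == 0.0 and prod[t] == 0.0)  # a ratio would be 0/0
ratio_from_logs = float(np.exp(G[t] - G[s]))                 # close to 0.5 ** 10
```

Once both products have underflowed, the relative decay between the two positions can no longer be recovered from them, whereas the difference of log-decays still gives approximately `0.5 ** 10`.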

Gated DeltaNet-2 demonstrates that independent, vectorized erasing and writing within fast-weight memory models are essential for overcoming interference and saturation in long-context settings, establishing a new standard among linear attention mechanisms (Hatamizadeh et al., 21 May 2026, Sun et al., 21 Apr 2026, Zhang et al., 1 Jan 2026).


References (3)

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention (2026)

FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control (2026)

Deep Delta Learning (2026)
