
Gated DeltaNet-2: Advanced Memory Architecture

Updated 22 May 2026
  • The paper introduces separate channel-wise erase and write gates to decouple memory erasure and update, enhancing fine-grained fast-weight attention.
  • It employs chunk-parallel computation with fused kernels and log-parameterized decay to maintain hardware efficiency and numerical stability in long-context settings.
  • Empirical benchmarks demonstrate improved associative recall, language modeling, and retrieval performance, setting new standards over previous delta-based architectures.

Gated DeltaNet-2 is a second-generation linear memory architecture for sequence modeling, extending and generalizing the Gated DeltaNet and Kimi Delta Attention (KDA) families. By introducing separate channel-wise gates for erasure and writing within fast-weight attention, Gated DeltaNet-2 achieves fine-grained, axis-specific memory control. This separation leads to improved associative recall and long-context handling, while maintaining hardware efficiency via chunk-parallel computation and fused kernels. The model sets new benchmarks in language modeling, retrieval, and generalization under long-range interference.

1. Motivations and Fast-Weight Foundations

Traditional linear attention compresses sequence history into a fixed-size recurrent state, typically accumulating key-value outer products without explicit memory management. Delta-rule models, such as DeltaNet and Gated DeltaNet, improve over naive accumulation by first "reading" the old content at the current key, subtracting (erasing) a fraction, and subsequently writing the new value. In earlier models, a single scalar "delta" gate controlled both the amount of erasure (on the key axis) and the strength of write (on the value axis) for each step. KDA sharpened the decay control to channel-wise vectors, but retained a single scalar for write, limiting flexibility in state editing (Hatamizadeh et al., 21 May 2026).

Gated DeltaNet-2 addresses this fundamental limitation by introducing distinct channel-wise erase and write gates—enabling directional and magnitude distinction between memory removal and new information injection. When both reduce to scalars, GDN-2 recovers KDA and Gated DeltaNet as special cases. This decoupling is central to its superiority in long-context, high-interference settings.

2. Mathematical Formalism

A. Gate Parameterizations

At each timestep $t$, given per-token feature $x_t\in\mathbb{R}^d$:

  • Erase gate: $b_t = \sigma(W_b x_t)\in[0,1]^{d_k}$
  • Write gate: $w_t = \sigma(W_w x_t)\in[0,1]^{d_v}$
  • Decay (channel-wise): a log-parameterization is used for numerical stability:

$$g_t = -\exp(a)\odot\mathrm{softplus}(W_f x_t+\delta),\qquad \alpha_t = \exp(g_t)\in(0,1]^{d_k}$$

where $a,\delta$ are head-wise learnable parameters.
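The gate parameterization above can be sketched in plain Python for a single token and a single head. This is only an illustration under stated assumptions, not the authors' kernel: the function name `gdn2_gates`, the toy weight matrices, and the choice to treat $a$ and $\delta$ as one scalar each per head are all hypothetical.

```python
import math

def sigmoid(z):
    # Branching form avoids overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softplus(z):
    # Stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def matvec(W, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def gdn2_gates(x, W_b, W_w, W_f, a, delta):
    """Return (b, w, g, alpha) for one token and one head.

    Assumption of this sketch: a and delta are scalars shared by the head.
    """
    b = [sigmoid(z) for z in matvec(W_b, x)]           # erase gate, length d_k
    w = [sigmoid(z) for z in matvec(W_w, x)]           # write gate, length d_v
    g = [-math.exp(a) * softplus(z + delta)            # log-decay, always <= 0
         for z in matvec(W_f, x)]
    alpha = [math.exp(g_i) for g_i in g]               # decay in (0, 1]
    return b, w, g, alpha

# Toy example with d = 3, d_k = 2, d_v = 3 (all values are made up).
x = [0.5, -1.0, 2.0]
W_b = [[0.1, 0.2, 0.3], [0.0, -0.5, 0.4]]
W_w = [[0.3, 0.0, -0.1], [0.2, 0.2, 0.2], [-0.4, 0.1, 0.0]]
W_f = [[0.05, -0.1, 0.2], [0.3, 0.1, -0.2]]
b, w, g, alpha = gdn2_gates(x, W_b, W_w, W_f, a=0.0, delta=0.0)
```

Keeping `g` (the log-decay) alongside `alpha` matters in practice: sums of `g` over many steps stay finite where products of `alpha` would underflow.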

B. Gated Delta Rule-2 Recurrence

Let $A_{t-1}\in\mathbb{R}^{d_k\times d_v}$ be the previous fast-weight state. Compute

  • Normalized key $k_t\in\mathbb{R}^{d_k}$, value $v_t\in\mathbb{R}^{d_v}$
  • Gated vectors: $u_t = b_t\odot k_t$, and a second gated vector [definition missing in source]

Apply decay:

[equation missing in source]

Memory update (Gated Delta Rule-2, channel-wise asymmetric):

[equation missing in source]

Output read: [equation missing in source]
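Because the exact recurrence is not legible in this copy, the following is only one plausible reading of the decay, erase, write, and read steps, written so that scalar gates reduce it to the classical gated delta rule. The function name `gdn2_step`, the placement of the erase gate on the read key, and the placement of the write gate on the value are assumptions of this sketch, not the paper's confirmed equations.

```python
def gdn2_step(A, q, k, v, b, w, alpha):
    """One step of an ASSUMED delta rule with separate erase and write gates.

    A: d_k x d_v state (list of rows); q, k, b, alpha: length d_k;
    v, w: length d_v. Returns (new_state, output).
    """
    d_k, d_v = len(k), len(v)
    # 1) Channel-wise decay along the key axis.
    A_dec = [[alpha[i] * A[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Read the old content with the erase-gated key u = b * k.
    u = [b[i] * k[i] for i in range(d_k)]
    read = [sum(u[i] * A_dec[i][j] for i in range(d_k)) for j in range(d_v)]
    # 3) Write-gated value (hypothetical name z).
    z = [w[j] * v[j] for j in range(d_v)]
    # 4) Erase the read content and write the new value along direction k.
    A_new = [[A_dec[i][j] + k[i] * (z[j] - read[j]) for j in range(d_v)]
             for i in range(d_k)]
    # 5) Output read with the query.
    o = [sum(q[i] * A_new[i][j] for i in range(d_k)) for j in range(d_v)]
    return A_new, o

# Toy example: unit-norm key, fully open gates, no decay.
k = [0.6, 0.8]
v = [1.0, -2.0, 3.0]
A0 = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.5]]
A1, o = gdn2_step(A0, k, k, v, [1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0])
# With b = 1, w = 1 and alpha = 1, querying with the same key returns v.
```

Under this reading, setting every entry of `b` and `w` to one scalar gives the familiar update (identity minus a gated rank-1 erase, plus a gated rank-1 write), which matches the statement that the earlier models are special cases.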

C. Efficient Chunkwise Parallelism

Sequences are split into chunks of length [value missing in source] and computed with a generalized WY algorithm:

  • Track cumulative log-decays and the corresponding products [symbols missing in source]
  • Rescale [quantities missing in source] within each chunk
  • State update for chunk terminal:

[equation missing in source]

where [definition missing in source].

All forward, backward, and state updates are implemented via fused dense kernels or small vector-Jacobian products, preserving high throughput (Hatamizadeh et al., 21 May 2026).
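The role of the cumulative log-decays can be illustrated for a single channel: the decay accumulated between two positions in a chunk is the exponential of a difference of prefix sums, which stays well defined even when the raw product of decays underflows. The helper names `cumulative_log_decay` and `decay_between` are hypothetical, and the sketch runs in Python's float64 rather than the fp32 used by the kernels described above.

```python
import math

def cumulative_log_decay(g_chunk):
    """Prefix sums G[t] = g[0] + ... + g[t] of per-step log-decays."""
    out, total = [], 0.0
    for g in g_chunk:
        total += g
        out.append(total)
    return out

def decay_between(G, s, t):
    """Decay accumulated over steps s+1 .. t (inclusive), for s <= t."""
    return math.exp(G[t] - G[s])

# Short chunk: the log-space result matches the direct product.
g_short = [math.log(0.9), math.log(0.5), math.log(0.8), math.log(0.7)]
G_short = cumulative_log_decay(g_short)
d_short = decay_between(G_short, 0, 3)      # equals 0.5 * 0.8 * 0.7

# Long sequence with strong decay: the running product underflows to zero,
# so ratios of prefix products are unusable, while log-space differences
# remain exact.
g_long = [-1.0] * 2000
naive_product = 1.0
for g in g_long:
    naive_product *= math.exp(g)            # ends at 0.0 in float64
G_long = cumulative_log_decay(g_long)
d_long = decay_between(G_long, 1989, 1999)  # exp(-10), still representable
```

In lower precision the underflow arrives much sooner, which is one reason to carry decays in log form inside each chunk.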

3. Comparison with Predecessor Architectures

Variant | Decay Gate | Erase Gate | Write Gate | State Update
Gated DeltaNet | scalar | scalar | scalar | [missing in source]
Kimi Delta Attn | vector | scalar | scalar | [missing in source]
FG[…]-GDN | vector | vector | vector | [missing in source]
GDN-2 | vector | vector | vector | [missing in source]

This progression culminates in Gated DeltaNet-2, which is strictly more expressive by allowing channel-wise and axis-specific decay, erase, and write, addressing the previously imposed coupling between erasure and write strengths (Hatamizadeh et al., 21 May 2026, Sun et al., 21 Apr 2026).

4. Block Composition and Implementation

Each Gated DeltaNet-2 block is composed as follows. Keys and queries are obtained via causal convolution, SiLU, and L2 normalization; values come from a parallel convolution and linear stack. Chunkwise state updates and outputs are computed with fused kernels in fp32, with head dimension [value missing in source], 16 heads per block, and chunk size [value missing in source] for kernel efficiency. All gates are computed via learned projections from the token input, followed by an elementwise sigmoid (or the log-exp parameterization for decays).

The backward pass propagates gradients to the gate parameters through the fused WY triangular solve, keeping parallel training both efficient and exact (Hatamizadeh et al., 21 May 2026).
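The key/query path described above (causal convolution, then SiLU, then L2 normalization) can be sketched as follows. The depthwise form of the convolution, the kernel width of 3, the omission of any preceding linear projection, and the names `causal_depthwise_conv` and `key_path` are assumptions of this sketch; the source does not specify them.

```python
import math

def silu(z):
    # z * sigmoid(z), written to avoid overflow for large |z|.
    if z >= 0:
        return z / (1.0 + math.exp(-z))
    e = math.exp(z)
    return z * e / (1.0 + e)

def causal_depthwise_conv(xs, kernel):
    """xs: T vectors of length d; kernel: d filters of length K.

    Output at step t only uses xs[t-K+1 .. t] (zero padding on the left),
    so no information flows from future tokens.
    """
    T, d, K = len(xs), len(xs[0]), len(kernel[0])
    out = []
    for t in range(T):
        y = []
        for c in range(d):
            acc = 0.0
            for j in range(K):
                s = t - j
                if s >= 0:
                    acc += kernel[c][j] * xs[s][c]
            y.append(acc)
        out.append(y)
    return out

def l2_normalize(vec, eps=1e-6):
    norm = math.sqrt(sum(z * z for z in vec))
    return [z / (norm + eps) for z in vec]

def key_path(xs, kernel):
    """Causal conv -> SiLU -> L2 norm, one output vector per token."""
    return [l2_normalize([silu(z) for z in y])
            for y in causal_depthwise_conv(xs, kernel)]

# Toy input: T = 4 tokens, d = 2 channels, hypothetical kernel width K = 3.
xs = [[1.0, 2.0], [0.5, -1.0], [-0.3, 0.8], [2.0, 0.1]]
kernel = [[0.5, 0.3, 0.2], [0.4, -0.1, 0.6]]
keys = key_path(xs, kernel)
```

Unit-norm keys keep the rank-1 erase term bounded, which is consistent with the stability role the text assigns to L2 normalization.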

5. Empirical Performance and Benchmarks

1.3B parameter GDN-2 models, trained on 100B tokens (FineWeb-Edu, max train length 4K), achieve:

  • Language modeling (WikiText / LAMBADA perplexity):
    • Recurrent GDN-2: 15.90 / 11.41
    • Hybrid GDN-2: 15.62 / 10.43
    • Stronger than both Gated DeltaNet (16.40 / 11.89) and KDA (16.81 / 11.68)
  • Commonsense reasoning (PIQA […] BoolQ avg. acc): 53.11% (recurrent), 53.97% (hybrid), the highest among all considered variants
  • Needle-in-a-haystack (RULER): Best retrieval performance, especially in multi-key and interference-prone settings; maintains high accuracy as context length increases
  • Real-world retrieval (SWDE, SQuAD, TriviaQA, FDA, NQ, DROP): 29.88% recall (recurrent), 42.28% (hybrid)—both outperforming previous linear attention variants
  • Throughput: Maintains 38K–36K tokens/s over 2K–16K tokens on H100, [value missing in source] slower than KDA, with all gating overhead handled via elementwise operations (Hatamizadeh et al., 21 May 2026, Sun et al., 21 Apr 2026)

Ablations confirm that channel-wise erase ($b_t$) yields the majority of the gain, while distinct writes ($w_t$) further boost performance, especially in retrieval and few-shot tasks. This suggests that axis-specific, learnable update rates are critical for robust associative memory under interference.

6. Geometric Perspective and Deep Residual Generalization

Gated DeltaNet-2 admits a geometric interpretation as a depthwise rank-1 operator, similar to Deep Delta Learning (Zhang et al., 1 Jan 2026). The delta operator $I - \beta\,k k^\top$, with unit-norm direction $k$ and scalar gate $\beta$, interpolates between identity, projection, and reflection, with the gate controlling spectral behavior:

  • $\beta = 0$: identity mapping
  • $\beta = 1$: projection onto $k^\perp$
  • $\beta = 2$: Householder reflection across $k^\perp$

In DDL-style architectures, the residual is modulated by a synchronous, gated delta: $X_{l+1} = X_l + \beta_l\,k_l\,(v_l^\top - k_l^\top X_l)$. This yields enhanced gradient stability, fast convergence, and improved calibration, and GDN-2 can be interpreted as the sequential (recurrent) version of this geometric update (Zhang et al., 1 Jan 2026).
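The three regimes (identity, projection, reflection) can be checked numerically. The sketch below assumes the delta operator has the rank-1 form "identity minus beta times the outer product of a unit vector k with itself", which is the standard delta-rule form; the names `delta_operator` and `matvec` and the symbols `beta` and `k` are this sketch's notation and are not taken from the source.

```python
def delta_operator(k, beta):
    """Return I - beta * k k^T as a list of rows; k is assumed unit-norm."""
    d = len(k)
    return [[(1.0 if i == j else 0.0) - beta * k[i] * k[j] for j in range(d)]
            for i in range(d)]

def matvec(M, x):
    return [sum(m * x_j for m, x_j in zip(row, x)) for row in M]

def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))

k = [0.6, 0.8]          # unit-norm direction
x = [1.0, 2.0]          # arbitrary test vector

same = matvec(delta_operator(k, 0.0), x)       # beta = 0: x is unchanged
projected = matvec(delta_operator(k, 1.0), x)  # beta = 1: component along k removed
reflected = matvec(delta_operator(k, 2.0), x)  # beta = 2: component along k flipped
```

The operator has eigenvalue 1 - beta along k and eigenvalue 1 on everything orthogonal to k, so beta in [0, 2] keeps all eigenvalues in [-1, 1]; this is the spectral control the text refers to.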

A plausible implication is that the deep connection between projection-based memory control and fast-weight indexing facilitates better matching of the memory update structure to the algebraic needs of sequence modeling.

7. Best Practices and Practical Considerations

  • Both erase and write gates must be channel-wise for optimal associative recall and stability; scalar approximations degrade performance by 0.3–0.7 perplexity and multiple points of retrieval accuracy.
  • Decay should be implemented via log-parameterization in fp32 to avoid roundoff in long-range contexts.
  • Preferred kernel fusion strategies operate at chunk size [value missing in source] for maximal throughput.
  • L2 normalization of keys/queries per head improves numerical stability.
  • Chunkwise WY kernel and vector-Jacobian backward are essential to preserve both efficiency and differentiability in training.

Gated DeltaNet-2 demonstrates that independent, vectorized erasing and writing within fast-weight memory models are essential for overcoming interference and saturation in long-context settings, establishing a new standard among linear attention mechanisms (Hatamizadeh et al., 21 May 2026, Sun et al., 21 Apr 2026, Zhang et al., 1 Jan 2026).

References (3)
