Channel-wise Erase and Write Gates

Updated 22 May 2026

Channel-wise Erase and Write Gates are mechanisms that enable per-channel control in recurrent linear attention architectures by decoupling erase and write operations.
They enhance memory fidelity and interference robustness in long-context sequence processing by separately modulating key-side erasure and value-side updates.
Empirical results demonstrate notable performance gains in language modeling, zero-shot reasoning, and synthetic retrieval through independent channel-level control.

Channel-wise erase and write gates are gating mechanisms that enable fine-grained, per-channel modulation of memory updates in recurrent linear attention architectures. Originally introduced in the Gated DeltaNet-2 framework, these gates separately control how existing memory content is erased (on the key side) and how new information is written (on the value side), addressing limitations of previous scalar-gated models. By decoupling erase and write operations and promoting channel-level specificity, channel-wise gates provide enhanced interference robustness and memory fidelity, particularly for long-context sequence processing (Hatamizadeh et al., 21 May 2026).

1. Mathematical Formulation

At each timestep $t$, Gated DeltaNet-2 computes two channel-wise gates from the token representation $x_t$:

Erase gate: $b_t = \sigma(W_b x_t) \in (0,1)^{d_k}$, with $W_b \in \mathbb{R}^{d_k \times d_{\rm model}}$, a sigmoid applied to a learned projection.
Write gate: $w_t = \sigma(W_w x_t) \in (0,1)^{d_v}$, with $W_w \in \mathbb{R}^{d_v \times d_{\rm model}}$, likewise a sigmoid applied to a learned projection.

These control, respectively, which key-side channels are erased and which value-side channels are written. In addition, a channel-wise decay vector $a_t = \exp(g_t) \in (0,1]^{d_k}$ scales the previous state, where the log-decay $g_t \le 0$ comes from a separate projection and activation.
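The gate computations can be sketched in NumPy as follows. This is a minimal illustration, not the reference implementation: the projection `W_g` and the choice $g_t = -\mathrm{softplus}(W_g x_t)$ are assumptions made only so that $g_t \le 0$, since the text says no more than that $g_t$ comes from a separate projection and activation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def compute_gates(x_t, W_b, W_w, W_g):
    """Return (b_t, w_t, a_t) for a single token representation x_t.

    W_g and the -softplus activation are ASSUMPTIONS for illustration;
    the source only states that g_t uses a separate projection and activation.
    """
    b_t = sigmoid(W_b @ x_t)      # erase gate, shape (d_k,), values in (0, 1)
    w_t = sigmoid(W_w @ x_t)      # write gate, shape (d_v,), values in (0, 1)
    g_t = -softplus(W_g @ x_t)    # assumed log-decay, always <= 0
    a_t = np.exp(g_t)             # channel-wise decay, values in (0, 1]
    return b_t, w_t, a_t

rng = np.random.default_rng(0)
d_model, d_k, d_v = 16, 4, 6
x_t = rng.standard_normal(d_model)
W_b = 0.1 * rng.standard_normal((d_k, d_model))
W_w = 0.1 * rng.standard_normal((d_v, d_model))
W_g = 0.1 * rng.standard_normal((d_k, d_model))

b_t, w_t, a_t = compute_gates(x_t, W_b, W_w, W_g)
print(b_t.shape, w_t.shape, a_t.shape)
```

Note that the erase gate and the decay both live on the key dimension $d_k$, while the write gate lives on the value dimension $d_v$.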

2. Fast-Weight Memory Update and Dynamics

Gated DeltaNet-2 maintains a fast-weight state $S_t \in \mathbb{R}^{d_k \times d_v}$. Its update comprises two steps:

Decay: $\bar{S}_t = \mathrm{Diag}(a_t) S_{t-1}$
Edit: $S_t$ is obtained from $\bar{S}_t$ by a rank-one erase and a rank-one write [equation missing], defined in terms of
- an erase vector [definition missing],
- a write vector [definition missing],
- the normalized key and value.

Organized as a single expression [equation missing], this structure enables rank-one erasure and writing, with per-channel modulation.

Equivalently, this update optimizes a local least-squares objective [objective missing], with a stationary point characterized by [condition missing].
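A single decay-then-edit step can be sketched as below. The edit equation is not legible in this section, so the code adopts one plausible form as an explicit assumption: erase vector $e_t = b_t \odot k_t$, write vector $u_t = w_t \odot v_t$, and $S_t = \bar{S}_t - e_t (k_t^\top \bar{S}_t) + k_t u_t^\top$. This form is hypothetical; it was chosen because it is rank-one in both terms and collapses to the scalar delta rule when both gates equal the same scalar. The L2 normalization of key and value is likewise an assumption.

```python
import numpy as np

def l2_normalize(z, eps=1e-6):
    # Assumed normalization; the text only says key and value are normalized.
    return z / (np.linalg.norm(z) + eps)

def gated_delta_step(S_prev, k, v, b, w, a):
    """One decay-then-edit step on a fast-weight state of shape (d_k, d_v).

    ASSUMED edit form (hypothetical, not confirmed by the source):
        S_t = S_bar - e (k^T S_bar) + k u^T,  with e = b * k and u = w * v.
    """
    k = l2_normalize(k)
    v = l2_normalize(v)
    S_bar = a[:, None] * S_prev      # Decay: Diag(a_t) S_{t-1}
    e = b * k                        # assumed erase vector (key side)
    u = w * v                        # assumed write vector (value side)
    read = k @ S_bar                 # k^T S_bar, shape (d_v,)
    return S_bar - np.outer(e, read) + np.outer(k, u)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S = np.zeros((d_k, d_v))
for _ in range(5):
    k = rng.standard_normal(d_k)
    v = rng.standard_normal(d_v)
    b = rng.uniform(size=d_k)            # erase gate in (0, 1)
    w = rng.uniform(size=d_v)            # write gate in (0, 1)
    a = rng.uniform(0.9, 1.0, size=d_k)  # decay in (0, 1]
    S = gated_delta_step(S, k, v, b, w, a)
print(S.shape)
```

Under this assumed form, a key channel with $b_t$ near zero is left almost untouched by the erase term, and a value channel with $w_t$ near zero receives almost no new content.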

3. Relation to Prior Delta-Rule Models

Channel-wise gating generalizes prior delta-rule attention variants:

If both gates reduce to the same scalar $\beta_t$, that is $b_t = \beta_t \mathbf{1}_{d_k}$ and $w_t = \beta_t \mathbf{1}_{d_v}$, the update becomes Kimi Delta Attention (KDA), where $k_t$ and $v_t$ denote the normalized key and value:

$S_t = (I - \beta_t k_t k_t^\top)\,\mathrm{Diag}(a_t)\,S_{t-1} + \beta_t k_t v_t^\top$

If the decay also reduces to a scalar, $a_t = \alpha_t \mathbf{1}_{d_k}$, the formulation matches Gated DeltaNet:

$S_t = \alpha_t (I - \beta_t k_t k_t^\top)\,S_{t-1} + \beta_t k_t v_t^\top$

Thus, channel-wise erase and write gates subsume earlier scalar-gated approaches, enabling strictly more flexible memory control by allowing independent, channel-wise gating.
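The two reductions can be checked numerically. The sketch below compares a channel-wise step against the standard KDA and Gated DeltaNet recurrences. The channel-wise step uses an assumed form, $S_t = \bar{S}_t - (b_t \odot k_t)(k_t^\top \bar{S}_t) + k_t (w_t \odot v_t)^\top$, which is a hypothetical choice made for illustration and not taken from the source; the function names are also illustrative.

```python
import numpy as np

def gated_step(S_prev, k, v, b, w, a):
    # ASSUMED channel-wise form (hypothetical, see the lead-in).
    S_bar = a[:, None] * S_prev
    return S_bar - np.outer(b * k, k @ S_bar) + np.outer(k, w * v)

def kda_step(S_prev, k, v, beta, a):
    # KDA: (I - beta k k^T) Diag(a) S_prev + beta k v^T
    d_k = k.shape[0]
    erase = np.eye(d_k) - beta * np.outer(k, k)
    return erase @ (a[:, None] * S_prev) + beta * np.outer(k, v)

def gated_deltanet_step(S_prev, k, v, beta, alpha):
    # Gated DeltaNet: alpha (I - beta k k^T) S_prev + beta k v^T
    d_k = k.shape[0]
    erase = np.eye(d_k) - beta * np.outer(k, k)
    return alpha * (erase @ S_prev) + beta * np.outer(k, v)

rng = np.random.default_rng(1)
d_k, d_v = 5, 3
S_prev = rng.standard_normal((d_k, d_v))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)
v = rng.standard_normal(d_v)
v /= np.linalg.norm(v)
a = rng.uniform(0.5, 1.0, size=d_k)
beta, alpha = 0.7, 0.9

# Both gates tied to the same scalar beta: should match KDA.
tied = gated_step(S_prev, k, v, beta * np.ones(d_k), beta * np.ones(d_v), a)
kda_gap = np.abs(tied - kda_step(S_prev, k, v, beta, a)).max()

# Decay additionally tied to a scalar alpha: should match Gated DeltaNet.
tied_decay = gated_step(S_prev, k, v, beta * np.ones(d_k), beta * np.ones(d_v),
                        alpha * np.ones(d_k))
gdn_gap = np.abs(tied_decay - gated_deltanet_step(S_prev, k, v, beta, alpha)).max()

print(kda_gap, gdn_gap)
```

Both gaps should be at the level of floating-point rounding, which is what "subsumes" means operationally: the scalar-gated models are the special case where every channel shares one gate value.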

4. Efficient Chunkwise WY Algorithm

For efficient GPU training, updates are performed over chunks of fixed length [symbol missing]:

Log-decay terms [expression missing] and their exponentials [expression missing] are accumulated.
Key and erase vectors are normalized by decay [expressions missing].
These yield block matrices [expressions missing].
The lower-triangular score matrix [expression missing] forms a linear system [expression missing].
WY auxiliaries [expressions missing] are computed, and final chunk updates are performed with small dense multiplications and diagonal scaling [equation missing].

With the chunk length fixed, this algorithm maintains its asymptotic complexity [expression missing] and maps efficiently to tensor computation.
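The role of the lower-triangular system can be illustrated with the generic compact WY identity for a product of rank-one-perturbed identities. The sketch below is not the gated chunk algorithm itself: it omits decay entirely, and the names `U`, `V`, and `Y` are illustrative. It shows only that a product of factors $(I - u_i v_i^\top)$ over a chunk can be rewritten as $I - U^\top Y$, where $Y$ solves a unit lower-triangular system built from the pairwise scores $v_t^\top u_i$.

```python
import numpy as np

def sequential_product(U, V):
    """P = (I - u_C v_C^T) ... (I - u_1 v_1^T), one factor at a time."""
    d = U.shape[1]
    P = np.eye(d)
    for u, v in zip(U, V):
        P = (np.eye(d) - np.outer(u, v)) @ P
    return P

def wy_product(U, V):
    """Same product in compact WY form: P = I - U^T Y with (I + L) Y = V."""
    C, d = U.shape
    L = np.tril(V @ U.T, k=-1)               # strictly lower-triangular scores v_t . u_i, i < t
    Y = np.linalg.solve(np.eye(C) + L, V)    # unit lower-triangular system
    return np.eye(d) - U.T @ Y

rng = np.random.default_rng(2)
C, d = 8, 5                                  # chunk length and key dimension (illustrative)
U = 0.3 * rng.standard_normal((C, d))        # rows play the role of erase-side vectors
V = 0.3 * rng.standard_normal((C, d))        # rows play the role of key-side vectors

wy_gap = np.abs(sequential_product(U, V) - wy_product(U, V)).max()
print(wy_gap)
```

The sequential loop has a dependency chain of length `C`, whereas the WY form replaces it with one matrix product, one triangular solve, and one more matrix product, which is the property that makes chunkwise training map well onto dense tensor operations.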

5. Gate-Aware Backward Propagation

Backpropagation must track per-channel gradients through the WY products. Gradients of the erase and write gates must be integrated inside the accumulation of the WY terms [expression missing] rather than applied afterward as a rescaling by a single scalar. For a given upstream gradient [expression missing], the process involves:

Updating an accumulated gradient with contributions from two terms [expressions missing]
Propagating two further gradient terms [expressions missing]
Adjusting a remaining term [expression missing]
Computing per-channel gradients for the gated quantities [expression missing] as required.

Absence of per-channel gradients during WY accumulation results in incorrect parameter updates. Thus, channel-wise gate information must be preserved throughout the backward pass.
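What "per-channel gradients" means can be seen on a single edit step. The sketch below assumes a hypothetical edit form, $S_t = \bar{S}_t - (b_t \odot k_t)(k_t^\top \bar{S}_t) + k_t (w_t \odot v_t)^\top$, which is not taken from the source, and a linear loss $\mathcal{L} = \sum_{ij} G_{ij} (S_t)_{ij}$ with an arbitrary upstream gradient $G$. It derives the gate gradients analytically and checks them against finite differences; it does not reproduce the chunkwise backward pass.

```python
import numpy as np

def edit(S_bar, k, v, b, w):
    # ASSUMED edit form (hypothetical, see the lead-in).
    return S_bar - np.outer(b * k, k @ S_bar) + np.outer(k, w * v)

def gate_grads(S_bar, k, v, G):
    """Per-channel gradients of L = sum(G * S_t) with respect to b and w."""
    grad_b = -k * (G @ (k @ S_bar))   # shape (d_k,): one entry per key channel
    grad_w = v * (k @ G)              # shape (d_v,): one entry per value channel
    return grad_b, grad_w

def finite_diff(f, z, eps=1e-6):
    # Central differences, one coordinate at a time.
    g = np.zeros_like(z)
    for i in range(z.size):
        dz = np.zeros_like(z)
        dz[i] = eps
        g[i] = (f(z + dz) - f(z - dz)) / (2 * eps)
    return g

rng = np.random.default_rng(3)
d_k, d_v = 4, 3
S_bar = rng.standard_normal((d_k, d_v))
G = rng.standard_normal((d_k, d_v))   # upstream gradient dL/dS_t
k = rng.standard_normal(d_k)
v = rng.standard_normal(d_v)
b = rng.uniform(size=d_k)
w = rng.uniform(size=d_v)

grad_b, grad_w = gate_grads(S_bar, k, v, G)
num_b = finite_diff(lambda z: np.sum(G * edit(S_bar, k, v, z, w)), b)
num_w = finite_diff(lambda z: np.sum(G * edit(S_bar, k, v, b, z)), w)
print(np.abs(grad_b - num_b).max(), np.abs(grad_w - num_w).max())
```

If a gate were tied to a single scalar, its gradient would be the sum of these per-channel entries, so the individual channel signals visible in `grad_b` and `grad_w` would be collapsed into one number.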

6. Empirical Results and Significance

Gated DeltaNet-2 models with channel-wise erase and write gates, trained at the 1.3B-parameter scale on 100B FineWeb-Edu tokens, outperform Mamba-2, Gated DeltaNet, KDA, and Mamba-3 across several tasks:

WikiText language modeling perplexity improved from approximately 16.8 to 15.9 (lower is better).
Zero-shot reasoning accuracy increased from approximately 52.3% to 53.1%.
Synthetic retrieval accuracy (S-NIAH-2 @4K) increased from 89.8% to 93.0%; multi-key retrieval (MK-NIAH-1 @4K) increased from 31.8% to 37.8%.
Real-world average recall rose by approximately 1.2–1.5 points.

Ablation studies reveal that collapsing either gate to a scalar consistently degrades performance, with the erase gate yielding the largest recall improvements on long-context benchmarks. Separate channel-wise control of erasure and writing is consequently essential for robust, interference-resistant, long-range fast-weight memory (Hatamizadeh et al., 21 May 2026).


References

Hatamizadeh et al. (21 May 2026). Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention.
