Channel-wise Erase and Write Gates

Updated 22 May 2026
  • Channel-wise Erase and Write Gates are mechanisms that enable per-channel control in recurrent linear attention architectures by decoupling erase and write operations.
  • They enhance memory fidelity and interference robustness in long-context sequence processing by separately modulating key-side erasure and value-side updates.
  • Empirical results demonstrate notable performance gains in language modeling, zero-shot reasoning, and synthetic retrieval through independent channel-level control.

Channel-wise erase and write gates are gating mechanisms that enable fine-grained, per-channel modulation of memory updates in recurrent linear attention architectures. Originally introduced in the Gated DeltaNet-2 framework, these gates separately control how existing memory content is erased (on the key side) and how new information is written (on the value side), addressing limitations of previous scalar-gated models. By decoupling erase and write operations and promoting channel-level specificity, channel-wise gates provide enhanced interference robustness and memory fidelity, particularly for long-context sequence processing (Hatamizadeh et al., 21 May 2026).

1. Mathematical Formulation

At each timestep $t$, Gated DeltaNet-2 computes two channel-wise gates from the token representation $x_t$:

  • Erase gate: $b_t = \sigma(W_b x_t) \in (0,1)^{d_k}$, with $W_b \in \mathbb{R}^{d_k \times d_{\rm model}}$, applies a sigmoid activation to a learned projection.
  • Write gate: $w_t = \sigma(W_w x_t) \in (0,1)^{d_v}$, with $W_w \in \mathbb{R}^{d_v \times d_{\rm model}}$, is similarly projected and passed through a sigmoid.

These control, respectively, which key-side channels are erased and which value-side channels are written. Additionally, a channel-wise decay vector $a_t = \exp(g_t) \in (0,1]^{d_k}$ modulates decay, where $g_t$ is obtained from a separate projection and activation.
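The gate computation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the projection `W_g` and the choice $g_t = -\mathrm{softplus}(W_g x_t)$ are assumptions made only so that $a_t$ lands in $(0,1]$, since the text says no more than that $g_t$ comes from a separate projection and activation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # log(1 + exp(x)), computed in a numerically stable way
    return np.logaddexp(0.0, x)

def compute_gates(x_t, W_b, W_w, W_g):
    """Return (b_t, w_t, a_t) for one token representation x_t."""
    b_t = sigmoid(W_b @ x_t)      # erase gate, shape (d_k,), values in (0, 1)
    w_t = sigmoid(W_w @ x_t)      # write gate, shape (d_v,), values in (0, 1)
    g_t = -softplus(W_g @ x_t)    # ASSUMED log-decay parameterization, g_t <= 0
    a_t = np.exp(g_t)             # channel-wise decay, values in (0, 1]
    return b_t, w_t, a_t

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_k, d_v = 16, 4, 6
x_t = rng.standard_normal(d_model)
W_b = rng.standard_normal((d_k, d_model))
W_w = rng.standard_normal((d_v, d_model))
W_g = rng.standard_normal((d_k, d_model))
b_t, w_t, a_t = compute_gates(x_t, W_b, W_w, W_g)
```

Note that the erase gate and the decay live in the key dimension $d_k$, while the write gate lives in the value dimension $d_v$.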

2. Fast-Weight Memory Update and Dynamics

Gated DeltaNet-2 maintains a fast-weight state $S_t \in \mathbb{R}^{d_k \times d_v}$. Its update comprises:

  • Decay: $\bar{S}_t = \mathrm{Diag}(a_t) S_{t-1}$
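  • Edit: a gated delta-rule correction maps $\bar{S}_t$ to $S_t$, where
    • the erase vector is formed on the key side under the erase gate $b_t$,
    • the write vector is formed on the value side under the write gate $w_t$,
    • $k_t$ and $v_t$ are the normalized key and value.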

The edit step performs a rank-one erasure and a rank-one write, each with per-channel modulation.

Equivalently, the update can be read as optimizing a local least-squares objective on the fast-weight state.
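A single decay-then-edit step can be sketched as follows. The exact placement of the gates inside the edit is an assumption here: the sketch uses $S_t = \bar{S}_t - k_t (e_t^\top \bar{S}_t) + k_t z_t^\top$ with erase vector $e_t = b_t \odot k_t$ and write vector $z_t = w_t \odot v_t$, where the names `e_t`, `z_t`, and `gated_delta_step` are hypothetical. This is one form consistent with the description above, not necessarily the paper's.

```python
import numpy as np

def gated_delta_step(S_prev, k_t, v_t, b_t, w_t, a_t):
    """One recurrent step under the ASSUMED form
    S_t = S_bar - k (e^T S_bar) + k z^T, with e = b * k and z = w * v."""
    S_bar = a_t[:, None] * S_prev            # decay: Diag(a_t) S_{t-1}
    e_t = b_t * k_t                          # erase vector (key side), shape (d_k,)
    z_t = w_t * v_t                          # write vector (value side), shape (d_v,)
    erased = np.outer(k_t, e_t @ S_bar)      # rank-one erasure term
    written = np.outer(k_t, z_t)             # rank-one write term
    return S_bar - erased + written

d_k, d_v = 4, 3
k = np.array([1.0, 0.0, 0.0, 0.0])           # unit-norm key
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([-1.0, 0.5, 4.0])
ones_k, ones_v = np.ones(d_k), np.ones(d_v)  # fully open gates, no decay (limit case)

S1 = gated_delta_step(np.zeros((d_k, d_v)), k, v1, ones_k, ones_v, ones_k)
S2 = gated_delta_step(S1, k, v2, ones_k, ones_v, ones_k)  # same key: old value erased, new one written
read1, read2 = k @ S1, k @ S2

# Closing the write gate on the middle value channel leaves that channel unwritten.
w_partial = np.array([1.0, 0.0, 1.0])
S3 = gated_delta_step(np.zeros((d_k, d_v)), k, v1, ones_k, w_partial, ones_k)
read3 = k @ S3
```

In this toy run, reading with the same key returns `v1` after the first step and `v2` after the second, which is the overwrite behaviour a delta rule is meant to provide. `read3` shows the write gate masking a single value channel.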

3. Relation to Prior Delta-Rule Models

Channel-wise gating generalizes prior delta-rule attention variants:

  • If both gates reduce to the same scalar $\beta_t$, that is $b_t = \beta_t \mathbf{1}_{d_k}$ and $w_t = \beta_t \mathbf{1}_{d_v}$, the update becomes Kimi Delta Attention (KDA):

$$S_t = \left(I - \beta_t k_t k_t^\top\right) \mathrm{Diag}(a_t)\, S_{t-1} + \beta_t k_t v_t^\top$$

  • If the decay also reduces to a scalar, $a_t = \alpha_t \mathbf{1}_{d_k}$, the formulation matches Gated DeltaNet:

$$S_t = \alpha_t \left(I - \beta_t k_t k_t^\top\right) S_{t-1} + \beta_t k_t v_t^\top$$

Thus, channel-wise erase and write gates subsume earlier scalar-gated approaches, enabling strictly more flexible memory control by allowing independent, channel-wise gating.
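The two reductions described above can be checked numerically. The sketch below assumes the general step has the form $S_t = \bar{S}_t - k_t (e_t^\top \bar{S}_t) + k_t z_t^\top$ with $e_t = b_t \odot k_t$ and $z_t = w_t \odot v_t$, which is an assumption about gate placement. It compares this against the commonly published KDA and Gated DeltaNet updates written in the $d_k \times d_v$ state convention used here. All function names are hypothetical.

```python
import numpy as np

def general_step(S_prev, k, v, b, w, a):
    # ASSUMED channel-wise form: decay, rank-one erase, rank-one write.
    S_bar = a[:, None] * S_prev
    return S_bar - np.outer(k, (b * k) @ S_bar) + np.outer(k, w * v)

def kda_step(S_prev, k, v, beta, a):
    # Scalar beta, channel-wise decay a.
    I = np.eye(k.shape[0])
    return (I - beta * np.outer(k, k)) @ (a[:, None] * S_prev) + beta * np.outer(k, v)

def gated_deltanet_step(S_prev, k, v, beta, alpha):
    # Scalar beta and scalar decay alpha.
    I = np.eye(k.shape[0])
    return alpha * (I - beta * np.outer(k, k)) @ S_prev + beta * np.outer(k, v)

rng = np.random.default_rng(0)
d_k, d_v = 5, 3
S_prev = rng.standard_normal((d_k, d_v))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)                       # normalized key
v = rng.standard_normal(d_v)
a = rng.uniform(0.5, 1.0, d_k)               # channel-wise decay
beta, alpha = 0.7, 0.9

# Both gates collapsed to the same scalar: matches KDA.
same_as_kda = np.allclose(
    general_step(S_prev, k, v, beta * np.ones(d_k), beta * np.ones(d_v), a),
    kda_step(S_prev, k, v, beta, a),
)

# Decay collapsed to a scalar as well: matches Gated DeltaNet.
same_as_gdn = np.allclose(
    general_step(S_prev, k, v, beta * np.ones(d_k), beta * np.ones(d_v), alpha * np.ones(d_k)),
    gated_deltanet_step(S_prev, k, v, beta, alpha),
)
```

Both flags come out true under the assumed form, since broadcasting a scalar over every channel turns the channel-wise products into ordinary scalar multiplications.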

4. Efficient Chunkwise WY Algorithm

For efficient GPU training, updates are performed over fixed-length chunks:

  • Log-decay terms are accumulated within each chunk and exponentiated to give cumulative decay factors.
  • Key and erase vectors are normalized by the cumulative decay.
  • These yield block matrices for every chunk.
  • The lower-triangular score matrix defines a triangular linear system.
  • WY auxiliaries are computed from that system, and final chunk updates are performed with small dense multiplications and diagonal scaling.

With a fixed chunk length, the algorithm keeps the overall cost linear in sequence length and maps efficiently onto tensor computation.
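The chunkwise steps above can be illustrated for a single chunk and checked against the token-by-token recurrence. This is a sketch under stated assumptions, not the paper's kernel: it assumes the update form $S_t = \bar{S}_t - k_t (e_t^\top \bar{S}_t) + k_t z_t^\top$ with $e_t = b_t \odot k_t$ and $z_t = w_t \odot v_t$, and a readout $o_t = S_t^\top q_t$. The names `sequential` and `chunk_wy` are hypothetical. `np.linalg.solve` is used for brevity although the system is unit lower-triangular.

```python
import numpy as np

def sequential(S0, Q, K, V, B, W, A):
    """Reference recurrence, one token at a time (ASSUMED update form)."""
    S, outs = S0.copy(), []
    for q, k, v, b, w, a in zip(Q, K, V, B, W, A):
        S_bar = a[:, None] * S
        S = S_bar - np.outer(k, (b * k) @ S_bar) + np.outer(k, w * v)
        outs.append(S.T @ q)                  # ASSUMED readout o_t = S_t^T q_t
    return np.stack(outs), S

def chunk_wy(S0, Q, K, V, B, W, A):
    """Same computation for one chunk via a triangular solve (WY-style)."""
    C = K.shape[0]
    G = np.cumsum(np.log(A), axis=0)          # accumulated log-decay, shape (C, d_k)
    gamma = np.exp(G)                         # cumulative decay per position and channel
    K_t = K / gamma                           # decay-normalized keys
    E_t = (B * K) * gamma                     # decay-normalized erase vectors
    Q_t = Q * gamma                           # decay-normalized queries
    Z = W * V                                 # write vectors
    L = np.tril(E_t @ K_t.T, k=-1)            # strictly lower-triangular scores
    U = np.linalg.solve(np.eye(C) + L, Z - E_t @ S0)    # WY auxiliaries
    O = Q_t @ S0 + np.tril(Q_t @ K_t.T) @ U   # outputs for every position in the chunk
    S_end = gamma[-1][:, None] * (S0 + K_t.T @ U)       # state handed to the next chunk
    return O, S_end

rng = np.random.default_rng(0)
C, d_k, d_v = 8, 4, 3
K = rng.standard_normal((C, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)  # normalized keys
Q = rng.standard_normal((C, d_k))
V = rng.standard_normal((C, d_v))
B = rng.uniform(0.1, 0.9, (C, d_k))            # erase gates
W = rng.uniform(0.1, 0.9, (C, d_v))            # write gates
A = rng.uniform(0.6, 1.0, (C, d_k))            # channel-wise decays
S0 = rng.standard_normal((d_k, d_v))

O_seq, S_seq = sequential(S0, Q, K, V, B, W, A)
O_chk, S_chk = chunk_wy(S0, Q, K, V, B, W, A)
```

Dividing keys by the cumulative decay is acceptable for a short chunk like this one. A production kernel would need more care with numerical range, which this sketch does not attempt.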

5. Gate-Aware Backward Propagation

Backpropagation must track per-channel gradients through the WY products. Gradients of the erase and write gates must be integrated inside the accumulation of these products rather than applied afterwards by scaling with a single scalar. Given the upstream gradient, the backward pass updates and propagates the intermediate gradients through the WY auxiliaries, adjusts them for the gates, and computes per-channel gradients for each gated quantity as required.

Absence of per-channel gradients during WY accumulation results in incorrect parameter updates. Thus, channel-wise gate information must be preserved throughout the backward pass.
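To make the per-channel requirement concrete, the sketch below derives gate gradients for a single edit step and checks them with finite differences. It is not the chunkwise WY backward pass. It assumes the step $S_t = \bar{S}_t - k_t (e_t^\top \bar{S}_t) + k_t z_t^\top$ with $e_t = b_t \odot k_t$ and $z_t = w_t \odot v_t$, and the names `step`, `gate_grads`, and `numeric_grad` are hypothetical.

```python
import numpy as np

def step(S_bar, k, v, b, w):
    # ASSUMED edit step applied to the already-decayed state S_bar.
    return S_bar - np.outer(k, (b * k) @ S_bar) + np.outer(k, w * v)

def gate_grads(S_bar, k, v, dS):
    """Per-channel gradients w.r.t. b and w, given dS = dLoss/dS_t."""
    r = dS.T @ k                  # shape (d_v,): upstream gradient projected onto the key
    db = -(S_bar @ r) * k         # one entry per key channel
    dw = r * v                    # one entry per value channel
    return db, dw

def numeric_grad(f, x, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S_bar = rng.standard_normal((d_k, d_v))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)
v = rng.standard_normal(d_v)
b = rng.uniform(0.1, 0.9, d_k)
w = rng.uniform(0.1, 0.9, d_v)
dS = rng.standard_normal((d_k, d_v))          # stand-in upstream gradient

# Scalar surrogate loss whose gradient w.r.t. S_t is exactly dS.
loss = lambda b_, w_: np.sum(dS * step(S_bar, k, v, b_, w_))

db, dw = gate_grads(S_bar, k, v, dS)
db_num = numeric_grad(lambda x: loss(x, w), b)
dw_num = numeric_grad(lambda x: loss(b, x), w)

# A gate shared across channels would only receive the sum of these entries.
scalar_erase_grad = db.sum()
```

The last line shows what a single-scalar treatment keeps: only the sum over channels. That is why the per-channel terms have to be carried through the accumulation rather than recovered afterwards.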

6. Empirical Results and Significance

At the 1.3B-parameter scale, trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 models with channel-wise erase and write gates outperform Mamba-2, Gated DeltaNet, KDA, and Mamba-3 across several tasks:

  • WikiText language modeling perplexity improved from approximately 16.8 to 15.9.
  • Zero-shot reasoning accuracy increased from ∼52.3% to 53.1%.
  • Synthetic retrieval (S-NIAH-2 @4K) increased from 89.8% to 93.0%; multi-key (MK-NIAH-1 @4K) from 31.8% to 37.8%.
  • Real-world average recall rose by ∼1.2–1.5 points.

Ablation studies reveal that collapsing either gate to a scalar consistently degrades performance, with the erase gate yielding the largest recall improvements on long-context benchmarks. Separate channel-wise control of erasure and writing is consequently essential for robust, interference-resistant, long-range fast-weight memory (Hatamizadeh et al., 21 May 2026).
