Channel-wise Erase and Write Gates
- Channel-wise Erase and Write Gates are mechanisms that enable per-channel control in recurrent linear attention architectures by decoupling erase and write operations.
- They enhance memory fidelity and interference robustness in long-context sequence processing by separately modulating key-side erasure and value-side updates.
- Empirical results demonstrate notable performance gains in language modeling, zero-shot reasoning, and synthetic retrieval through independent channel-level control.
Channel-wise erase and write gates are gating mechanisms that enable fine-grained, per-channel modulation of memory updates in recurrent linear attention architectures. Originally introduced in the Gated DeltaNet-2 framework, these gates separately control how existing memory content is erased (on the key side) and how new information is written (on the value side), addressing limitations of previous scalar-gated models. By decoupling erase and write operations and promoting channel-level specificity, channel-wise gates provide enhanced interference robustness and memory fidelity, particularly for long-context sequence processing (Hatamizadeh et al., 21 May 2026).
1. Mathematical Formulation
At each timestep $t$, Gated DeltaNet-2 computes two channel-wise gates from the token representation $x_t$:
- Erase gate: $b_t = \sigma(W_b x_t)$, with $b_t \in (0,1)^{d_k}$, obtained by applying a sigmoid activation to a learned projection.
- Write gate: $w_t = \sigma(W_w x_t)$, with $w_t \in (0,1)^{d_v}$, similarly projected and passed through a sigmoid.
These control, respectively, which key-side channels are erased and which value-side channels are written. Additionally, a channel-wise decay vector $\alpha_t$, produced by a separate projection and activation, modulates decay.
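The gate computation described above can be sketched in NumPy as follows. This is a minimal illustration rather than the reference implementation: the names `compute_gates`, `W_b`, `W_w`, and `W_a` are hypothetical, and the decay activation `exp(-softplus(.))` is an assumption chosen only because it maps into (0, 1); the text states only that the decay comes from a separate projection and activation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def compute_gates(x, W_b, W_w, W_a):
    """Per-channel gates for one token.

    x:   (d_model,) token representation
    W_b: (d_k, d_model) erase-gate projection (hypothetical name)
    W_w: (d_v, d_model) write-gate projection (hypothetical name)
    W_a: (d_k, d_model) decay projection (hypothetical name)
    """
    erase_gate = sigmoid(W_b @ x)          # one value per key channel
    write_gate = sigmoid(W_w @ x)          # one value per value channel
    # ASSUMPTION: this activation is illustrative, not taken from the source.
    decay = np.exp(-softplus(W_a @ x))     # one value per key channel, in (0, 1)
    return erase_gate, write_gate, decay

rng = np.random.default_rng(0)
d_model, d_k, d_v = 16, 4, 6
x = rng.standard_normal(d_model)
W_b = 0.1 * rng.standard_normal((d_k, d_model))
W_w = 0.1 * rng.standard_normal((d_v, d_model))
W_a = 0.1 * rng.standard_normal((d_k, d_model))
erase_gate, write_gate, decay = compute_gates(x, W_b, W_w, W_a)
```

The erase gate and decay have one entry per key channel, while the write gate has one entry per value channel, which is what lets the two sides be modulated independently.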
2. Fast-Weight Memory Update and Dynamics
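Gated DeltaNet-2 maintains a fast-weight state matrix $S_t$. Its update comprises:
- Decay: the previous state $S_{t-1}$ is scaled per channel by the decay vector.
- Edit: the decayed state then receives a rank-one erase and a rank-one write, where
- the erase vector is the erase gate applied elementwise to the key,
- the write vector is the write gate applied elementwise to the value,
- the key $k_t$ and value $v_t$ are both normalized.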
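This can be organized as a single recurrence that applies the decay, the erase, and the write in one step. This structure enables rank-one erasure and writing, with per-channel modulation.
Equivalently, the update can be read as a step on a local least-squares objective over the state, moving it toward that objective's stationary point.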
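A single recurrent step of the decay-then-edit update can be sketched as below. This is a hedged illustration, not the paper's code: the function name `gdn2_step` is hypothetical, the state is assumed to have shape (d_k, d_v), and the exact placement of the erase vector inside the rank-one erase term is an assumption of this sketch (the form used here collapses to the standard scalar-gated delta rule when both gates are the same scalar).

```python
import numpy as np

def gdn2_step(S_prev, k, v, erase_gate, write_gate, decay):
    """One decay + edit update of a (d_k, d_v) fast-weight state.

    ASSUMED form (illustrative only):
        S_dec = Diag(decay) @ S_prev
        S_new = S_dec - (erase_gate * k) (k^T S_dec) + k (write_gate * v)^T
    """
    S_dec = decay[:, None] * S_prev        # per-key-channel decay
    erase_vec = erase_gate * k             # gate applied to the normalized key
    write_vec = write_gate * v             # gate applied to the normalized value
    read = k @ S_dec                       # value currently stored under k, shape (d_v,)
    return S_dec - np.outer(erase_vec, read) + np.outer(k, write_vec)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S_prev = rng.standard_normal((d_k, d_v))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)                     # unit-norm key
v = rng.standard_normal(d_v)

# With no decay and both gates fully open, the old association for k is
# replaced exactly: reading the new state with k returns v.
S_new = gdn2_step(S_prev, k, v, np.ones(d_k), np.ones(d_v), np.ones(d_k))
readout = k @ S_new
```

Closing individual entries of `erase_gate` leaves the corresponding key channels of the old state untouched by the erase, while closing entries of `write_gate` suppresses the write on the corresponding value channels.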
3. Relation to Prior Delta-Rule Models
Channel-wise gating generalizes prior delta-rule attention variants:
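- If both gates reduce to the same scalar $\beta_t$, that is, the erase gate equals $\beta_t \mathbf{1}$ and the write gate equals $\beta_t \mathbf{1}$, the update becomes Kimi Delta Attention (KDA), written here with state $S_t$, normalized key $k_t$ and value $v_t$, and channel-wise decay $\alpha_t$:
$$S_t = \left(I - \beta_t k_t k_t^\top\right)\mathrm{Diag}(\alpha_t)\, S_{t-1} + \beta_t k_t v_t^\top$$
- If the decay also reduces to a scalar $\alpha_t$, the formulation matches Gated DeltaNet:
$$S_t = \alpha_t \left(I - \beta_t k_t k_t^\top\right) S_{t-1} + \beta_t k_t v_t^\top$$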
Thus, channel-wise erase and write gates subsume earlier scalar-gated approaches, enabling strictly more flexible memory control by allowing independent, channel-wise gating.
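The two reductions can be checked numerically. The sketch below is illustrative only: `channelwise_step` uses an assumed form of the channel-wise update (the placement of the erase vector in the rank-one erase term is an assumption, and all function names are hypothetical), while `kda_step` and `gated_deltanet_step` implement the scalar-gated delta-rule updates with channel-wise and scalar decay respectively.

```python
import numpy as np

def channelwise_step(S, k, v, b, w, alpha):
    # ASSUMED channel-wise form: decay, gated rank-one erase, gated rank-one write.
    S_dec = alpha[:, None] * S
    return S_dec - np.outer(b * k, k @ S_dec) + np.outer(k, w * v)

def kda_step(S, k, v, beta, alpha):
    # Scalar gate beta, channel-wise decay vector alpha.
    I = np.eye(k.shape[0])
    return (I - beta * np.outer(k, k)) @ (alpha[:, None] * S) + beta * np.outer(k, v)

def gated_deltanet_step(S, k, v, beta, a):
    # Scalar gate beta, scalar decay a.
    I = np.eye(k.shape[0])
    return a * (I - beta * np.outer(k, k)) @ S + beta * np.outer(k, v)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S = rng.standard_normal((d_k, d_v))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)
v = rng.standard_normal(d_v)
alpha = rng.uniform(0.5, 1.0, d_k)
beta, a = 0.7, 0.9

# Both gates collapsed to the same scalar: should match the KDA update.
reduced_to_kda = channelwise_step(
    S, k, v, np.full(d_k, beta), np.full(d_v, beta), alpha)
# Decay collapsed to a scalar as well: should match the Gated DeltaNet update.
reduced_to_gdn = channelwise_step(
    S, k, v, np.full(d_k, beta), np.full(d_v, beta), np.full(d_k, a))
```

With distinct per-channel gate values the channel-wise step generally no longer matches either scalar-gated update, which is the extra flexibility the section describes.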
4. Efficient Chunkwise WY Algorithm
For efficient GPU training, updates are performed over chunks of fixed length $C$:
- Per-channel log-decay terms are accumulated within each chunk, and their exponentials give the cumulative decay factors.
- Key and erase vectors are rescaled by these cumulative decay factors.
- These yield the block matrices used within each chunk.
- A lower-triangular score matrix built from the rescaled vectors defines a linear system.
- WY auxiliaries are computed from this system, and final chunk updates are performed with small dense multiplications and diagonal scaling.
This algorithm, with fixed $C$, keeps the cost linear in sequence length and maps efficiently onto tensor computation.
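The steps above can be illustrated with a small NumPy sketch that processes one chunk at once and compares the result with the token-by-token recurrence. It is a sketch under stated assumptions, not the paper's kernel: the recurrence in `sequential_chunk` is an assumed form of the channel-wise update, all names are hypothetical, a dense `np.linalg.solve` stands in for a triangular solve, and the explicit division by cumulative decay is used for clarity without regard to numerical range on long chunks.

```python
import numpy as np

def sequential_chunk(S0, k, v, b, w, alpha):
    # Reference: apply the (assumed) per-token update C times.
    S = S0.copy()
    for t in range(k.shape[0]):
        S = alpha[t][:, None] * S
        S = S - np.outer(b[t] * k[t], k[t] @ S) + np.outer(k[t], w[t] * v[t])
    return S

def chunkwise_wy(S0, k, v, b, w, alpha):
    C = k.shape[0]
    e, z = b * k, w * v                                  # erase / write vectors
    # Accumulate log-decay, then exponentiate: cumulative decay per channel.
    gamma = np.exp(np.cumsum(np.log(alpha), axis=0))     # (C, d_k)
    k_read = k * gamma                                   # keys used to read the state
    k_write = k / gamma                                  # keys used to write
    e_write = e / gamma                                  # erase vectors
    # Strictly lower-triangular scores: token t only sees tokens s < t.
    A_erase = np.tril(k_read @ e_write.T, -1)
    A_write = np.tril(k_read @ k_write.T, -1)
    # U[t] is what token t reads from the state just before its own edit.
    U = np.linalg.solve(np.eye(C) + A_erase, k_read @ S0 + A_write @ z)
    # Small dense products, then diagonal scaling by the end-of-chunk decay.
    R = S0 - e_write.T @ U + k_write.T @ z
    return gamma[-1][:, None] * R

rng = np.random.default_rng(0)
C, d_k, d_v = 8, 4, 3
S0 = rng.standard_normal((d_k, d_v))
k = rng.standard_normal((C, d_k))
k /= np.linalg.norm(k, axis=1, keepdims=True)
v = rng.standard_normal((C, d_v))
b = rng.uniform(0.0, 1.0, (C, d_k))
w = rng.uniform(0.0, 1.0, (C, d_v))
alpha = rng.uniform(0.8, 1.0, (C, d_k))

S_seq = sequential_chunk(S0, k, v, b, w, alpha)
S_chunk = chunkwise_wy(S0, k, v, b, w, alpha)
```

The only sequential dependency inside the chunk is captured by the unit lower-triangular system; everything else is matrix multiplication, which is why this layout suits tensor hardware.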
5. Gate-Aware Backward Propagation
Backpropagation must track per-channel gradients through the WY products. Gradients of the erase and write gates must be integrated inside the WY accumulation itself rather than applied afterward as post-scaling with a single scalar. For a given upstream gradient, the process involves:
- Updating the accumulated gradient with the contributions that feed into it
- Propagating the result to the intermediate quantities
- Adjusting the dependent terms
- Computing per-channel gradients for the gates as required.
Absence of per-channel gradients during WY accumulation results in incorrect parameter updates. Thus, channel-wise gate information must be preserved throughout the backward pass.
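To make the per-channel requirement concrete, the sketch below computes gate gradients for a single edit step and checks them against finite differences. It does not reproduce the WY backward pass: it covers one step only, uses an assumed form of the edit, and all names (`edit_step`, `gate_grads`, `numeric_grad`) are hypothetical.

```python
import numpy as np

def edit_step(S_dec, k, v, b, w):
    # ASSUMED edit: gated rank-one erase plus gated rank-one write.
    return S_dec - np.outer(b * k, k @ S_dec) + np.outer(k, w * v)

def loss(S_dec, k, v, b, w, G):
    # Scalar loss whose gradient w.r.t. the new state is the upstream G.
    return np.sum(G * edit_step(S_dec, k, v, b, w))

def gate_grads(S_dec, k, v, G):
    """Analytic per-channel gradients of `loss` w.r.t. the two gates."""
    read = k @ S_dec                 # (d_v,)
    grad_b = -k * (G @ read)         # one entry per key channel
    grad_w = v * (G.T @ k)           # one entry per value channel
    return grad_b, grad_w

def numeric_grad(f, x, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S_dec = rng.standard_normal((d_k, d_v))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)
v = rng.standard_normal(d_v)
b = rng.uniform(0.0, 1.0, d_k)
w = rng.uniform(0.0, 1.0, d_v)
G = rng.standard_normal((d_k, d_v))  # upstream gradient on the new state

grad_b, grad_w = gate_grads(S_dec, k, v, G)
num_b = numeric_grad(lambda bb: loss(S_dec, k, v, bb, w, G), b)
num_w = numeric_grad(lambda ww: loss(S_dec, k, v, b, ww, G), w)
```

If both gates were tied to a single shared scalar, the chain rule would deliver only `grad_b.sum() + grad_w.sum()`; the individual entries, which generally differ in sign and magnitude, are exactly the information a scalar post-scaling discards.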
6. Empirical Results and Significance
Gated DeltaNet-2 models featuring channel-wise erase and write gates, trained at the 1.3B-parameter scale on 100B FineWeb-Edu tokens, outperform Mamba-2, Gated DeltaNet, KDA, and Mamba-3 across several tasks:
- WikiText language modeling perplexity improved from approximately 16.8 to 15.9.
- Zero-shot reasoning accuracy increased from ∼52.3% to 53.1%.
- Synthetic retrieval (S-NIAH-2 @4K) increased from 89.8% to 93.0%; multi-key (MK-NIAH-1 @4K) from 31.8% to 37.8%.
- Real-world average recall rose by ∼1.2–1.5 points.
Ablation studies reveal that collapsing either gate to a scalar consistently degrades performance, with the erase gate yielding the largest recall improvements on long-context benchmarks. Separate channel-wise control of erasure and writing is consequently essential for robust, interference-resistant, long-range fast-weight memory (Hatamizadeh et al., 21 May 2026).