GatedFWA Memory Decay in Linear Attention
- GatedFWA Memory Decay is a learnable contraction mechanism that uses per-token, per-head gating to regulate memory persistence and prevent unbounded growth in sliding-window attention.
- It incorporates a decay bias into the attention logits, ensuring stable gradient propagation by balancing long-range credit assignment with effective memory truncation.
- Empirical evaluations show that GatedFWA outperforms SWA and matches or outperforms full Softmax attention on language modeling and recall-intensive benchmarks, while retaining linear-time efficiency.
GatedFWA Memory Decay refers to the learnable contraction mechanism used within the Gated Flash Windowed Attention (GatedFWA) operator, a variant of sliding-window attention designed to achieve both bounded memory updates and stable gradient propagation in efficient, linear-time autoregressive models. GatedFWA introduces a per-token, per-head gate that accumulates as a decay bias on attention logits, governing the effective persistence of associative memory states, thereby addressing both unbounded growth (characteristic of ordinary Sliding Window Attention) and vanishing updates (as in Softmax full attention) (Liu et al., 8 Dec 2025).
1. Gate Definition and Decay Bias Formulation
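GatedFWA implements memory decay by giving each attention head and token a learnable, non-negative gate $\alpha_i \ge 0$, which is derived from network inputs (symbols follow the kernel listing in Section 5):
- For layer $\ell$, head $n$, token $i$ (the layer and head indices are dropped below for readability):
- Input: the token's hidden state $x_i$
- Pre-activation: $h_i$, computed from $x_i$
- Amplitude: $b_i$, which scales the pre-activation to $z_i = b_i h_i$
- Gate: $\alpha_i = \operatorname{softplus}(z_i) / (b_i + \varepsilon) = \log(1 + e^{z_i}) / (b_i + \varepsilon)$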
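The gate is accumulated as a per-head prefix sum (with a single fused pass):
- Prefix sum: $u_i = -\sum_{k \le i} \alpha_k$
- Decay bias for tokens $j \le i$: $B_{ij} = u_i - u_j = -\sum_{k=j+1}^{i} \alpha_k \le 0$
At computation time, the decay bias is added to the scaled dot-product logits:
$$\Phi_{ij} = \frac{q_i^\top k_j}{\sqrt{d}} + u_i - u_j,$$
followed by a causal sliding-window Softmax over $j \in [i - w + 1,\, i]$, where $w$ is the window size and $d$ the head dimension.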
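The gate, prefix sum, and decay bias can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: the gate formula is read off the kernel listing in Section 5, while the function names (`softplus`, `gate`, `prefix_sums`, `decay_bias`) and the toy values of `h` and `b` are assumptions made for the example.

```python
import math

def softplus(z):
    # Numerically stable log(1 + exp(z)), using the same max-shift as the kernel listing.
    nu = max(z, 0.0)
    return nu + math.log(math.exp(z - nu) + math.exp(-nu))

def gate(h, b, eps=1e-6):
    # alpha_i = softplus(b_i * h_i) / (b_i + eps); positive whenever b_i + eps > 0.
    return [softplus(bi * hi) / (bi + eps) for hi, bi in zip(h, b)]

def prefix_sums(alpha):
    # u_i = -(alpha_0 + ... + alpha_i), accumulated in one pass.
    u, running = [], 0.0
    for a in alpha:
        running -= a
        u.append(running)
    return u

def decay_bias(u, window):
    # B[i][j] = u_i - u_j inside the causal window [i - window + 1, i], -inf elsewhere.
    n = len(u)
    B = [[-math.inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(max(0, i - window + 1), i + 1):
            B[i][j] = u[i] - u[j]
    return B

# Toy inputs (hypothetical values, one head, four tokens).
h = [0.5, -1.0, 2.0, 0.0]
b = [1.0, 1.0, 1.0, 1.0]
alpha = gate(h, b)
u = prefix_sums(alpha)
B = decay_bias(u, window=3)
```

Each in-window entry `B[i][j]` equals minus the sum of the gates between `j + 1` and `i`, so the bias is zero on the diagonal and grows more negative with distance.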
2. Learnable Contraction in Associative Memory
In the associative memory interpretation, GatedFWA modifies the classical sliding-window memory update by multiplying the previous memory with a contraction factor determined by the decay gate:
$$M_t = \lambda_t\, M_{t-1} + \Delta_t,$$
where $\lambda_t = \exp(-\alpha_t) \in (0, 1]$ and $\Delta_t$ denotes the sliding-window write and eviction terms at step $t$. The contraction factor $\lambda_t$ functions as the memory decay, ensuring that accumulated memory states remain bounded. This addresses the unbounded growth problem of ordinary SWA and avoids the excessive shrinkage of Softmax attention (Liu et al., 8 Dec 2025).
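A short derivation makes the link between the additive logit bias and the multiplicative contraction explicit. It assumes the prefix sum $u_i = -\sum_{k \le i} \alpha_k$ used in the kernel listing of Section 5; the symbols $\lambda_k$ for the per-step factor and $s_{ij}$ for the dot-product score are notation chosen here for illustration.

```latex
\begin{aligned}
\exp(u_i - u_j)
  &= \exp\Big(-\sum_{k=j+1}^{i} \alpha_k\Big)
   = \prod_{k=j+1}^{i} \lambda_k,
  \qquad \lambda_k = e^{-\alpha_k} \in (0, 1], \\
\exp(\Phi_{ij})
  &= \exp(s_{ij} + u_i - u_j)
   = \exp(s_{ij}) \prod_{k=j+1}^{i} \lambda_k .
\end{aligned}
```

Every step between key $j$ and query $i$ therefore multiplies the unnormalized attention weight by a factor of at most 1, which is the sense in which the gate acts as a contraction.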
3. Comparative Update Rules: GatedFWA vs. SWA vs. Softmax
A rigorous comparison between update rules reveals the unique properties of GatedFWA memory decay.
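| Mechanism | Update Rule | Contraction/Decay |
|---|---|---|
| Softmax | Vanishing update ($1/t$ weight) | Memory shrinks as $t$ grows |
| SWA | Previous memory kept with factor 1 | No decay, unbounded growth |
| GatedFWA | Previous memory multiplied by a gate-derived factor | Learnable contraction, bounded |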
Under Softmax attention, stored information vanishes rapidly as the sequence length $t$ grows; SWA applies no decay, leading to potentially unbounded memory. GatedFWA interpolates between these extremes through its learnable gating (Liu et al., 8 Dec 2025).
4. Gradient Propagation and Stability Analysis
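The memory contraction directly impacts gradient propagation. For $s < t$, writing $\lambda_k = \exp(-\alpha_k)$ for the per-step contraction factor, the gradient carried from $M_t$ back to $M_s$ is scaled by the product of the intervening factors:
$$\frac{\partial M_t}{\partial M_s} = \Big(\prod_{k=s+1}^{t} \lambda_k\Big) I .$$
Unlike Softmax ($1/t$ decay) or SWA (constant factor 1), GatedFWA enables the model to learn the appropriate degree of gradient flow:
- If $\lambda_k \approx 1$, gradients are preserved, supporting long-range credit assignment.
- If $\lambda_k \ll 1$, gradients are suppressed for distant states, providing relevance-based memory truncation.
As each $\lambda_k \in (0, 1]$, the gradient neither explodes nor collapses uncontrollably, yielding stable optimization. This facilitates deeper or longer-context models, without the memory expansion or signal loss inherent in alternative mechanisms (Liu et al., 8 Dec 2025).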
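The two regimes can be illustrated numerically. The sketch below assumes the per-step contraction factor is $\exp(-\alpha_k)$, as implied by the prefix-sum bias in the kernel listing of Section 5; the helper name `gradient_scale` and the constant gate values are hypothetical choices for the example.

```python
import math

def gradient_scale(alpha, s, t):
    # Product of exp(-alpha_k) for k = s+1 .. t, computed as exp of a sum for stability.
    return math.exp(-sum(alpha[s + 1 : t + 1]))

# Two hypothetical gate profiles over 512 tokens.
weak_decay = [0.001] * 512   # gates near zero: contraction factor close to 1
strong_decay = [0.5] * 512   # larger gates: contraction factor well below 1

long_range_weak = gradient_scale(weak_decay, 0, 511)       # about exp(-0.511)
long_range_strong = gradient_scale(strong_decay, 0, 511)   # about exp(-255.5)
short_range_strong = gradient_scale(strong_decay, 509, 511)  # about exp(-1.0)
```

With weak decay, more than half of the gradient magnitude survives across 511 steps. With strong decay, the long-range path is effectively cut while nearby states still receive a usable signal.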
5. Implementation Kernels and Fused Gate Preprocessing
GatedFWA memory decay is realized efficiently in practice using two main kernel routines: one for gate preprocessing (prefix sum) and another for attention computation. Both are compatible with FlashAttention-like architectures and linear in the sequence length.
Gate preprocessing (1-pass fused scan):
```
carry = 0
for chunk i in tiles:
    load h_i, b_i
    z_i = b_i * h_i
    ν_i = max(z_i, 0)                                   # shift for a stable softplus
    α_i = (ν_i + log(exp(z_i − ν_i) + exp(−ν_i))) * (b_i + ε)^(−1)
    prefix = cumsum(−α_i) + carry                       # local scan plus running offset
    store prefix into U at positions of chunk i
    carry += sum(−α_i)
return U
```
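A runnable counterpart of the scan above, written in plain Python for clarity rather than speed. It is a sketch under stated assumptions: the function name `gate_prefix_scan`, the `chunk_size` parameter, and the toy inputs are hypothetical, and a real kernel would process each chunk in parallel on the accelerator.

```python
import math

def gate_prefix_scan(h, b, chunk_size=4, eps=1e-6):
    """Chunked one-pass scan: U[i] = -(alpha_0 + ... + alpha_i)."""
    U = [0.0] * len(h)
    carry = 0.0                              # sum of -alpha over all earlier chunks
    for start in range(0, len(h), chunk_size):
        local = 0.0                          # running sum of -alpha inside this chunk
        for pos in range(start, min(start + chunk_size, len(h))):
            z = b[pos] * h[pos]
            nu = max(z, 0.0)                 # shift for a stable softplus
            alpha = (nu + math.log(math.exp(z - nu) + math.exp(-nu))) / (b[pos] + eps)
            local -= alpha
            U[pos] = carry + local
        carry += local
    return U

# Toy inputs (hypothetical values): the result should not depend on the chunk size.
h = [0.3, -1.2, 0.8, 2.0, -0.5, 0.1, 1.5]
b = [0.5, 1.0, 2.0, 1.0, 0.5, 1.5, 1.0]
U_chunked = gate_prefix_scan(h, b, chunk_size=3)
U_single = gate_prefix_scan(h, b, chunk_size=len(h))
```

Carrying only a single scalar between chunks is what lets the preprocessing run in one pass over the sequence.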
Attention compute (FlashAttention-style, with sliding mask + gate bias):
```
for each row-tile i:
    load Q_i, U^q_i
    o_i, l_i = 0, 0; m_i = −∞
    for each col-tile j intersecting window:
        load K_j, V_j, U^k_j
        Φ = Q_i·K_jᵀ + U^q_i·1ᵀ − 1·(U^k_j)ᵀ
        mask entries outside [q−w+1, q] ← −∞
        # Row-wise stable Softmax streaming update
        prev_m = m_i                         # remember the old max before updating it
        m_i = max(m_i, rowmax(Φ))
        P = exp(Φ−m_i)
        l_i = exp(prev_m−m_i)*l_i + rowsum(P)
        o_i = exp(prev_m−m_i)*o_i + P·V_j
    o_i = o_i / l_i
    write o_i to output
return O
```
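The streaming update can be checked against a direct Softmax on small inputs. The sketch below is a plain-Python illustration, not the fused kernel: it processes one query row at a time instead of row tiles, assumes `Q` is already scaled, and uses hypothetical names (`gated_window_attention`, `naive_reference`) and toy data.

```python
import math

def gated_window_attention(Q, K, V, U, window, tile=2):
    """Online-softmax sliding-window attention with gate bias U[i] - U[j]."""
    dv = len(V[0])
    out = []
    for i in range(len(Q)):
        lo = max(0, i - window + 1)                 # first key inside the causal window
        m, l, o = -math.inf, 0.0, [0.0] * dv        # running max, normaliser, output
        for start in range(lo, i + 1, tile):        # column tiles covering [lo, i]
            cols = range(start, min(start + tile, i + 1))
            phi = [sum(q * k for q, k in zip(Q[i], K[j])) + U[i] - U[j] for j in cols]
            prev_m, m = m, max(m, max(phi))
            scale = math.exp(prev_m - m)            # rescales earlier tiles; 0.0 on the first
            p = [math.exp(x - m) for x in phi]
            l = scale * l + sum(p)
            o = [scale * oc + sum(pj * V[j][c] for pj, j in zip(p, cols))
                 for c, oc in enumerate(o)]
        out.append([oc / l for oc in o])
    return out

def naive_reference(Q, K, V, U, window):
    """Direct windowed Softmax over the same logits, for comparison."""
    out = []
    for i in range(len(Q)):
        js = range(max(0, i - window + 1), i + 1)
        phi = [sum(q * k for q, k in zip(Q[i], K[j])) + U[i] - U[j] for j in js]
        mx = max(phi)
        w = [math.exp(x - mx) for x in phi]
        total = sum(w)
        out.append([sum(wj * V[j][c] for wj, j in zip(w, js)) / total
                    for c in range(len(V[0]))])
    return out

# Toy data (hypothetical values): six tokens, head dimension 2.
Q = [[0.1 * (i + 1), -0.2 * i] for i in range(6)]
K = [[0.3 * i, 0.1] for i in range(6)]
V = [[float(i), 1.0] for i in range(6)]
alpha = [0.3, 0.1, 0.5, 0.2, 0.4, 0.1]
U = [-sum(alpha[: i + 1]) for i in range(6)]

streamed = gated_window_attention(Q, K, V, U, window=4, tile=2)
direct = naive_reference(Q, K, V, U, window=4)
```

Because only in-window tiles are visited, the work per query is proportional to the window size, which is where the linear overall cost comes from.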
6. Empirical Findings and Qualitative Effects
Empirical evaluations demonstrate the efficacy of GatedFWA memory decay:
- Recall-intensive tasks (MQAR): GatedFWA maintains near-perfect recall, outperforming SWA and SSMs in the smaller configurations.
- Language modeling (WikiText103, OpenWebText): GatedFWA yields lower training/validation loss than SWA and meets or outperforms full Softmax in the reported settings. Scaling laws mirror those of full attention at large context windows.
- Downstream benchmarks (PIQA, HellaSwag, BoolQ): GatedFWA (with NSA integration) achieves higher accuracy than LLaMA+SWA for 340M-parameter models.
- Runtime characteristics: GatedFWA matches SWA in throughput (linear complexity) and runs faster than Softmax-FlashAttention at long sequence lengths.
- Layerwise gate behavior: Early layers learn strong decay (contraction factor well below 1), providing aggressive memory contraction, while deeper layers approach a contraction factor of 1, offering longer memory persistence. Gate smoothing ameliorates boundary artifacts in NSA compressive attention patterns.
A plausible implication is that per-token memory decay in GatedFWA enables context-dependent information retention without explicit recurrence or memory selection, while ensuring both statistical and computational stability (Liu et al., 8 Dec 2025).