GatedFWA Memory Decay in Linear Attention

Updated 13 March 2026
  • GatedFWA Memory Decay is a learnable contraction mechanism that uses per-token, per-head gating to regulate memory persistence and prevent unbounded growth in sliding-window attention.
  • It incorporates a decay bias into the attention logits, ensuring stable gradient propagation by balancing long-range credit assignment with effective memory truncation.
  • Empirical evaluations show that GatedFWA outperforms SWA and matches or outperforms full Softmax attention on language modeling and recall-intensive benchmarks, while retaining linear-time efficiency.

GatedFWA Memory Decay refers to the learnable contraction mechanism used within the Gated Flash Windowed Attention (GatedFWA) operator, a variant of sliding-window attention designed to achieve both bounded memory updates and stable gradient propagation in efficient, linear-time autoregressive models. GatedFWA introduces a per-token, per-head gate that accumulates as a decay bias on attention logits, governing the effective persistence of associative memory states, thereby addressing both unbounded growth (characteristic of ordinary Sliding Window Attention) and vanishing updates (as in Softmax full attention) (Liu et al., 8 Dec 2025).

1. Gate Definition and Decay Bias Formulation

GatedFWA implements memory decay by augmenting each attention head and token with a learnable, non-negative gate parameter $\alpha_t^{(l,h)}$, which is derived from network inputs:

  • For layer $l$, head $h$, token $t$:

    • Input: $\mathbf{x}_t^{(l)} \in \mathbb{R}^d$
    • Pre-activation: $h_t^{(l,h)} = \mathbf{x}_t^{(l)} \mathbf{W}_g^{(l,h)} + \mathbf{b}_g^{(l,h)} \in \mathbb{R}$
    • Amplitude: $\beta_t^{(l,h)} = 1 + \mathrm{ELU}(\mathbf{x}_t^{(l)} \mathbf{W}_\beta^{(l,h)}) > 0$
    • Gate (a softplus with learned sharpness $\beta$, matching the fused preprocessing kernel in Section 5):

    $$\alpha_t^{(l,h)} = \frac{1}{\beta_t^{(l,h)}}\,\mathrm{softplus}\bigl(\beta_t^{(l,h)} h_t^{(l,h)}\bigr) > 0$$

The gate is accumulated as a per-head prefix sum (with a single fused pass):

  • Prefix sum: $u_t^{(l,h)} = \sum_{q=1}^{t} \bigl(-\alpha_q^{(l,h)}\bigr)$
  • Decay bias for tokens $(t,j)$: $B_{t,j}^{(l,h)} = u_t^{(l,h)} - u_j^{(l,h)} = -\sum_{q=j+1}^{t} \alpha_q^{(l,h)} \leq 0$

At computation time, the decay bias is added to the scaled dot-product logits:

$$\widetilde\Phi_{t,j}^{(l,h)} = \frac{\mathbf{q}_t^{(l,h)} \cdot \mathbf{k}_j^{(l,h)\,\top}}{\sqrt{d_h}} + B_{t,j}^{(l,h)}$$

followed by a causal sliding-window Softmax over $j \in [t-w+1,\,t]$.
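A dense reference version of this step can help when checking a fused kernel. The sketch below materializes the full $T \times T$ logit matrix, so it is quadratic and only meant for small tests; the function name `gated_window_attention` and its signature are assumptions for illustration.

```python
import numpy as np

def gated_window_attention(q, k, v, alpha, w):
    """Dense O(T^2) reference for one head (hypothetical helper).

    q, k: (T, d_h); v: (T, d_v); alpha: (T,) positive gates; w: window size.
    """
    T, d_h = q.shape
    u = np.cumsum(-alpha)                                   # prefix sum of -alpha
    logits = q @ k.T / np.sqrt(d_h) + (u[:, None] - u[None, :])
    t_idx = np.arange(T)[:, None]
    j_idx = np.arange(T)[None, :]
    in_window = (j_idx <= t_idx) & (j_idx >= t_idx - w + 1)  # j in [t-w+1, t]
    logits = np.where(in_window, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)      # stable softmax
    p = np.exp(logits)
    p = p / p.sum(axis=1, keepdims=True)
    return p @ v, p

rng = np.random.default_rng(0)
T, w = 8, 3
q = rng.normal(size=(T, 4))
k = rng.normal(size=(T, 4))
v = rng.normal(size=(T, 2))
alpha = rng.uniform(0.05, 0.5, size=T)
out, p = gated_window_attention(q, k, v, alpha, w)
```

With `w = 1` each token attends only to itself, so the output reduces to `v`, which is a convenient sanity check.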

2. Learnable Contraction in Associative Memory

In the associative memory interpretation, GatedFWA modifies the classical sliding-window memory update by multiplying the previous memory with a contraction factor determined by the decay gate:

$$\mathbf{M}_t = \exp(-\alpha_t)\,\mathbf{M}_{t-1} + \frac{1}{w}\Bigl( \phi(\mathbf{k}_t)^\top\mathbf{v}_t - c_t\,\phi(\mathbf{k}_{t-w})^\top\mathbf{v}_{t-w} \Bigr)$$

where $c_t = \prod_{q=t-w+1}^{t-1} \exp(-\alpha_q) \in (0,1)$. The contraction factor $\exp(-\alpha_t) < 1$ functions as the memory decay, ensuring that accumulated memory states remain bounded. This addresses the unbounded growth problem of ordinary SWA and prevents the excessive shrinkage of Softmax attention (Liu et al., 8 Dec 2025).

3. Comparative Update Rules: GatedFWA vs. SWA vs. Softmax

A rigorous comparison between update rules reveals the unique properties of GatedFWA memory decay.

| Mechanism | Update rule | Contraction/decay |
|---|---|---|
| Softmax | $\mathbf{M}_t = \frac{t-1}{t}\mathbf{M}_{t-1} + \frac{1}{t}\phi(\mathbf{k}_t)^\top\mathbf{v}_t$ | Vanishing update, $\sim 1/t$ |
| SWA | $\mathbf{M}_t = \mathbf{M}_{t-1} + \frac{1}{w}\bigl(\phi(\mathbf{k}_t)^\top\mathbf{v}_t - \phi(\mathbf{k}_{t-w})^\top\mathbf{v}_{t-w}\bigr)$ | No decay (factor 1), unbounded growth |
| GatedFWA | $\mathbf{M}_t = \exp(-\alpha_t)\mathbf{M}_{t-1} + \frac{1}{w}\left(\phi(\mathbf{k}_t)^\top\mathbf{v}_t - c_t\,\phi(\mathbf{k}_{t-w})^\top\mathbf{v}_{t-w}\right)$ | Learnable contraction, bounded |

In Softmax attention, the weight given to each new token shrinks as $1/t$, so updates vanish as $t$ grows; SWA applies no decay to the previous memory, leading to potentially unbounded memory. GatedFWA interpolates between these extremes through its learnable gating (Liu et al., 8 Dec 2025).
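The three update rules can be run side by side on random data to make their structure concrete. The sketch below is illustrative only: `run_memories` is a hypothetical helper, $\phi$ is taken to be the identity map, no token is evicted before the window is full, and $c_t$ is computed exactly as the product stated in Section 2.

```python
import numpy as np

def run_memories(k, v, alpha, w):
    """Apply the Softmax, SWA, and GatedFWA memory updates from the comparison.

    k: (T, d_k) keys, v: (T, d_v) values, alpha: (T,) gates, w: window size.
    phi is the identity here (an assumption for illustration).
    """
    T = k.shape[0]
    outer = np.einsum("ti,tj->tij", k, v)        # outer[t] = phi(k_t)^T v_t
    decay = np.exp(-alpha)
    m_soft = np.zeros(outer.shape[1:])
    m_swa = np.zeros_like(m_soft)
    m_gate = np.zeros_like(m_soft)
    for t in range(T):                            # 0-based; position t + 1 in the text
        n = t + 1
        m_soft = (n - 1) / n * m_soft + outer[t] / n
        has_evicted = t - w >= 0                  # a token leaves the window
        evicted = outer[t - w] if has_evicted else np.zeros_like(m_soft)
        m_swa = m_swa + (outer[t] - evicted) / w
        # c_t = prod_{q=t-w+1}^{t-1} exp(-alpha_q), as stated in the text
        c_t = np.prod(decay[t - w + 1:t]) if has_evicted else 0.0
        m_gate = decay[t] * m_gate + (outer[t] - c_t * evicted) / w
    return m_soft, m_swa, m_gate

rng = np.random.default_rng(0)
T, w = 12, 4
k = rng.normal(size=(T, 3))
v = rng.normal(size=(T, 2))
alpha = rng.uniform(0.05, 0.5, size=T)
m_soft, m_swa, m_gate = run_memories(k, v, alpha, w)
```

In this sketch the Softmax rule reduces to a running mean of the outer products, and setting every gate to zero makes the GatedFWA rule coincide with the SWA rule, which is the no-decay end of the interpolation.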

4. Gradient Propagation and Stability Analysis

The memory contraction directly impacts gradient propagation. For $p < t$,

$$\frac{\partial\mathcal{L}_t}{\partial\mathbf{M}_p} = \left( \prod_{i=p+1}^{t} \exp(-\alpha_i) \right) \frac{\partial\mathcal{L}_t}{\partial\mathbf{M}_t}$$

Unlike Softmax ($1/t$ decay) or SWA (constant factor 1), GatedFWA enables the model to learn the appropriate degree of gradient flow:

  • If $\alpha_i \approx 0$, gradients are preserved, supporting long-range credit assignment.
  • If $\alpha_i \gg 0$, gradients are suppressed for distant states, providing relevance-based memory truncation.

As each $\exp(-\alpha) \in (0,1)$, the gradient neither explodes nor collapses uncontrollably, yielding stable optimization. This facilitates deeper or longer-context models, without the memory expansion or signal loss inherent in alternative mechanisms (Liu et al., 8 Dec 2025).
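Because the product of contraction factors equals $\exp\bigl(-\sum_{i=p+1}^{t}\alpha_i\bigr)$, the gradient attenuation between two positions can be read directly off the gate prefix sums. The sketch below illustrates this; `gradient_attenuation` is a hypothetical helper, and the constant gate values are chosen only to show the two regimes listed above.

```python
import numpy as np

def gradient_attenuation(alpha, p, t):
    """Return prod_{i=p+1}^{t} exp(-alpha_i) for 1-based positions p < t.

    alpha[i - 1] holds alpha_i; u[0] = 0 so that p = 0 means "from the start".
    """
    u = np.concatenate(([0.0], np.cumsum(-alpha)))  # u[i] = -sum_{q<=i} alpha_q
    return np.exp(u[t] - u[p])

# Small gates: the factor over 100 steps is about exp(-1), so gradients survive.
weak = gradient_attenuation(np.full(100, 0.01), p=0, t=100)
# Large gates: the factor is about exp(-100), so distant states are cut off.
strong = gradient_attenuation(np.full(100, 1.0), p=0, t=100)

rng = np.random.default_rng(0)
alpha = rng.uniform(0.05, 0.5, size=8)
factor = gradient_attenuation(alpha, p=2, t=5)
```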

5. Implementation Kernels and Fused Gate Preprocessing

GatedFWA memory decay is realized efficiently in practice using two kernel routines: one for gate preprocessing (prefix sum) and one for attention computation. Both are compatible with FlashAttention-like architectures and linear in the sequence length $N$.

Gate preprocessing (1-pass fused scan):

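carry = 0
for chunk i in tiles:
    load h_i, b_i                      # pre-activation h and amplitude β for this chunk
    z_i = b_i * h_i
    ν_i = max(z_i, 0)
    # stable softplus(z_i), divided by the amplitude
    α_i = (ν_i + log(exp(z_i - ν_i) + exp(-ν_i))) * (b_i + ε)^(-1)
    prefix = carry + cumsum(-α_i)      # u_t accumulates -α, as in Section 1
    store prefix into U at positions of chunk i
    carry += sum(-α_i)
return U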
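The chunked scan can be mirrored in NumPy to check that carrying the running total across tiles reproduces a single cumulative sum. This is a sketch under stated assumptions, not the fused kernel: `gate_prefix_scan`, the chunk size, and the value of `eps` are illustrative choices.

```python
import numpy as np

def gate_prefix_scan(h, beta, chunk=4, eps=1e-6):
    """Chunked prefix sum of -alpha with a running carry (hypothetical helper).

    h: (T,) pre-activations; beta: (T,) positive amplitudes.
    """
    T = h.shape[0]
    u = np.empty(T)
    carry = 0.0
    for start in range(0, T, chunk):
        sl = slice(start, start + chunk)
        z = beta[sl] * h[sl]
        nu = np.maximum(z, 0.0)
        # Stable softplus(z) divided by the amplitude.
        alpha = (nu + np.log(np.exp(z - nu) + np.exp(-nu))) / (beta[sl] + eps)
        u[sl] = carry + np.cumsum(-alpha)   # prefix within the chunk plus carry
        carry += (-alpha).sum()             # total decay seen so far
    return u

rng = np.random.default_rng(0)
h = rng.normal(size=10)
beta = rng.uniform(0.5, 2.0, size=10)
u = gate_prefix_scan(h, beta, chunk=4)
```

The sequence length here is deliberately not a multiple of the chunk size, so the last tile is partial.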

Attention compute (FlashAttention-style, with sliding mask + gate bias):

for each row-tile i:
    load Q_i, U^q_i
    o_i, l_i = 0, 0; m_i = -inf
    for each col-tile j intersecting window:
        load K_j, V_j, U^k_j
        # scaled logits plus decay bias B = u_t - u_j
        Φ = Q_i·K_jᵀ / sqrt(d_h) + U^q_i·1ᵀ - 1·(U^k_j)ᵀ
        set entries of Φ outside [q-w+1, q] to -inf
        # Row-wise stable Softmax streaming update
        m_new = max(m_i, rowmax(Φ))
        P = exp(Φ - m_new)
        l_i = exp(m_i - m_new)*l_i + rowsum(P)
        o_i = exp(m_i - m_new)*o_i + P·V_j
        m_i = m_new
    o_i = o_i / l_i
    write o_i to output
return O

Fused scanning makes the gate computation overhead negligible (e.g., $\sim 0.3$ ms vs. 2.9 ms for PyTorch's prefix sum) and preserves linear-time attention at sequence lengths of $N \geq 64\mathrm{K}$ (Liu et al., 8 Dec 2025).
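The streaming update can be exercised on the CPU with a tiled NumPy loop. The sketch below follows the same running max and running sum recurrences as the pseudocode, but it is only an illustration: `tiled_gated_window_attention`, the tile size, and the guard for rows with no visible key are assumptions, and nothing here reflects the actual GPU kernel layout.

```python
import numpy as np

def tiled_gated_window_attention(q, k, v, u, w, tile=4):
    """Tiled sliding-window attention with decay bias u_t - u_j (hypothetical helper).

    q, k: (T, d_h); v: (T, d_v); u: (T,) prefix sums of -alpha; w: window size.
    """
    T, d_h = q.shape
    scale = 1.0 / np.sqrt(d_h)
    out = np.zeros((T, v.shape[1]))
    for i0 in range(0, T, tile):
        rows = np.arange(i0, min(i0 + tile, T))
        m = np.full(len(rows), -np.inf)           # running row max
        l = np.zeros(len(rows))                   # running softmax denominator
        o = np.zeros((len(rows), v.shape[1]))     # running unnormalized output
        j_lo = max(rows[0] - w + 1, 0)            # first key any row here can see
        j_hi = rows[-1] + 1                       # causal upper bound (exclusive)
        for j0 in range(j_lo, j_hi, tile):
            cols = np.arange(j0, min(j0 + tile, j_hi))
            phi = scale * q[rows] @ k[cols].T + u[rows][:, None] - u[cols][None, :]
            visible = (cols[None, :] <= rows[:, None]) & (cols[None, :] >= rows[:, None] - w + 1)
            phi = np.where(visible, phi, -np.inf)
            m_new = np.maximum(m, phi.max(axis=1))
            # Guard: a row with no visible key so far keeps m_new = -inf.
            safe_m = np.where(np.isfinite(m_new), m_new, 0.0)
            p = np.exp(phi - safe_m[:, None])
            corr = np.exp(m - safe_m)             # rescale previous accumulators
            l = corr * l + p.sum(axis=1)
            o = corr[:, None] * o + p @ v[cols]
            m = m_new
        out[rows] = o / l[:, None]
    return out

rng = np.random.default_rng(0)
T, w = 10, 3
q = rng.normal(size=(T, 4))
k = rng.normal(size=(T, 4))
v = rng.normal(size=(T, 2))
u = np.cumsum(-rng.uniform(0.05, 0.5, size=T))
out = tiled_gated_window_attention(q, k, v, u, w, tile=4)
```

Each row tile only visits column tiles that overlap its window, so the work per row is proportional to $w$ rather than to the sequence length.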

6. Empirical Findings and Qualitative Effects

Empirical evaluations demonstrate the efficacy of GatedFWA memory decay:

  • Recall-intensive tasks (MQAR): GatedFWA maintains near-perfect recall up to $N=512$, outperforming SWA and SSMs at small $d$.
  • Language modeling (WikiText103, OpenWebText): GatedFWA yields lower training/validation loss than SWA and meets or outperforms full Softmax for $N=4096$. Scaling laws mirror those of full attention at large context windows.
  • Downstream benchmarks (PIQA, HellaSwag, BoolQ): GatedFWA (with NSA integration) outperforms LLaMA+SWA in accuracy for 340M-parameter models.
  • Runtime characteristics: GatedFWA matches SWA in throughput (linear complexity), up to $30\times$ faster than Softmax-FlashAttention at $N \geq 64\mathrm{K}$.
  • Layerwise gate behavior: Early layers learn strong decay ($\exp(-\alpha) \ll 1$), providing aggressive memory contraction, while deeper layers approach $\exp(-\alpha) \approx 1$, offering longer memory persistence. Gate smoothing ameliorates boundary artifacts in NSA compressive attention patterns.

A plausible implication is that per-token memory decay in GatedFWA enables context-dependent information retention without explicit recurrence or memory selection, while ensuring both statistical and computational stability (Liu et al., 8 Dec 2025).

References (1)
