GatedFWA Memory Decay in Linear Attention

Updated 13 March 2026

GatedFWA Memory Decay is a learnable contraction mechanism that uses per-token, per-head gating to regulate memory persistence and prevent unbounded growth in sliding-window attention.
It incorporates a decay bias into the attention logits, ensuring stable gradient propagation by balancing long-range credit assignment with effective memory truncation.
Empirical evaluations show that GatedFWA outperforms SWA and matches or exceeds full Softmax attention on language modeling, and performs well on recall-intensive benchmarks, while running in linear time.

GatedFWA Memory Decay refers to the learnable contraction mechanism used within the Gated Flash Windowed Attention (GatedFWA) operator, a variant of sliding-window attention designed to achieve both bounded memory updates and stable gradient propagation in efficient, linear-time autoregressive models. GatedFWA introduces a per-token, per-head gate that accumulates as a decay bias on attention logits, governing the effective persistence of associative memory states, thereby addressing both unbounded growth (characteristic of ordinary Sliding Window Attention) and vanishing updates (as in Softmax full attention) (Liu et al., 8 Dec 2025).

1. Gate Definition and Decay Bias Formulation

GatedFWA implements memory decay by assigning each token and attention head a learnable, strictly positive gate $\alpha_t^{(l,h)}$, computed from the layer input:

For layer $l$, head $h$, token $t$:
- Input: $\mathbf{x}_t^{(l)} \in \mathbb{R}^d$
- Pre-activation: $h_t^{(l,h)} = \mathbf{x}_t^{(l)} \mathbf{W}_g^{(l,h)} + b_g^{(l,h)} \in \mathbb{R}$
- Amplitude: $\beta_t^{(l,h)} = 1 + \mathrm{ELU}(\mathbf{x}_t^{(l)} \mathbf{W}_\beta^{(l,h)}) > 0$
- Gate:
$\alpha_t^{(l,h)} = \frac{1}{\beta_t^{(l,h)}}\,\mathrm{softplus}\bigl(\beta_t^{(l,h)} h_t^{(l,h)}\bigr) > 0$
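The gate computation for one token and one head can be sketched in a few lines. This is a minimal illustration, assuming the softplus-over-amplitude form $\alpha = \mathrm{softplus}(\beta h)/\beta$ that the preprocessing kernel in Section 5 evaluates. The function names, the `eps` guard, and the sample weights are hypothetical and not taken from the paper.

```python
import math

def elu(z: float) -> float:
    """ELU with unit scale: z for z > 0, exp(z) - 1 otherwise."""
    return z if z > 0 else math.expm1(z)

def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def gate(x, w_g, b_g, w_beta, eps=1e-6):
    """Return (alpha, beta) for one token and one head.

    x, w_g, w_beta are length-d lists; b_g is a scalar bias.
    eps guards the division when beta underflows (an assumption of this sketch).
    """
    h = sum(xi * wi for xi, wi in zip(x, w_g)) + b_g              # pre-activation
    beta = 1.0 + elu(sum(xi * wi for xi, wi in zip(x, w_beta)))   # amplitude
    alpha = softplus(beta * h) / (beta + eps)                     # gate > 0
    return alpha, beta

x = [0.5, -1.0, 2.0]
alpha, beta = gate(x, w_g=[0.1, 0.2, -0.3], b_g=0.0, w_beta=[0.3, 0.0, 0.1])
# exp(-alpha) is the per-step contraction factor, always in (0, 1)
print(alpha, beta, math.exp(-alpha))
```

Since softplus is strictly positive and the amplitude is positive, the resulting gate is positive for any input, which is what makes the decay bias non-positive.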

The gate is accumulated as a per-head prefix sum (with a single fused pass):

- Prefix sum: $u_t^{(l,h)} = \sum_{q=1}^t (-\alpha_q^{(l,h)})$
- Decay bias for tokens $(t,j)$ with $j \le t$: $B_{t,j}^{(l,h)} = u_t^{(l,h)} - u_j^{(l,h)} = -\sum_{q=j+1}^t \alpha_q^{(l,h)} \leq 0$
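Because the bias enters the logits additively, exponentiating it gives a multiplicative decay on each unnormalized attention weight. The following step uses only the definitions above:

$\exp\bigl(B_{t,j}^{(l,h)}\bigr) = \exp\Bigl(-\sum_{q=j+1}^{t} \alpha_q^{(l,h)}\Bigr) = \prod_{q=j+1}^{t} \exp\bigl(-\alpha_q^{(l,h)}\bigr) \in (0,1] \quad (j \le t)$

As an illustrative case, with a constant gate $\alpha_q = 0.5$, a key three steps back ($t - j = 3$) is down-weighted by $\exp(-1.5) \approx 0.22$ before normalization.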

At computation time, the decay bias is added to the scaled dot-product logits:

$\widetilde\Phi_{t,j}^{(l,h)} = \frac{\mathbf{q}_t^{(l,h)} \cdot \mathbf{k}_j^{(l,h)\,\top}}{\sqrt{d_h}} + B_{t,j}^{(l,h)}$

followed by a causal sliding-window Softmax over $j \in [t-w+1,\,t]$.
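A direct (non-streaming) version of this computation can be sketched as follows. It assumes the scaled dot products are already available as a matrix `scores`, and it uses 0-indexed positions rather than the 1-indexed formulas above. The function names are illustrative only.

```python
import math
from itertools import accumulate

def prefix_sums(alphas):
    """u[t] = sum of -alpha_q for q <= t; the bias is B[t][j] = u[t] - u[j]."""
    return list(accumulate(-a for a in alphas))

def gated_window_weights(scores, alphas, w):
    """Attention weights for a causal window of size w.

    scores[t][j] holds the already-scaled logit q_t . k_j / sqrt(d_h).
    """
    u = prefix_sums(alphas)
    n = len(alphas)
    weights = [[0.0] * n for _ in range(n)]
    for t in range(n):
        lo = max(0, t - w + 1)                    # left edge of the window
        logits = [scores[t][j] + u[t] - u[j] for j in range(lo, t + 1)]
        m = max(logits)                           # subtract max for stability
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        for j, e in zip(range(lo, t + 1), exps):
            weights[t][j] = e / total
    return weights

# With identical content scores, only the decay bias shapes the weights.
W = gated_window_weights([[0.0] * 4 for _ in range(4)], [0.5] * 4, w=3)
print(W[3])  # position 0 is outside the window; weights grow toward position 3
```

With all content scores equal, the weights inside the window rise monotonically toward the current token, which isolates the effect of the gate from the effect of query-key similarity.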

2. Learnable Contraction in Associative Memory

In the associative memory interpretation, GatedFWA modifies the classical sliding-window memory update by multiplying the previous memory with a contraction factor determined by the decay gate:

$\mathbf{M}_t = \exp(-\alpha_t)\,\mathbf{M}_{t-1} + \frac{1}{w}\Bigl( \phi(\mathbf{k}_t)^\top\mathbf{v}_t - c_t\,\phi(\mathbf{k}_{t-w})^\top\mathbf{v}_{t-w} \Bigr)$

where $c_t = \prod_{q=t-w+1}^{t-1} \exp(-\alpha_q) \in (0,1)$. The contraction factor $\exp(-\alpha_t) < 1$ acts as the memory decay, keeping accumulated memory states bounded. This addresses the unbounded growth of ordinary SWA while avoiding the vanishing $1/t$ updates of Softmax attention (Liu et al., 8 Dec 2025).
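The recurrence can be run directly on small matrices. The sketch below follows the update rule and the definition of $c_t$ exactly as stated here. It additionally assumes that $\phi$ is the identity, that the memory starts at zero, and that the eviction term is skipped while fewer than $w$ tokens have been seen. None of these choices is confirmed by the source.

```python
import math

def outer(k, v):
    """phi(k)^T v, with phi taken as the identity (an assumption of this sketch)."""
    return [[ki * vj for vj in v] for ki in k]

def gated_memory(keys, values, alphas, w):
    """Run M_t = exp(-a_t) M_{t-1} + (1/w)(k_t^T v_t - c_t k_{t-w}^T v_{t-w})."""
    d_k, d_v = len(keys[0]), len(values[0])
    M = [[0.0] * d_v for _ in range(d_k)]
    for t in range(len(keys)):
        decay = math.exp(-alphas[t])              # contraction factor
        new = outer(keys[t], values[t])
        if t - w >= 0:
            # c_t: product of exp(-alpha_q) for q = t-w+1 .. t-1
            c = math.exp(-sum(alphas[t - w + 1:t]))
            old = outer(keys[t - w], values[t - w])
        else:
            c, old = 0.0, [[0.0] * d_v for _ in range(d_k)]
        M = [[decay * M[i][j] + (new[i][j] - c * old[i][j]) / w
              for j in range(d_v)] for i in range(d_k)]
    return M

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
print(gated_memory(keys, values, alphas=[0.0] * 4, w=2))   # limiting case: no decay
print(gated_memory(keys, values, alphas=[0.7] * 4, w=2))   # contracted memory
```

Setting every gate to zero (a limiting case, since the gate itself is strictly positive) makes both $\exp(-\alpha_t)$ and $c_t$ equal to 1, so the update reduces to the plain sliding-window rule.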

3. Comparative Update Rules: GatedFWA vs. SWA vs. Softmax

Placing the three update rules side by side shows what the decay gate changes.

| Mechanism | Update Rule | Contraction/Decay |
|---|---|---|
| Softmax | $\mathbf{M}_t = \frac{t-1}{t}\mathbf{M}_{t-1} + \frac{1}{t}\phi(\mathbf{k}_t)^\top\mathbf{v}_t$ | Vanishing update $\sim 1/t$ |
| SWA | $\mathbf{M}_t = \mathbf{M}_{t-1} + \frac{1}{w}\bigl(\phi(\mathbf{k}_t)^\top\mathbf{v}_t - \phi(\mathbf{k}_{t-w})^\top\mathbf{v}_{t-w}\bigr)$ | No decay (factor 1), unbounded growth |
| GatedFWA | $\mathbf{M}_t = \exp(-\alpha_t)\mathbf{M}_{t-1} + \frac{1}{w}\left(\phi(\mathbf{k}_t)^\top\mathbf{v}_t - c_t\,\phi(\mathbf{k}_{t-w})^\top\mathbf{v}_{t-w}\right)$ | Learnable contraction, bounded |

Under Softmax attention, the weight of each new update shrinks as $1/t$, so new information has a vanishing effect as $t$ grows; SWA applies no decay, leading to potentially unbounded memory. GatedFWA interpolates between these extremes through its learnable gating (Liu et al., 8 Dec 2025).

4. Gradient Propagation and Stability Analysis

The memory contraction directly affects gradient propagation. For $p < t$,

$\frac{\partial\mathcal{L}_t}{\partial\mathbf{M}_p} = \left( \prod_{i=p+1}^t \exp(-\alpha_i) \right) \frac{\partial\mathcal{L}_t}{\partial\mathbf{M}_t}$

Unlike Softmax ($1/t$ decay) or SWA (constant factor 1), GatedFWA enables the model to learn the appropriate degree of gradient flow:

- If $\alpha_i \approx 0$, gradients are preserved, supporting long-range credit assignment.
- If $\alpha_i \gg 0$, gradients are suppressed for distant states, providing relevance-based memory truncation.

Because each factor $\exp(-\alpha_i)$ lies in $(0,1)$, the product never exceeds 1, so the gradient cannot explode, and the rate at which it shrinks is set by the learned gates rather than fixed in advance. This yields stable optimization and facilitates deeper or longer-context models, without the memory expansion or signal loss inherent in alternative mechanisms (Liu et al., 8 Dec 2025).
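The two regimes can be checked numerically by evaluating the product of contraction factors for a given span. This is a small sketch with made-up gate values; `grad_scale` is an illustrative helper, not a function from the paper.

```python
import math

def grad_scale(alphas, p, t):
    """Factor relating dL_t/dM_p to dL_t/dM_t: prod_{i=p+1}^{t} exp(-alpha_i)."""
    return math.exp(-sum(alphas[p + 1:t + 1]))

n = 100
small = [0.001] * n   # gates near zero: gradient is largely preserved
large = [0.5] * n     # strong decay: distant states are cut off

print(grad_scale(small, 0, 99))  # exp(-0.099), roughly 0.91
print(grad_scale(large, 0, 99))  # exp(-49.5), effectively zero
```

Computing the product as the exponential of a sum mirrors the prefix-sum formulation of the decay bias and avoids multiplying many small factors.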

5. Implementation Kernels and Fused Gate Preprocessing

GatedFWA memory decay is realized efficiently in practice using two kernel routines, one for gate preprocessing (prefix sum) and one for attention computation. Both are compatible with FlashAttention-like architectures and linear in $N$.

Gate preprocessing (1-pass fused scan):

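carry = 0
for chunk i in tiles:
    load h_i, β_i                      # pre-activation and amplitude
    z_i = β_i * h_i
    ν_i = max(z_i, 0)
    # numerically stable softplus(z_i), divided by the amplitude
    α_i = (ν_i + log(exp(z_i−ν_i)+exp(−ν_i))) * (β_i+ε)^(−1)
    prefix = cumsum(−α_i) + carry      # shift the local scan by earlier chunks
    store prefix into U at positions of chunk i
    carry += sum(−α_i)
return U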
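The scan above can be mirrored in plain Python to check that carrying a running total across chunks reproduces the global prefix sum. This is a sequential sketch of the same arithmetic, not the fused GPU kernel. The chunk size, `eps` value, and function names are assumptions made for illustration.

```python
import math
from itertools import accumulate

def softplus(z):
    """Stable log(1 + exp(z)), written as nu + log(exp(z - nu) + exp(-nu))."""
    nu = max(z, 0.0)
    return nu + math.log(math.exp(z - nu) + math.exp(-nu))

def gate_prefix_scan(h, beta, chunk=4, eps=1e-6):
    """Chunked scan returning U[t] = sum over q <= t of (-alpha_q)."""
    U, carry = [], 0.0
    for start in range(0, len(h), chunk):
        hs, bs = h[start:start + chunk], beta[start:start + chunk]
        neg_alpha = [-softplus(b * x) / (b + eps) for x, b in zip(hs, bs)]
        local = list(accumulate(neg_alpha))     # cumsum within the chunk
        U.extend(x + carry for x in local)      # shift by the running carry
        carry += local[-1]                      # equals carry += sum(-alpha_i)
    return U

h = [0.3, -1.2, 0.7, 2.0, -0.5, 0.1, 1.5]
beta = [1.0, 0.5, 2.0, 1.5, 0.8, 1.2, 1.0]
print(gate_prefix_scan(h, beta, chunk=3))
```

Because every gate is positive, the returned prefix sums are strictly decreasing, and the result is independent of the chunk size up to floating-point rounding.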

Attention compute (FlashAttention-style, with sliding mask + gate bias):

for each row-tile i:
    load Q_i, U^q_i
    o_i, l_i = 0, 0; m_i = −∞
    for each col-tile j intersecting window:
        load K_j, V_j, U^k_j
        # scaled logits plus decay bias B = u_query − u_key
        Φ = Q_i·K_jᵀ/√d_h + U^q_i·1ᵀ − 1·(U^k_j)ᵀ
        mask entries with key index outside [q−w+1, q] ← −∞
        # Row-wise stable Softmax streaming update
        m_new = max(m_i, rowmax(Φ))
        P = exp(Φ − m_new)
        l_i = exp(m_i − m_new)*l_i + rowsum(P)   # rescale by the old max
        o_i = exp(m_i − m_new)*o_i + P·V_j
        m_i = m_new
    o_i = o_i / l_i
    write o_i to output
return O
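The streaming update can be verified against a direct Softmax on a toy input. The sketch below processes one query row at a time and tiles only over keys, which is a simplification of the row-tile and column-tile layout above. It applies the $1/\sqrt{d_h}$ scaling from Section 1. All names and sample values are illustrative.

```python
import math
from itertools import accumulate

def logit(q, k, u_q, u_k):
    """Scaled dot product plus decay bias u_q - u_k."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) + u_q - u_k

def streaming_attention(Q, K, V, U, w, tile=2):
    """Causal sliding-window attention with gate bias and online Softmax."""
    out = []
    for t in range(len(Q)):
        m, l, o = -math.inf, 0.0, [0.0] * len(V[0])
        for start in range(max(0, t - w + 1), t + 1, tile):
            cols = range(start, min(start + tile, t + 1))
            phi = [logit(Q[t], K[j], U[t], U[j]) for j in cols]
            m_new = max(m, max(phi))
            rescale = math.exp(m - m_new)       # exp(-inf) = 0 on the first tile
            p = [math.exp(z - m_new) for z in phi]
            l = rescale * l + sum(p)
            o = [rescale * oc + sum(pj * V[j][c] for pj, j in zip(p, cols))
                 for c, oc in enumerate(o)]
            m = m_new
        out.append([oc / l for oc in o])
    return out

def reference_attention(Q, K, V, U, w):
    """Direct Softmax over the whole window, for comparison."""
    out = []
    for t in range(len(Q)):
        js = range(max(0, t - w + 1), t + 1)
        phi = [logit(Q[t], K[j], U[t], U[j]) for j in js]
        mx = max(phi)
        p = [math.exp(z - mx) for z in phi]
        out.append([sum(pj * V[j][c] for pj, j in zip(p, js)) / sum(p)
                    for c in range(len(V[0]))])
    return out

Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
U = list(accumulate(-a for a in [0.1, 0.2, 0.3, 0.4, 0.5]))
fast, slow = streaming_attention(Q, K, V, U, w=3), reference_attention(Q, K, V, U, w=3)
print(max(abs(a - b) for r, s in zip(fast, slow) for a, b in zip(r, s)))
```

The printed value is the largest absolute difference between the two outputs and should be at the level of floating-point rounding, since the streaming rescaling is algebraically equivalent to a single Softmax over the window.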

Fused scanning adds negligible overhead for gate computation (e.g., $\sim 0.3$ ms vs. PyTorch's 2.9 ms for prefix sums) and preserves linear-time attention at sequence lengths $N \geq 64\mathrm{K}$ (Liu et al., 8 Dec 2025).

6. Empirical Findings and Qualitative Effects

Empirical evaluations demonstrate the efficacy of GatedFWA memory decay:

- Recall-intensive tasks (MQAR): GatedFWA maintains near-perfect recall up to $N=512$, outperforming SWA and SSMs at small $d$.
- Language modeling (WikiText-103, OpenWebText): GatedFWA yields lower training and validation loss than SWA and matches or outperforms full Softmax at $N=4096$. Its scaling behavior mirrors that of full attention at large context windows.
- Downstream benchmarks (PIQA, HellaSwag, BoolQ): GatedFWA (with NSA integration) outperforms LLaMA+SWA in accuracy for 340M-parameter models.
- Runtime characteristics: GatedFWA matches SWA in throughput (linear complexity) and is up to $30\times$ faster than Softmax FlashAttention at $N\geq64\mathrm{K}$.
- Layerwise gate behavior: Early layers learn strong decay ($\exp(-\alpha) \ll 1$), providing aggressive memory contraction, while deeper layers approach $\exp(-\alpha) \approx 1$, offering longer memory persistence. Gate smoothing ameliorates boundary artifacts in NSA compressive attention patterns.

A plausible implication is that per-token memory decay in GatedFWA enables context-dependent information retention without explicit recurrence or memory selection, while ensuring both statistical and computational stability (Liu et al., 8 Dec 2025).

References (1)

GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory (2025)
