Decay Linear Transformer

Updated 16 September 2025
  • Decay Linear Transformers are neural sequence models that integrate learned decay factors into linear attention to emphasize recent tokens over distant context.
  • They employ scalar or vector decay mechanisms to balance computational efficiency and stability, enhancing performance in language, vision, and graph tasks.
  • Empirical evidence shows that tuning decay parameters (median ≈0.8) minimizes attention dilution and reduces reliance on additional positional encoding.

Decay Linear Transformers are a family of neural sequence models in which the canonical quadratic-cost self-attention mechanism is replaced or augmented with linear-cost update rules that apply a controlled “decay” to the internal memory or historical state. These architectures are characterized by explicit or learned decay mechanisms, applied at scalar or vector granularity, that modulate the influence of previous tokens, thereby embedding locality priors, improving computational tractability, and stabilizing training. Implementations range from recurrent state construction in parameterized linear attention layers to explicit decay masks on attention scores, and even temporal decay in event-driven spiking neural networks. Recent work systematically delineates their parameterization, sharing strategies, decay granularity, and interplay with positional encoding, providing a principled basis for designing, tuning, and deploying these models in language, vision, and graph domains (Qin et al., 5 Sep 2025).

1. Formalization of Decay in Linear Attention

Decay mechanisms in linear attention are defined by augmenting the state recurrence or attention weights with decay factors that attenuate the contribution of previous elements. The typical linear self-attention update is reformulated as

S_t = \lambda_t \odot S_{t-1} + k_t v_t^\top

o_t^\top = q_t^\top S_t

where S_t is the running state, λ_t is the decay factor (a scalar, or a vector per head, feature, or dimension), k_t, v_t, q_t are the projected key, value, and query, and ⊙ denotes elementwise multiplication. The formulation can be varied as follows:

  • For scalar decay, λ_t ∈ (0, 1) and is shared across all channels.
  • For vector decay, λ_t is computed per feature dimension (e.g., λ_t ∈ (0, 1)^d).
  • Decay mechanisms can be learned, fixed (e.g., exponential), or input-dependent (via gating).

Variants include explicit decay masks applied to the attention matrix, as in spatially-aware or graph-structured models, where decay is a function of positional or structural distance.
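
The recurrence above is straightforward to prototype. The NumPy sketch below is illustrative only: the function name, tensor shapes, the single-head layout, and the absence of output normalization are assumptions for exposition, not a specific published implementation.

```python
import numpy as np

def decay_linear_attention(q, k, v, lam):
    """Decayed linear-attention recurrence: S_t = lam_t * S_{t-1} + k_t v_t^T, o_t^T = q_t^T S_t.
    q, k: (T, d_k); v: (T, d_v); lam: (T, d_k) per-feature decays in (0, 1),
    or any array broadcastable to (T, d_k) (e.g., shape (T, 1)) for scalar decay."""
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))                        # running state S_t
    outputs = np.zeros((T, d_v))
    for t in range(T):
        lam_t = np.broadcast_to(lam[t], (d_k,))     # scalar or vector decay for this step
        S = lam_t[:, None] * S + np.outer(k[t], v[t])   # decay old state, write new outer product
        outputs[t] = q[t] @ S                       # o_t^T = q_t^T S_t
    return outputs

# toy usage
rng = np.random.default_rng(0)
T, d_k, d_v = 8, 16, 16
q = rng.standard_normal((T, d_k))
k = rng.standard_normal((T, d_k))
v = rng.standard_normal((T, d_v))
lam = np.full((T, 1), 0.8)                          # constant scalar decay near the reported ≈0.8 optimum
out = decay_linear_attention(q, k, v, lam)          # (T, d_v)
```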

2. Design Space: Parameterization, Sharing, and Granularity

A recent systematic study identifies four critical architectural axes of decay design (Qin et al., 5 Sep 2025):

| Dimension | Options/Findings | Observed Effects |
| --- | --- | --- |
| Parameterization | Scalar/vector, dedicated/shared, nonlinear gating | Parameterization quality (e.g., a median decay ≈0.8) is critical |
| Parameter sharing | Shared (key/decay tied) vs. independent | Arbitrary sharing is detrimental; can cause under-/over-decay |
| Granularity | Uniform (scalar) vs. feature-wise (vector) | Vector decay generally superior, but scalar can suffice if tuned |
| Positional encoding | With/without explicit relative encoding (e.g., RoPE) | With strong decay, relative methods add little |

The concrete computation of the decay factor may follow forms such as:

\lambda_t = \sigma(W x_t + b)

or vectorized analogs, where W is a learned parameter matrix and σ is often a sigmoid.

A “Simple Decay” variant sets λ_t = sigmoid(p) for a single learnable scalar p, yielding strong performance with a minimal parameter count (Qin et al., 5 Sep 2025).
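
Both parameterizations can be sketched directly. In the snippet below, the function names, the shapes of W and b, and the initialization of p to land near a decay of 0.8 are illustrative assumptions, not any paper's exact recipe.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# (a) Input-dependent vector decay: lambda_t = sigmoid(W x_t + b), one value per feature.
def input_dependent_decay(x, W, b):
    """x: (T, d_model); W: (d_k, d_model); b: (d_k,). Returns (T, d_k) decays in (0, 1)."""
    return sigmoid(x @ W.T + b)

# (b) "Simple Decay": a single learnable scalar p with lambda = sigmoid(p),
# shared across steps and channels. Initializing p = log(0.8 / 0.2) ≈ 1.386
# places the decay at the ≈0.8 operating point reported to work well.
p = np.log(0.8 / 0.2)
simple_lambda = sigmoid(p)   # ≈ 0.8
```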

3. Impact on Locality, Memory, and Position Encoding

The essential function of decay is to induce a locality prior within the attention mechanism. In the absence of decay, linear attention mechanisms suffer from attention dilution, where the model fails to focus on relevant tokens, notably those in local neighborhoods (Qin et al., 2022). Properly tuned decay factors ensure recency bias, which empirically improves performance across language and vision tasks.

When decay is applied aggressively (decay values strictly less than one), the naturally diminishing influence of distant tokens reduces the marginal utility of relative positional encoding methods such as RoPE. The recurrence encodes relative position implicitly, making additional positional encoding largely redundant in most configurations (Qin et al., 5 Sep 2025).
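
Unrolling the recurrence makes this implicit relative-position bias explicit. Assuming S_0 = 0 and scalar decay, the state and output expand as

S_t = \sum_{s=1}^{t} \Big( \prod_{r=s+1}^{t} \lambda_r \Big) k_s v_s^\top, \qquad o_t^\top = q_t^\top S_t = \sum_{s=1}^{t} \Big( \prod_{r=s+1}^{t} \lambda_r \Big) (q_t^\top k_s) \, v_s^\top

so with a constant decay λ the coefficient of each past token reduces to λ^{t-s}, a function of the relative distance alone.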

In graph domains, this manifests as an exponential decay mask:

M_{ij} = \lambda^{\mathrm{ReLU}(\psi(v_i, v_j) - sp)}

where ψ(v_i, v_j) is a structural or spatial distance and sp a learnable threshold. This construction modulates attention based on explicit topological proximity (Liu et al., 24 Apr 2024).
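
A minimal sketch of such a mask follows, under the assumption that it multiplicatively scales the unnormalized attention weights before row normalization (formulations differ on where the mask is applied); the distance matrix, λ, and the threshold sp are illustrative placeholders.

```python
import numpy as np

def decay_mask(dist, lam=0.8, sp=1.0):
    """dist: (N, N) pairwise structural/spatial distances psi(v_i, v_j).
    Returns M with M_ij = lam ** ReLU(psi(v_i, v_j) - sp)."""
    return lam ** np.maximum(dist - sp, 0.0)   # ReLU leaves nodes within the threshold undamped

def masked_attention(scores, dist, lam=0.8, sp=1.0):
    """scores: (N, N) raw attention logits; the decay mask scales the
    unnormalized weights, which are then row-normalized."""
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True)) * decay_mask(dist, lam, sp)
    return weights / weights.sum(axis=-1, keepdims=True)

# toy usage: a 4-node path graph with hop-count distances
dist = np.array([[0., 1., 2., 3.],
                 [1., 0., 1., 2.],
                 [2., 1., 0., 1.],
                 [3., 2., 1., 0.]])
attn = masked_attention(np.zeros((4, 4)), dist)   # uniform logits, attenuated by distance
```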

4. Empirical Performance and Model Stability

Experimental results consistently underscore the performance benefits of well-parameterized decay:

  • In language modeling, only decay configurations whose median values center near 0.8 yield optimal results; excessively rapid or insufficient decay causes either underutilization of context or harmful attention dilution (Qin et al., 5 Sep 2025).
  • Scalar decay, despite its limited expressiveness, can compete with vector decay if it keeps decay values in a near-optimal range; in some initialization regimes, scalar parameterizations even outperform vector forms.
  • Excessive or ill-tuned parameter sharing degrades accuracy by skewing decay values outside workable bounds, as observed in models such as LightNet and GLA.
  • Integrating decay with low-rank and diagonal parameterizations, and applying normalization schemes that control gradient scales (e.g., NormAttention, RMSNorm), further improves stability and convergence (Qin et al., 2022); a minimal normalization sketch follows this list.
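
For reference, here is a minimal sketch of RMS-style normalization of the kind cited above; the gain handling and epsilon are common defaults, not a claim about any specific model's implementation.

```python
import numpy as np

def rms_norm(x, gain=None, eps=1e-6):
    """x: (..., d). Rescales each vector to unit root-mean-square, then applies an
    elementwise gain (initialized to ones here)."""
    if gain is None:
        gain = np.ones(x.shape[-1])
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gain
```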

5. Applications Across Domains

Decay Linear Transformers have demonstrated efficacy in a broad spectrum of sequence and structured data scenarios:

  • Autoregressive generation: Decaying fast-weight state mechanisms achieve near-parity with full self-attention in GPT-2, with O(1) per-token inference and highly competitive perplexity on datasets like WikiText-103 (Mao, 2022).
  • Vision: Spatial-aware decay, reset at raster-scan boundaries in image generation, preserves authentic 2D context and bridges the quality gap between quadratic and linear attention for high-resolution ImageNet models. LASADGen attains state-of-the-art FID/IS scores with linear complexity (Mao et al., 2 Jul 2025).
  • Graphs: Exponential decay masks in Graph Transformers maintain strong local focus and allow network depth scaling without the degradation typical in deep GNNs, outperforming 14 baseline models on chemical/biological classification and regression (Liu et al., 24 Apr 2024).
  • Spiking networks: Multi-basis exponential decay neurons enable near-lossless conversion from trained ANNs to spiking models, supporting ViT, RoBERTa, and GPT-2 with minimal timestep and energy overhead (Wang et al., 11 Aug 2025).

6. Practical Guidelines and Future Directions

Designing decay mechanisms requires a careful tradeoff among parameterization quality, resource efficiency, and representation capacity:

  • Set decay parameterization (learned or fixed) to yield a median decay of approximately 0.8.
  • Avoid indiscriminate parameter sharing in decay computation; adopt independent or weakly coupled decay for each head or feature dimension.
  • Select vector granularity if resource constraints allow; scalar decay may suffice for deployment if its initialization is tuned to maintain a suitable locality bias.
  • Consider explicit spatial or topological constraints, as in spatial-aware and graph-aware models, to adapt decay for non-sequential structures.
  • Forego additional relative positional encoding if decay is strong, as the recurrence already incorporates recency/locality bias.

Open research directions include extending decay mechanisms for dynamic adaptation (e.g., noise- or content-dependence), integrating decay with learnable normalization or gating, and investigating the interplay of decay and optimization regularization in non-sequential and event-driven settings.

7. Summary Table: Key Decay Design Dimensions

| Dimension | Best Practices | Notable Pitfalls |
| --- | --- | --- |
| Parameterization | Median decay ≈0.8; learnable/nonlinear f | Poor initialization; no offset |
| Sharing | Avoid arbitrary key/decay coupling | Oversharing causes drift |
| Granularity | Prefer vectorized decay if feasible | Scalar can suffice if tuned |
| Position encoding | Usually unnecessary with decay < 1 | Redundant with strong decay |

In summary, Decay Linear Transformers provide a unified, principled framework for scalable sequence modeling by generalizing attention recurrence with tunable, learnable decay. This design space, now systematically charted, informs the architecture of next-generation, efficient transformers across language, vision, and graph-structured learning (Qin et al., 5 Sep 2025).
