Gated Linear Attention

Updated 19 December 2025
  • Gated Linear Attention is a neural sequence modeling method that integrates data-dependent gating to enhance expressivity, numerical stability, and controllable memory updates.
  • Its kernel-based recurrence and gating mechanism enable sub-quadratic time and memory complexity while dynamically managing state retention and forgetting.
  • GLA has demonstrated significant empirical gains in language, vision, speech, diffusion, and recommendation systems, improving both computational efficiency and model performance.

Gated Linear Attention (GLA) is a class of neural sequence modeling mechanisms that augment linear attention with data-dependent gating, conferring enhanced expressivity, numerical stability, and controllable memory updates. GLA mechanisms generalize standard linear attention and gated recurrent architectures, prominently enabling sub-quadratic time and memory in transformer-like models for language modeling, vision, diffusion, speech, and recommendation. The gating function introduces dynamic, input-conditioned contraction or forgetting in the context state, addressing key deficiencies of both softmax attention (vanishing gradients, restricted memory utilization) and vanilla kernel-based linear attention (unstable recurrences, low-rank bottlenecks, and compromised global context).

1. Mathematical Foundations

The central operation of Gated Linear Attention builds on the kernel-based linearization of the softmax attention kernel. In canonical linear attention, the output for a query $q_t$ is constructed as

$$o_t = \frac{q_t S_t}{q_t z_t},$$

where

$$S_t = S_{t-1} + k_t^\top v_t, \qquad z_t = z_{t-1} + k_t^\top,$$

with $k_t, v_t$ the key and value projections at position $t$. GLA generalizes this recurrence by introducing a learnable decay or gating tensor $G_t$:

$$S_t = G_t \odot S_{t-1} + k_t^\top v_t, \qquad z_t = G_t \odot z_{t-1} + k_t^\top,$$

with $\odot$ denoting element-wise multiplication, and $G_t \in [0,1]^{d_k \times d_v}$ computed via a small per-position neural network (often using a sigmoid activation) (Lemerle et al., 30 Oct 2024, Yang et al., 2023).
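
A minimal NumPy sketch of this gated recurrence, assuming a per-key-dimension sigmoid gate that is broadcast over the value dimension (actual GLA variants differ in gate shape and in whether the normalizer $z_t$ is retained):

```python
import numpy as np

def softplus(x):
    # Positive feature map keeps the normalizer q_t z_t bounded away from zero.
    return np.log1p(np.exp(x))

def gla_recurrence(q, k, v, g, eps=1e-6):
    """q, k, g: (N, d_k) with gate g in (0, 1); v: (N, d_v)."""
    N, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))                # context state S_t
    z = np.zeros(d_k)                       # normalizer state z_t
    out = np.zeros((N, d_v))
    for t in range(N):
        S = g[t][:, None] * S + np.outer(k[t], v[t])   # S_t = G_t ⊙ S_{t-1} + k_t^T v_t
        z = g[t] * z + k[t]                            # z_t = G_t ⊙ z_{t-1} + k_t^T
        out[t] = (q[t] @ S) / (q[t] @ z + eps)         # o_t = q_t S_t / (q_t z_t)
    return out

# Toy usage with softplus features and sigmoid gates.
rng = np.random.default_rng(0)
N, d_k, d_v = 16, 8, 8
q = softplus(rng.standard_normal((N, d_k)))
k = softplus(rng.standard_normal((N, d_k)))
v = rng.standard_normal((N, d_v))
g = 1.0 / (1.0 + np.exp(-rng.standard_normal((N, d_k))))
print(gla_recurrence(q, k, v, g).shape)     # (16, 8)
```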

Variants such as GatedFWA further incorporate per-token gates $\alpha_t$ into the bias of the attention logits under a sliding window mask:

$$\widetilde{\Phi}_{ij}^{(l)} = \frac{q_i k_j^\top}{\sqrt{d_h}} + (u_i - u_j),$$

where $u_t = -\sum_{q=1}^{t} \alpha_q$ accumulates the decay prefix, and $\alpha_t$ is produced from the input via fused MLPs and softplus preprocessing (Liu et al., 8 Dec 2025). The corresponding associative memory recurrence becomes

$$M_t = \exp(-\alpha_t)\, M_{t-1} + \frac{1}{w}\left[\phi(k_t)^\top v_t - c_t\, \phi(k_{t-w})^\top v_{t-w}\right], \qquad c_t = \prod_{j=t-w+1}^{t-1} \exp(-\alpha_j).$$

Bounding the norm of $M_t$ controls credit assignment and gradient flow.
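
A sketch of this gated sliding-window memory recurrence; the choice of feature map $\phi$, the window size, and the eviction bookkeeping are illustrative assumptions, and the fused-kernel implementation is not reproduced here:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def gated_window_memory(k, v, alpha, w, phi=softplus):
    """k: (N, d_k); v: (N, d_v); alpha: (N,) nonnegative per-token decays; w: window size."""
    N, d_k = k.shape
    d_v = v.shape[1]
    M = np.zeros((d_k, d_v))
    states = []
    for t in range(N):
        # c_t = prod_{j=t-w+1}^{t-1} exp(-alpha_j): decay accrued across the window.
        lo = max(t - w + 1, 0)
        c_t = np.exp(-alpha[lo:t].sum())
        update = np.outer(phi(k[t]), v[t])
        if t - w >= 0:
            # Evict the token that just left the window, scaled by its residual decay.
            update -= c_t * np.outer(phi(k[t - w]), v[t - w])
        M = np.exp(-alpha[t]) * M + update / w   # M_t = exp(-alpha_t) M_{t-1} + (1/w)[...]
        states.append(M.copy())
    return np.stack(states)                      # (N, d_k, d_v)
```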

Feature map selection is also critical: ReGLA uses a normalized exponential mapping with variance-reduction scaling, i.e., for $u \in \mathbb{R}^d$,

$$\phi_q(u)_i = \exp\!\left(u_i - \max_j u_j\right),$$

and divides by a scaling factor to maintain stable variance (Lu et al., 3 Feb 2025).
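
A minimal sketch of such a normalized exponential feature map; the exact scaling constant used by ReGLA is not reproduced, and dividing by $\sqrt{d}$ is an assumption made only to illustrate variance control:

```python
import numpy as np

def normalized_exp_feature_map(u):
    """u: (..., d). Elementwise exp(u_i - max_j u_j), rescaled along the last axis."""
    shifted = u - u.max(axis=-1, keepdims=True)     # phi(u)_i = exp(u_i - max_j u_j)
    return np.exp(shifted) / np.sqrt(u.shape[-1])   # illustrative variance-control scaling
```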

2. Algorithmic and Hardware-Efficient Implementation

GLA modules are amenable to recurrent or chunkwise parallel processing. The base forward loop per layer is

$$\text{for } t = 1, \ldots, N: \quad S_t = G_t \odot S_{t-1} + k_t^\top v_t, \quad o_t = q_t S_t.$$

In transformer blocks, GLA is deployed either in a fully recurrent decoding mode for $\mathcal{O}(N d^2)$ time (Lemerle et al., 30 Oct 2024, Yang et al., 2023) or in blockwise parallel modes with materialization/non-materialization schemes leveraging modern GPU memory hierarchies (HBM vs. SRAM). FlashLinearAttention and related kernels fuse projection, gating, and context-state updates to minimize I/O, storing only blockwise accumulators in off-chip memory and running high-throughput matmul tiling in on-chip memory domains (Yang et al., 2023, Liu et al., 8 Dec 2025).
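
A simplified NumPy illustration of the chunkwise mode, assuming a per-key-dimension gate and omitting the normalizer $z_t$: the state is carried across chunks while tokens inside a chunk are handled with dense matrix products. Real FlashLinearAttention-style kernels fuse these steps, work with stabilized (e.g., log-space) decay factors, and manage HBM/SRAM placement, none of which this sketch attempts; chunks are kept small here to avoid underflow of the cumulative gate products.

```python
import numpy as np

def gla_chunkwise(q, k, v, g, chunk=16):
    """q, k, g: (N, d_k) with gate g in (0, 1]; v: (N, d_v). Normalizer omitted."""
    N, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))                 # state carried across chunks
    out = np.zeros((N, d_v))
    for start in range(0, N, chunk):
        idx = slice(start, min(start + chunk, N))
        qc, kc, vc, gc = q[idx], k[idx], v[idx], g[idx]
        # Cumulative gate product from the chunk start to each position.
        decay = np.cumprod(gc, axis=0)                       # (c, d_k)
        # Inter-chunk term: carried-in state, decayed to each position.
        inter = (qc * decay) @ S                             # (c, d_v)
        # Intra-chunk term: causal pairwise scores with relative decay decay[t]/decay[s].
        scores = np.tril((qc * decay) @ (kc / decay).T)      # (c, c)
        out[idx] = inter + scores @ vc
        # Roll the state forward to the end of the chunk.
        S = decay[-1][:, None] * S + ((decay[-1] / decay) * kc).T @ vc
    return out
```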

Bidirectional and locality-aware GLA designs, such as ViG, merge forward and backward scanning into single GPU kernels for vision tasks, utilizing direction-wise gating and 2D locality injection through gated depthwise convolution (Liao et al., 28 May 2024).

3. Expressivity, Memory, and Gradient Flow

The gating mechanism in GLA serves as a selective weighting operator, modulating the memory contraction and token contributions:

  • Selective Forgetting: As opposed to the static global decay in RetNet or ALiBi, data-dependent gates $G_t$ allow fine-grained, input-conditioned memory retention or erasure, bringing the update mechanism closer to softmax's sharp attention distributions (Yang et al., 2023).
  • Weighted Context Aggregation: Gating realizes a prompt-specific Weighted Preconditioned Gradient Descent (WPGD), assigning data-driven sample weights $\omega_j = \prod_{t=j+1}^{n+1} G(z_t)$ to value-key pairs, strictly improving context-aware learning versus uniform linear attention (Li et al., 6 Apr 2025); see the sketch after this list.
  • Control of Gradient Vanishing or Explosion: The gating architecture softly bounds accumulated states, curing both softmax's vanishing gradients (due to $1/t$ shrinkage) and linear attention's unbounded memory growth (Liu et al., 8 Dec 2025, Lu et al., 3 Feb 2025).
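
A small sketch of the data-driven sample weights from the weighted-aggregation bullet above: each value-key pair $j$ is weighted by the product of all gates applied after it. Treating the gates as scalars is an assumption made for clarity.

```python
import numpy as np

def wpgd_weights(gates):
    """gates: (n+1,) scalar gate values G(z_t), t = 1..n+1.
    Returns omega_j = prod_{t=j+1}^{n+1} G(z_t) for j = 1..n."""
    # Reverse cumulative product, shifted so position j collects gates j+1..n+1.
    rev_cumprod = np.cumprod(gates[::-1])[::-1]
    return rev_cumprod[1:]                    # omega_1..omega_n

print(wpgd_weights(np.array([0.9, 0.5, 0.8, 1.0])))
# -> [0.4 0.8 1. ]   (e.g. omega_1 = 0.5 * 0.8 * 1.0)
```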

Refinements in gate design, e.g., ReGLA's composite gating formula

$$F_t = (1-G_t) \odot G_t^2 + G_t \odot \left(1-(1-G_t)^2\right),$$

preserve larger gradients near extreme gate values, promoting better training dynamics (Lu et al., 3 Feb 2025).
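
For concreteness, the composite gate can be transcribed directly; treating $G_t$ as the output of an elementwise sigmoid in $(0,1)$ is an assumption, and ReGLA's surrounding normalization is not reproduced.

```python
import numpy as np

def refined_gate(G):
    """G: array of raw gate values in (0, 1); returns the composite gate F_t."""
    return (1.0 - G) * G**2 + G * (1.0 - (1.0 - G)**2)
```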

4. Connections to RNNs and Implicit Attention

GLA bridges modern recurrent models and self-attention via multiplicative gating. Linear-diagonal gated RNNs with appropriate gate and readout mappings exactly implement causal linear self-attention (Zucchet et al., 2023). The closed form for the hidden state after $t$ steps,

$$h_t = \sum_{\tau=1}^{t} \left( \prod_{k=\tau+1}^{t} f_k \right) \odot (i_\tau \odot u_\tau),$$

can be interpreted as summing weighted value vectors with implicit attention weights (Zimerman et al., 26 May 2024). Extraction and visualization of these attention matrices (e.g., for explainability, XAI) leverage the implicit formula, yielding competitive or superior attribution maps compared to Transformer softmax-based heads (Zimerman et al., 26 May 2024).
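
A sketch of the implicit attention weights carried by this closed form: entry $(t, \tau)$ holds the cumulative forget-gate product applied to input $\tau$ at step $t$. Reducing over channels by a mean to obtain a scalar attention map is an illustrative assumption, not a prescribed extraction rule.

```python
import numpy as np

def implicit_attention(f):
    """f: (T, d) forget gates in (0, 1). Returns a (T, T) lower-triangular map."""
    T, d = f.shape
    A = np.zeros((T, T))
    for t in range(T):
        for tau in range(t + 1):
            # prod_{k=tau+1}^{t} f_k, reduced to a scalar over channels.
            A[t, tau] = np.prod(f[tau + 1:t + 1], axis=0).mean()
    return A
```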

5. Empirical Results and Applications

GLA mechanisms demonstrate empirical gains across modalities:

  • Language modeling: GatedFWA reduces validation loss relative to LLaMA+SWA and SSM baselines, scaling throughput linearly with sequence length and achieving a $\sim 30\times$ speedup at $N = 64$K (Liu et al., 8 Dec 2025).
  • Vision: ViG matches the accuracy of DeiT-B with only 27% of the parameters and 20% of the FLOPs, running $2\times$ faster at $224 \times 224$ resolution; SAGA enhances semantic diversity and improves top-1 accuracy by 4.4% over PVT-T on ImageNet-1K (Liao et al., 28 May 2024, Cao et al., 16 Sep 2025).
  • Diffusion modeling: DiG achieves $2.5\times$ faster training and 75.7% lower GPU memory at $1792$ resolution compared to DiT-S/2, and is $4.2\times$ faster than Mamba-based models at $1024$ resolution (Zhu et al., 28 May 2024).
  • Speech and sequence: FLASepformer with Gated FLA matches state-of-the-art separation quality at $1.91\times$ speed while consuming only 20.9% of the GPU memory of SepReformer-B (Wang et al., 27 Aug 2025), and Lina-Speech achieves competitive zero-shot TTS/voice-cloning performance with $5\times$ fewer parameters (Lemerle et al., 30 Oct 2024).
  • Recommendation: GRELA with SiLU gating enables significant memory reduction and accuracy improvement over LinRec and Mamba4Rec (Hu et al., 16 Jun 2025).

GLA modules also facilitate integration with sparse attention mechanisms (e.g., NSA) and generalization to time-series, code, and event-stream domains (Liu et al., 8 Dec 2025).

6. Design Variants and Refinements

Numerous GLA variants have been explored:

  • Refined Gating (ReGLA): Combines normalized exponential feature mapping, LayerNorm stabilization, and gradient-preserving gates to close the gap to softmax attention (Lu et al., 3 Feb 2025).
  • Selective Adaptive Gating (SAGA): Uses Hadamard-product factorization to achieve full-rank context repositories, breaking conventional linear attention's low-rank constraint (Cao et al., 16 Sep 2025).
  • Gated Rotary Enhancement (GRELA): Combines rotary positional encoding with SiLU-based gating in recommendation models, distinguishing local vs long-term user behavior (Hu et al., 16 Jun 2025).
  • GLU Attention: Adds Gated Linear Units in the value path of standard attention, yielding nonlinear dynamic gating and consistent convergence/accuracy gains at zero parameter overhead (Wang, 16 Jun 2025).
  • Gated Focused Linear Attention (GFLA): Merges focused attention kernels, depthwise convolution, and gating for stability and expressivity in audio separation (Wang et al., 27 Aug 2025).

Table: Representative GLA architectures and their gating types

| Model/Variant | Gating Function  | Domain         |
|---------------|------------------|----------------|
| GatedFWA      | ELU + softplus   | Language       |
| ReGLA         | Refined sigmoids | Language       |
| SAGA          | Hadamard/sigmoid | Vision         |
| GRELA         | SiLU             | Recommendation |
| GFLA          | MLP-sigmoid      | Speech         |
| GLU Attention | GLU (SiLU)       | Multi-modal    |

7. Limitations, Open Directions, and Concluding Remarks

Challenges remain in scaling GLA designs to multi-billion parameter settings and extremely long contexts. Optimal choice of feature mapping, normalization strategy, gating parametrization, and their interaction with advanced compression or sparse attention methods continue to be subjects of investigation (Lu et al., 3 Feb 2025, Zhu et al., 28 May 2024).

GLA offers a mathematically principled, hardware-aware path toward efficient global context modeling, unifying RNN-like recurrences and attention-based models via gating. Its explicit, input-adaptive memory control fundamentally differentiates GLA from previous linear-efficient mechanisms and is empirically validated across text, vision, audio, and recommendation workloads. Continued refinement in gating algorithms, kernel choices, and further theoretical analysis of attention landscape and optimization properties will define the next generation of efficient transformer alternatives.
