
Gated Linear RNNs: Efficient Sequence Models

Updated 23 February 2026
  • Gated Linear RNNs are sequence models that combine linear recurrences with data-dependent gating to achieve efficient training and robust extrapolation.
  • They unify a spectrum of architectures—including HGRN, RG-LRU, and Gated Slot Attention—demonstrating competitive performance and hardware efficiency.
  • Their design enables implicit self-attention through dynamic gating and structured recurrences, leading to linear time scaling and enhanced interpretability.

Gated Linear Recurrent Neural Networks (Gated Linear RNNs, GLRNNs) are a class of sequence models that combine linear recurrences with data-dependent gating, yielding architectures with highly efficient training and inference, robust extrapolation, and a strong connection to various forms of (linear) self-attention. Recent advances have unified a spectrum of models under the GLRNN regime, including HGRN, HGRN2, RG-LRU, Griffin, Gated Slot Attention, and others. These architectures have demonstrated competitive or superior performance to Transformers on multiple sequence modeling tasks, while offering hardware efficiency, explainability, and theoretical tractability.

1. Core Principles and Model Formulations

The central design of Gated Linear RNNs comprises two components: a (linear or affine) diagonal recurrence and data-dependent element-wise gates. A canonical formulation is

h_t = f_t \odot h_{t-1} + i_t \odot x_t,

where f_t and i_t are "forget" and "input" gates (typically with f_t, i_t \in (0,1)^d, parameterized via sigmoid nonlinearities) and x_t is the input at time t. Some variants exploit pure addition with ReLU gating instead of standard sigmoids and multiplications (Brännvall et al., 2023).
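The following minimal NumPy sketch illustrates this recurrence evaluated step by step; the sigmoid gate projections Wf, Wi and the shapes are illustrative assumptions rather than any specific published parameterization.

```python
import numpy as np

def gated_linear_scan(x, Wf, Wi, bf, bi):
    """Sequential reference for h_t = f_t * h_{t-1} + i_t * x_t.

    x: (T, d) inputs; Wf, Wi: (d, d) gate projections; bf, bi: (d,) biases.
    Gates are sigmoids of affine functions of x_t (one common parameterization).
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    T, d = x.shape
    h = np.zeros(d)
    hs = np.empty((T, d))
    for t in range(T):
        f_t = sigmoid(x[t] @ Wf + bf)   # forget gate in (0, 1)^d
        i_t = sigmoid(x[t] @ Wi + bi)   # input gate in (0, 1)^d
        h = f_t * h + i_t * x[t]        # element-wise (diagonal) linear recurrence
        hs[t] = h
    return hs
```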

The gates control the flow of information and dynamic memory retention. Notably, the states can often be shown to be convex, data-dependent, per-component mixtures of past inputs, making the recurrence equivalent to a form of implicit attention (Lee et al., 2017, Zucchet et al., 2023, Zimerman et al., 2024).
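To make this implicit-attention reading concrete, the mixture weights can be recovered from the gates alone: the contribution of x_s to h_t is i_s scaled by the product of forget gates between steps s and t. A small illustrative sketch follows (the weights form strictly convex mixtures only under particular gate couplings, e.g. i_t = 1 - f_t):

```python
import numpy as np

def unrolled_state(fs, is_, x):
    """Recover h_t as a data-dependent weighted sum of past inputs.

    fs, is_, x: (T, d) forget gates, input gates, and inputs.
    The weight of x_s in h_t is  w[t, s] = i_s * prod_{k=s+1..t} f_k,
    which reads as a per-channel causal attention pattern.
    """
    T, d = x.shape
    h = np.zeros((T, d))
    for t in range(T):
        for s in range(t + 1):
            w_ts = is_[s] * np.prod(fs[s + 1 : t + 1], axis=0)  # gate-product weight
            h[t] += w_ts * x[s]
    return h  # matches the sequential recurrence up to floating-point error
```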

Modern GLRNNs generalize this basic structure through lower-bounded or hierarchical gate parameterizations, expanded (matrix-valued) states built from outer products, and hybridization with local attention, as detailed in the following sections.

In all cases, the architectures are designed for O(Td) inference and training per layer for a sequence of length T with hidden size d, and support efficient parallelization.
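The linear-time and parallelism claims can be illustrated by treating each step as an affine map h ↦ f_t ⊙ h + u_t (with u_t = i_t ⊙ x_t) and composing prefixes in logarithmic depth; the sketch below is a generic Hillis–Steele-style scan in NumPy, not the blockwise kernels used in the cited papers.

```python
import numpy as np

def parallel_linear_scan(f, u):
    """Compute h_t = f_t * h_{t-1} + u_t for all t via log-depth doubling.

    f, u: (T, d) arrays; returns h of shape (T, d), assuming h_0 = 0.
    Each position holds an affine map h -> a*h + b; composing maps that are
    progressively further apart yields all prefixes in O(log T) passes.
    """
    a, b = f.copy(), u.copy()
    T = f.shape[0]
    shift = 1
    while shift < T:
        a_prev = np.concatenate([np.ones_like(a[:shift]), a[:-shift]], axis=0)
        b_prev = np.concatenate([np.zeros_like(b[:shift]), b[:-shift]], axis=0)
        b = a * b_prev + b     # compose with the map `shift` steps earlier
        a = a * a_prev
        shift *= 2
    return b
```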

2. Hierarchically Gated RNNs and State Expansion

The Hierarchically Gated Recurrent Neural Network (HGRN) introduced a layered gating hierarchy, wherein each layer k enforces a monotonic, learnable lower bound \gamma^k on the data-dependent forget gate:

\lambda_t = \gamma^k + (1 - \gamma^k) \odot \tilde{\mu}_t, \quad \tilde{\mu}_t = \sigma(x_t W_{\mu} + b_{\mu}).

The lower bound \gamma^k is computed as a cumulative sum over a softmax across layers, enforcing that lower layers emphasize local/short-term dependencies, while higher layers capture longer-term context (Qin et al., 2023).
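A sketch of how such layer-wise lower bounds can be produced is given below; the softmax-plus-cumulative-sum construction follows the description above, but details such as forcing the first layer's bound to exactly zero are assumptions here.

```python
import numpy as np

def hgrn_lower_bounds(beta):
    """Monotone per-layer lower bounds for the forget gate (HGRN-style).

    beta: (L, d) unconstrained parameters, one row per layer.
    A softmax over the layer axis followed by an (exclusive) cumulative sum
    gives 0 <= gamma^1 <= ... <= gamma^L < 1, so deeper layers are forced
    to retain more state and thus model longer-range dependencies.
    """
    p = np.exp(beta - beta.max(axis=0, keepdims=True))
    p = p / p.sum(axis=0, keepdims=True)      # softmax across layers
    return np.cumsum(p, axis=0) - p           # exclusive cumulative sum

def hgrn_forget_gate(gamma_k, x_t, W_mu, b_mu):
    """lambda_t = gamma^k + (1 - gamma^k) * sigmoid(x_t W_mu + b_mu)."""
    mu = 1.0 / (1.0 + np.exp(-(x_t @ W_mu + b_mu)))
    return gamma_k + (1.0 - gamma_k) * mu
```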

HGRN2 extends this by state expansion using an outer product mechanism:

H_t = \mathrm{Diag}(f_t) H_{t-1} + (1 - f_t) \otimes i_t, \quad y_t = o_t^\top H_t.

The hidden state becomes matrix-valued (up to d^2 entries), drastically increasing capacity with negligible additional parameters. HGRN2 thereby matches the expressiveness of linear attention mechanisms, while supporting hardware-efficient training using blockwise matrix operations (Qin et al., 2024).
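A minimal sketch of the expanded-state recurrence follows; the interpretation of i_t and o_t as key- and query-like vectors and the dimensions used here are illustrative assumptions.

```python
import numpy as np

def hgrn2_scan(f, i, o):
    """Outer-product state expansion: H_t = Diag(f_t) H_{t-1} + (1 - f_t) ⊗ i_t.

    f, i, o: (T, d) forget gates, input (key-like) and output (query-like) vectors.
    The state H_t is a (d, d) matrix; the layer output is y_t = o_t^T H_t.
    """
    T, d = f.shape
    H = np.zeros((d, d))
    ys = np.empty((T, d))
    for t in range(T):
        H = f[t][:, None] * H + np.outer(1.0 - f[t], i[t])  # Diag(f_t) H + rank-1 update
        ys[t] = o[t] @ H                                      # read-out o_t^T H_t
    return ys
```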

3. Connection to Attention and Unified Implicit Causal Self-Attention

GLRNNs are mathematically equivalent to implicit, data-dependent causal self-attention layers, as shown by the mapping

y = W_g(x)\, A(x)\, Z(x)\, M\, x,

where W_g(x) is a diagonal gating matrix, A(x) is a lower-triangular, data-dependent attention matrix parameterized via the recurrence, Z(x) is an activation branch, and M is a causal filter (e.g., Conv1D) (Zimerman et al., 2024).

This framework encompasses Mamba, RWKV, Griffin (RG-LRU), GateLoop, HGRN2, and related models, unifying them as sub-quadratic, attention-like operators. It has led to explainability methods for GLRNNs analogous to Transformer-based attention visualization, attribution, and rollout procedures, with empirical evidence that these provide faithful and sharp relevance maps in both vision and language contexts (Zimerman et al., 2024).

Gated Linear RNNs can exactly implement (linear) self-attention by orchestrating their gating, state, and output projections to replicate the "key–value–query" structure of attention. Gradient descent has been observed to converge to this solution in practice (Zucchet et al., 2023).
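The equivalence is easiest to see in the limiting case where the forget gate is fixed to one and the state accumulates key–value outer products; the toy sketch below reproduces unnormalized causal linear attention with a matrix-valued linear recurrence (the q, k, v sequences are assumed to come from learned projections, omitted for brevity).

```python
import numpy as np

def linear_attention_via_recurrence(q, k, v):
    """Causal (unnormalized) linear attention as a matrix-valued linear recurrence.

    q, k, v: (T, d) query, key, and value sequences.
    With S_t = S_{t-1} + k_t ⊗ v_t and y_t = q_t^T S_t, the output equals
    y_t = sum_{s<=t} (q_t · k_s) v_s, i.e. causal linear attention.
    """
    T, d = q.shape
    S = np.zeros((d, d))
    ys = np.empty((T, d))
    for t in range(T):
        S = S + np.outer(k[t], v[t])   # forget gate fixed to 1: pure accumulation
        ys[t] = q[t] @ S               # query read-out
    return ys
```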

4. Architectural Spectrum and Variants

The GLRNN design space encompasses:

  • Purely additive networks (RANs): h_t = f_t \odot h_{t-1} + i_t \odot x_t without nonlinear recurrences; performance on par with LSTMs for language modeling, at lower parameter and compute cost (Lee et al., 2017).
  • Bilinear and multiplicative gates: Used in attention-equivalent GLRNNs to manufacture exact key, value, and query representations (Zucchet et al., 2023).
  • Addition-based Gated RNNs: Replace sigmoid and multiplication with ReLU and addition, providing substantial computational savings in plaintext and homomorphic encryption settings, while preserving long-term memory and sequence learning power (Brännvall et al., 2023).
  • RG-LRU (Griffin): Employs a recurrence gate and input gate; integrates into hybrid architectures with local attention, achieving hardware-efficient training, excellent extrapolation, and high accuracy at scale (De et al., 2024); a sketch of this update appears after this list.
  • Slot attention and bounded-memory extensions: Gated Slot Attention (GSA) manages a matrix-valued slot state and is realized as a two-pass Gated Linear Attention computation; retaining a softmax stabilizes gradients and enhances recall (Zhang et al., 2024).
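For concreteness, the sketch below follows the RG-LRU update as described in De et al. (2024); the constant c, the sigmoid parameterization of the base decay, and the omission of biases are assumptions and may differ from released implementations.

```python
import numpy as np

def rg_lru_step(h, x_t, Wr, Wi, lam, c=8.0):
    """One RG-LRU-style step (Griffin): gated, scale-preserving decay.

    h: (d,) previous state; x_t: (d,) current input.
    Wr, Wi: (d, d) projections for the recurrence and input gates.
    lam: (d,) unconstrained parameter; sigmoid(lam) is the base decay.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    r_t = sigmoid(x_t @ Wr)                  # recurrence gate
    i_t = sigmoid(x_t @ Wi)                  # input gate
    a_t = sigmoid(lam) ** (c * r_t)          # data-dependent decay a^(c * r_t)
    # sqrt(1 - a_t^2) keeps the state magnitude roughly constant over time
    return a_t * h + np.sqrt(1.0 - a_t ** 2) * (i_t * x_t)
```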

5. Empirical Performance and Scaling Properties

GLRNNs achieve highly competitive results on a wide range of benchmarks, often equaling or outperforming efficient Transformer derivatives and prior RNNs:

Model        | WikiText-103 PPL (val / test) | Long-Range Arena (avg. acc.) | ImageNet-1k (top-1 %)
Transformer  | 24.40 / 24.78                 | –                            | 72.20 (DeiT-Tiny)
HGRN         | 24.14 / 24.82                 | 86.91                        | 74.40 (Tiny)
HGRN2        | 23.10 / 23.73                 | 87.66                        | 75.39 (Tiny)
Griffin-14B  | –                             | –                            | –

In addition to maintaining state-of-the-art accuracy, GLRNNs exhibit:

  • Linear time and memory scaling (O(Td)) in sequence length, compared to the quadratic scaling of full self-attention,
  • Superior extrapolation: stable performance at much longer sequence lengths than observed during training,
  • Hardware efficiency: high throughput and low latency at inference due to fixed-state size and diagonal (or block-diagonal) recurrences,
  • Minimal to no regularization requirements; training with Adam or AdamW, warmup schedules, and weight decay is typically sufficient (Qin et al., 2023, Qin et al., 2024, De et al., 2024).

6. Explainability and Theoretical Interpretations

GLRNNs allow for direct extraction of implicit attention matrices from the structure of their recurrences. The unified theory provides:

  • Explicit attention-weight visualization for attributions,
  • Attention rollout and propagation methods,
  • Analysability of long-range dependencies via the gate-product unrolling (Zimerman et al., 2024),
  • Weighted-sum interpretation of state as data-dependent mixing of past inputs, supporting interpretability (Lee et al., 2017).

The implicit self-attention structure bridges architectural and theoretical gaps between RNNs and Transformers, facilitating method transfer and automated analysis tools.
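As an illustration of such extraction for the diagonal recurrence of Section 1, a per-layer relevance map can be assembled from gate products and then combined across layers with a rollout-style product; this is an illustrative recipe rather than the exact procedure of Zimerman et al. (2024).

```python
import numpy as np

def implicit_attention_map(fs, is_):
    """Lower-triangular map with A[t, s] = mean over channels of i_s * prod_{k=s+1..t} f_k.

    fs, is_: (T, d) forget and input gates of one layer.
    Averaging over channels yields a single T x T map suitable for plotting.
    """
    T, _ = fs.shape
    A = np.zeros((T, T))
    for t in range(T):
        for s in range(t + 1):
            A[t, s] = (is_[s] * np.prod(fs[s + 1 : t + 1], axis=0)).mean()
    return A

def attention_rollout(maps):
    """Rollout-style aggregation across layers: product of (A_l + I), as in Transformer rollout."""
    T = maps[0].shape[0]
    R = np.eye(T)
    for A in maps:
        R = (A + np.eye(T)) @ R
    return R
```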

7. Extensions, Limitations, and Future Directions

Contemporary models extend GLRNNs with:

  • Expanded state/memory via outer products, multi-head splitting, or slot-based storage (Qin et al., 2024, Zhang et al., 2024),
  • Hybridization with local/global attentional blocks (e.g., Griffin),
  • Efficient quantization/hardware specialization, especially in addition-based variants.

Challenges remain in further improving recall-heavy task performance, dynamic capacity allocation, integration of adaptive local attention, and architecture-agnostic explainability.

Ongoing empirical work, particularly on Gated Slot Attention and large-scale hybrids, continues to push the performance, scaling, and interpretability boundaries of GLRNNs across modality and benchmark suites (Zhang et al., 2024, De et al., 2024). The consensus is that Gated Linear RNNs offer a highly efficient, theoretically principled, and practically effective alternative to both classical RNNs and the Transformer family.
