Gated Linear RNNs: Efficient Sequence Models
- Gated Linear RNNs are sequence models that combine linear recurrences with data-dependent gating to achieve efficient training and robust extrapolation.
- They unify a spectrum of architectures—including HGRN, RG-LRU, and Gated Slot Attention—demonstrating competitive performance and hardware efficiency.
- Their design enables implicit self-attention through dynamic gating and structured recurrences, leading to linear time scaling and enhanced interpretability.
Gated Linear Recurrent Neural Networks (Gated Linear RNNs, GLRNNs) are a class of sequence models that combine linear recurrences with data-dependent gating, yielding architectures with highly efficient training and inference, robust extrapolation, and a strong connection to various forms of (linear) self-attention. Recent advances have unified a spectrum of models under the GLRNN regime, including HGRN, HGRN2, RG-LRU, Griffin, Gated Slot Attention, and others. These architectures have demonstrated competitive or superior performance to Transformers on multiple sequence modeling tasks, while offering hardware efficiency, explainability, and theoretical tractability.
1. Core Principles and Model Formulations
The central design of Gated Linear RNNs comprises two components: a (linear or affine) diagonal recurrence and data-dependent element-wise gates. A canonical formulation is

$$h_t = f_t \odot h_{t-1} + i_t \odot x_t,$$

where $f_t$ and $i_t$ are "forget" and "input" gates (typically parameterized via sigmoid nonlinearities applied to projections of the input), $x_t$ is the input at time $t$, and $\odot$ denotes element-wise multiplication. Some variants exploit pure addition with ReLU gating instead of standard sigmoids and multiplications (Brännvall et al., 2023).
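As a concrete illustration, here is a minimal NumPy sketch of this recurrence, with sigmoid gates computed from input projections `Wf` and `Wi` (hypothetical names; published models use various gate parameterizations):

```python
import numpy as np

def gated_linear_recurrence(x, Wf, Wi, h0=None):
    """Run the diagonal gated linear recurrence h_t = f_t * h_{t-1} + i_t * x_t.

    Gates are data-dependent, computed from the input via sigmoid projections.
    Shapes: x is (L, d); Wf and Wi are (d, d); returns hidden states (L, d).
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    L, d = x.shape
    h = np.zeros(d) if h0 is None else h0
    out = np.empty((L, d))
    for t in range(L):
        f_t = sigmoid(x[t] @ Wf)   # forget gate, element-wise in (0, 1)
        i_t = sigmoid(x[t] @ Wi)   # input gate, element-wise in (0, 1)
        h = f_t * h + i_t * x[t]   # diagonal (element-wise) recurrence
        out[t] = h
    return out
```

The sequential loop is shown for clarity; in practice the linear structure lets the same computation be evaluated with a parallel scan.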
The gates control the flow of information and dynamic memory retention. Notably, the states can often be shown to be convex, data-dependent, per-component mixtures of past inputs, making the recurrences equivalent to a form of implicit attention (Lee et al., 2017, Zucchet et al., 2023, Zimerman et al., 2024).
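This mixture view can be checked numerically. The sketch below (a hypothetical helper, not code from the cited works) materializes the per-component mixing weights $w_{t,s} = i_s \prod_{k=s+1}^{t} f_k$ implied by unrolling the recurrence:

```python
import numpy as np

def implicit_attention_weights(f, i):
    """Per-component mixing weights w[t, s] = i[s] * prod_{k=s+1..t} f[k].

    f, i: gate values of shape (L, d), entries in (0, 1). Returns W of shape
    (L, L, d) such that h[t] = sum_s W[t, s] * x[s] reproduces the gated
    linear recurrence h_t = f_t * h_{t-1} + i_t * x_t.
    """
    L, d = f.shape
    W = np.zeros((L, L, d))
    for t in range(L):
        for s in range(t + 1):
            # Cumulative forget-gate decay between step s and step t.
            decay = np.prod(f[s + 1:t + 1], axis=0) if s < t else np.ones(d)
            W[t, s] = i[s] * decay
    return W
```

Because every weight is a product of gate values in $(0,1)$, each state component is a data-dependent weighted sum of past inputs, which is the implicit-attention reading of the model.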
Modern GLRNNs generalize this basic structure by:
- Including complex-valued recurrences and fixed relative-position rotations (e.g., HGRN) (Qin et al., 2023),
- Employing bilinear gates and output projections to realize exact linear self-attention layers (Zucchet et al., 2023),
- Expanding hidden state via structured or outer-product growth for enhanced expressivity (e.g., HGRN2) (Qin et al., 2024),
- Integrating gating into linear attention (GLA) or slot-based architectures to improve recall and memory efficiency (Zhang et al., 2024).
In all cases, the architectures support $O(L)$-time training and $O(1)$-per-token inference per layer for a sequence of length $L$ with hidden size $d$, and support efficient parallelization.
2. Hierarchically Gated RNNs and State Expansion
The Hierarchically Gated Recurrent Neural Network (HGRN) introduced a layered gating hierarchy, wherein each layer $l$ enforces a monotonic, learnable lower bound $\gamma^{(l)}$ on the data-dependent forget gate:

$$f_t = \gamma^{(l)} + (1 - \gamma^{(l)}) \odot \tilde{f}_t,$$

where $\tilde{f}_t$ is the raw sigmoid gate. The lower bound is computed as a cumulative sum over a softmax across layers, enforcing that lower layers emphasize local/short-term dependencies, while higher layers capture longer-term context (Qin et al., 2023).
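A minimal sketch of this lower-bound computation, assuming an exclusive cumulative sum over a layer-wise softmax (the exact HGRN parameterization may differ in details):

```python
import numpy as np

def hgrn_lower_bounds(beta):
    """Monotone per-layer lower bounds on the forget gate.

    A softmax over layers followed by an exclusive cumulative sum yields
    0 = gamma[0] <= gamma[1] <= ... < 1, so deeper layers are forced to
    retain more of their state (longer-term memory).
    beta: learnable logits of shape (num_layers, d). Returns (num_layers, d).
    """
    e = np.exp(beta - beta.max(axis=0, keepdims=True))
    p = e / e.sum(axis=0, keepdims=True)   # softmax across the layer axis
    gamma = np.cumsum(p, axis=0) - p       # exclusive cumulative sum
    return gamma

def bounded_forget_gate(gamma_l, raw_gate):
    """Forget gate with lower bound: f = gamma + (1 - gamma) * sigmoid(raw)."""
    return gamma_l + (1.0 - gamma_l) / (1.0 + np.exp(-raw_gate))
```

The bound guarantees the effective forget gate never drops below $\gamma^{(l)}$, which is what prevents higher layers from collapsing to purely local behavior.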
HGRN2 extends this by state expansion using an outer-product mechanism:

$$S_t = \mathrm{Diag}(f_t)\, S_{t-1} + (1 - f_t) \otimes x_t.$$

The hidden state becomes matrix-valued (up to $d \times d$), drastically increasing capacity with negligible additional parameters. HGRN2 thereby matches the expressiveness of linear attention mechanisms, while supporting hardware-efficient training using blockwise matrix operations (Qin et al., 2024).
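A sketch of one matrix-valued update step in this spirit (hypothetical function name; the exact HGRN2 read/write parameterization may differ):

```python
import numpy as np

def hgrn2_step(S, f_t, x_t, q_t):
    """One step of an outer-product state expansion in the spirit of HGRN2.

    S:   matrix-valued state, shape (d, d)
    f_t: forget gate in (0, 1), shape (d,)
    x_t: input/value vector, shape (d,)
    q_t: query/read-out vector, shape (d,)
    The update decays each state row by the forget gate and writes the
    rank-1 outer product of the input gate (1 - f_t) with the input; the
    output reads the state with the query vector.
    """
    S = f_t[:, None] * S + np.outer(1.0 - f_t, x_t)
    y = q_t @ S
    return S, y
```

Tying the input gate to $1 - f_t$ keeps the parameter count essentially unchanged while the state grows from a vector to a matrix.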
3. Connection to Attention and Unified Implicit Causal Self-Attention
GLRNNs are mathematically equivalent to implicit, data-dependent causal self-attention layers, as shown by a mapping of the form

$$y = G \odot \big(A\,(C * x)\big) \odot z,$$

where $G$ is a diagonal gating matrix, $A$ is a lower-triangular, data-dependent attention matrix parameterized via the recurrence, $z$ is an activation branch, and $C * x$ denotes a causal filter (e.g., Conv1D) applied to the input (Zimerman et al., 2024).
This framework encompasses Mamba, RWKV, Griffin (RG-LRU), GateLoop, HGRN2, and related models, unifying them as sub-quadratic, attention-like operators. It has led to explainability methods for GLRNNs analogous to Transformer-based attention visualization, attribution, and rollout procedures, with empirical evidence that these provide faithful and sharp relevance maps in both vision and language contexts (Zimerman et al., 2024).
Gated Linear RNNs can exactly implement (linear) self-attention by orchestrating their gating, state, and output projections to replicate the "key–value–query" structure of attention. Gradient descent converges on this solution in practice, as observed empirically (Zucchet et al., 2023).
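The key–value–query correspondence is easiest to see in the simplest case: an ungated, matrix-state linear recurrence computes causal (unnormalized) linear attention exactly. The sketch below illustrates this equivalence; it is not the specific construction of Zucchet et al.:

```python
import numpy as np

def linear_attention_via_recurrence(q, k, v):
    """Causal (unnormalized) linear attention computed as a linear RNN.

    The matrix state accumulates key-value outer products, so
    y_t = q_t @ S_t equals sum_{s<=t} (q_t . k_s) v_s -- the
    key-value-query structure a GLRNN can replicate with bilinear inputs.
    q, k, v: arrays of shape (L, d). Returns y of shape (L, d).
    """
    L, d = q.shape
    S = np.zeros((d, d))              # state: running sum of k_s v_s^T
    y = np.empty((L, d))
    for t in range(L):
        S = S + np.outer(k[t], v[t])  # linear (gate-free) state update
        y[t] = q[t] @ S               # read-out with the query
    return y
```

Adding data-dependent forget gates to the state update then yields the gated variants (GLA, HGRN2, and relatives) rather than plain linear attention.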
4. Architectural Spectrum and Variants
The GLRNN design space encompasses:
- Purely additive networks (Recurrent Additive Networks, RANs): linear recurrences without nonlinear state transitions; performance on par with LSTMs for language modeling, at lower parameter and compute cost (Lee et al., 2017).
- Bilinear and multiplicative gates: Used in attention-equivalent GLRNNs to manufacture exact key, value, and query representations (Zucchet et al., 2023).
- Addition-based Gated RNNs: Replace sigmoid and multiplication with ReLU and addition, providing substantial computational savings in plaintext and homomorphic encryption settings, while preserving long-term memory and sequence learning power (Brännvall et al., 2023).
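One simple way to realize multiplication-free gating with ReLU and addition is a shifted-threshold gate, sketched below (a hypothetical illustration; the exact mechanism of Brännvall et al. may differ):

```python
import numpy as np

def additive_relu_gate(h, g):
    """Multiplication-free gating: ReLU(h + g) with g <= 0 as a soft threshold.

    Components of h smaller than -g are zeroed out, larger ones pass through
    shifted. Because only addition and ReLU are used, no ciphertext-ciphertext
    multiplication is needed in a homomorphic-encryption setting.
    """
    return np.maximum(h + g, 0.0)
```

The gate value `g` plays the role the multiplicative sigmoid gate plays elsewhere: a strongly negative `g` suppresses the channel, while `g` near zero lets it pass.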
- RG-LRU (Griffin): Employs a recurrence gate and input gate; integrates into hybrid architectures with local attention, achieving hardware-efficient training, excellent extrapolation, and high accuracy at scale (De et al., 2024).
- Slot attention and bounded-memory extensions: Gated Slot Attention (GSA) maintains a matrix-valued slot state and realizes a two-pass Gated Linear Attention, reading slots via a softmax to stabilize gradients and enhance recall (Zhang et al., 2024).
5. Empirical Performance and Scaling Properties
GLRNNs achieve highly competitive results on a wide range of benchmarks, often equaling or outperforming efficient Transformer derivatives and prior RNNs:
| Model | Language Modeling (WT-103, PPL) | Long-Range Arena (Avg. Acc) | ImageNet-1k (Top-1%) |
|---|---|---|---|
| Transformer | 24.40 (val) / 24.78 (test) | – | 72.20 (DeiT-Tiny) |
| HGRN | 24.14 / 24.82 | 86.91 | 74.40 (Tiny) |
| HGRN2 | 23.10 / 23.73 | 87.66 | 75.39 (Tiny) |
In addition to maintaining state-of-the-art accuracy, GLRNNs exhibit:
- Linear time and memory scaling ($O(L)$) in sequence length $L$, compared to the quadratic scaling of full self-attention,
- Superior extrapolation: stable performance at much longer sequence lengths than observed during training,
- Hardware efficiency: high throughput and low latency at inference due to fixed-state size and diagonal (or block-diagonal) recurrences,
- Minimal to no regularization requirements; training with Adam or AdamW, warmup schedules, and weight decay is typically sufficient (Qin et al., 2023, Qin et al., 2024, De et al., 2024).
6. Explainability and Theoretical Interpretations
GLRNNs allow for direct extraction of implicit attention matrices from the structure of their recurrences. The unified theory provides:
- Explicit attention-weight visualization for attributions,
- Attention rollout and propagation methods,
- Analysability of long-range dependencies via the gate-product unrolling (Zimerman et al., 2024),
- Weighted-sum interpretation of state as data-dependent mixing of past inputs, supporting interpretability (Lee et al., 2017).
The implicit self-attention structure bridges architectural and theoretical gaps between RNNs and Transformers, facilitating method transfer and automated analysis tools.
7. Extensions, Limitations, and Future Directions
Contemporary models extend GLRNNs with:
- Expanded state/memory via outer products, multi-head splitting, or slot-based storage (Qin et al., 2024, Zhang et al., 2024),
- Hybridization with local/global attentional blocks (e.g., Griffin),
- Efficient quantization/hardware specialization, especially in addition-based variants.
Challenges remain in further improving recall-heavy task performance, dynamic capacity allocation, integration of adaptive local attention, and architecture-agnostic explainability.
Ongoing empirical work, particularly on Gated Slot Attention and large-scale hybrids, continues to push the performance, scaling, and interpretability boundaries of GLRNNs across modality and benchmark suites (Zhang et al., 2024, De et al., 2024). The consensus is that Gated Linear RNNs offer a highly efficient, theoretically principled, and practically effective alternative to both classical RNNs and the Transformer family.