
Real-Gated LRU for Efficient Sequence Modeling

Updated 29 January 2026
  • RG-LRU is a gated linear recurrent architecture that employs input-dependent, real-valued gates with a diagonal recurrence to enable fast, efficient sequence modeling.
  • It diverges from classical RNNs by removing hidden-state dependencies in gating, allowing parallelization and reducing per-step computational complexity to O(D).
  • Integrated within Hawk and Griffin, RG-LRU demonstrates competitive scaling, robust long-context extrapolation, and lower inference latency compared to transformer models.

The Real-Gated Linear Recurrent Unit (RG-LRU) is a gated linear recurrence architecture, central to the Hawk and Griffin LLMs, that combines memory efficiency, fast inference, and stable long-sequence modeling by leveraging real-valued gates and strictly input-dependent parameterization. RG-LRU is designed to enable scalable and hardware-efficient sequence modeling, diverging from classical recurrent architectures in both mathematical formulation and practical performance. Its formulation, implementation, and empirical characteristics are detailed in the context of recent hybrid LLM frameworks (De et al., 2024).

1. Mathematical Formulation

RG-LRU generalizes the gated linear recurrence paradigm, where the evolution of the hidden state is controlled by input-dependent gates and a diagonal recurrence, eschewing dependence on the previous hidden state for gating.

Let $x_t \in \mathbb{R}^D$ denote the input and $h_t \in \mathbb{R}^D$ the hidden state at time $t$. The recurrence and input gates are defined strictly as functions of the current input:

$$r_t = \sigma(W_a x_t + b_a), \qquad i_t = \sigma(W_x x_t + b_x),$$

where $W_a, W_x \in \mathbb{R}^{D \times D}$ and $b_a, b_x \in \mathbb{R}^D$. The diagonal kernel $a = \sigma(\Lambda) \in (0,1)^D$, parameterized via $\Lambda \in \mathbb{R}^D$, ensures a constrained, stable recurrence. The recurrence gate $f_t$ incorporates a scaling constant $c$ (set to 8 in practice):

$$f_t = a^{c\, r_t},$$

with the update equation

$$h_t = f_t \odot h_{t-1} + \sqrt{1 - f_t^2} \odot (i_t \odot x_t), \qquad y_t = h_t,$$

where all operations are elementwise and $\odot$ denotes the Hadamard product. The $\sqrt{1 - f_t^2}$ factor ensures that the squared coefficients of $h_{t-1}$ and $(i_t \odot x_t)$ sum to one, conferring layer-norm-like stability. Because the gates depend only on $x_t$, the recurrence admits efficient parallel-scan implementations and avoids the dynamic recurrent dependencies of traditional RNNs.
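The recurrence above can be sketched directly in NumPy. This is a minimal illustrative reference implementation (function and variable names are ours, not from the paper), processing one sequence sequentially:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rg_lru_forward(x, W_a, b_a, W_x, b_x, Lambda, c=8.0):
    """Sequential RG-LRU forward pass over a sequence x of shape (T, D).

    Gates depend only on the current input x_t; the recurrence is
    diagonal, so every state operation is elementwise in D.
    """
    T, D = x.shape
    a = sigmoid(Lambda)                    # diagonal kernel, a in (0, 1)^D
    h = np.zeros(D)
    ys = np.empty_like(x)
    for t in range(T):
        r_t = sigmoid(x[t] @ W_a.T + b_a)  # recurrence gate
        i_t = sigmoid(x[t] @ W_x.T + b_x)  # input gate
        f_t = a ** (c * r_t)               # f_t in (0, 1) by construction
        # norm-preserving update: f_t^2 + (sqrt(1 - f_t^2))^2 = 1
        h = f_t * h + np.sqrt(1.0 - f_t**2) * (i_t * x[t])
        ys[t] = h
    return ys
```

A production kernel would fuse these elementwise operations and keep $h_t$ on-chip; the loop above only fixes the semantics.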

2. Distinctions from Classical RNN Architectures

RG-LRU introduces several features that set it apart from standard RNNs, GRUs, and LSTMs in both computational and functional respects:

  • Recurrence and Gating: RG-LRU gates ($r_t$, $i_t$) depend solely on $x_t$, while classical units (Elman RNN, GRU, LSTM) use both $x_t$ and $h_{t-1}$ in nonlinear mixing gates. This design allows RG-LRU to support parallelization across time.
  • Parameterization: The core recurrence in RG-LRU is strictly diagonal ($a \in (0,1)^D$), as opposed to the dense recurrence matrices of traditional RNNs, which cost $O(D^2)$ per step.
  • State Evolution: Classical RNNs mix $x_t$ and $h_{t-1}$ via nonlinear transformations and sequential dependencies, leading to vanishing or exploding gradients over long sequences. RG-LRU's norm-preserving updates and fixed-size memory vector mitigate these issues.
  • Computational Complexity: RG-LRU requires $O(D)$ multiplications and additions per time step, versus $O(D^2)$ for classical RNNs, GRUs, and LSTMs. The fixed-size state avoids the $O(TD)$ memory scaling associated with transformer KV caches.
  • Stability: The formulation of $f_t$ guarantees stability, with $f_t \in (0,1)$ by construction.

This approach yields efficient, stable, and hardware-friendly recurrence suitable for large-scale sequence models.
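Because both $f_t$ and the input term depend only on $x_t$, each step of the recurrence $h_t = f_t \odot h_{t-1} + g_t$ (with $g_t = \sqrt{1 - f_t^2} \odot i_t \odot x_t$) can be precomputed independently and then combined with an associative operator, which is what makes logarithmic-depth parallel scans possible. The sketch below (names are illustrative) shows that combiner alongside a sequential reference:

```python
import numpy as np

def combine(left, right):
    """Associative combiner for the linear recurrence h_t = f_t * h_{t-1} + g_t.

    Composing step (f1, g1) followed by (f2, g2):
        h = f2 * (f1 * h_prev + g1) + g2
          = (f1 * f2) * h_prev + (f2 * g1 + g2)
    """
    f1, g1 = left
    f2, g2 = right
    return f1 * f2, f2 * g1 + g2

def linear_scan(f, g):
    """Sequential reference: h_t = f_t * h_{t-1} + g_t with h_0 = 0."""
    h = np.zeros_like(g[0])
    out = np.empty_like(g)
    for t in range(f.shape[0]):
        h = f[t] * h + g[t]
        out[t] = h
    return out

def scan_via_combine(f, g):
    """Cumulative application of the associative combiner.

    Shown left-to-right for clarity; because `combine` is associative,
    a real kernel can apply it in a logarithmic-depth tree
    (e.g. jax.lax.associative_scan).
    """
    acc = (f[0], g[0])
    out = [acc[1]]
    for t in range(1, f.shape[0]):
        acc = combine(acc, (f[t], g[t]))
        out.append(acc[1])
    return np.stack(out)
```

Classical RNNs cannot be rewritten this way, because their gates read $h_{t-1}$ and thus force strictly sequential evaluation.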

3. Integration Within Hawk and Griffin Architectures

Hawk employs RG-LRU as the central recurrence module within transformer-style residual blocks, forming a hybrid architecture. The RG-LRU block is preceded by RMSNorm and a small separable Conv1D (kernel size 4) acting as a pre-filter, followed by skip connections. The input $x_t$ is processed through a GeLU MLP and Conv1D before entering the RG-LRU module, ensuring rich input transformations without adding recurrent dependencies.

Parameterization uses LeCun initialization for all linear layers; the diagonal recurrence kernel $a = \sigma(\Lambda)$ is initialized so that $a^{c}$ lies in the interval $[0.9, 0.999]$ for controlled forgetting. Notably, RG-LRU operates strictly with real-valued states, distinguishing it from many recent state-space models (SSMs) that employ complex-valued recurrences; empirical results show that real-valued RG-LRU suffices for competitive language modeling.
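One plausible way to realize this initialization (the target interval $a^c \in [0.9, 0.999]$ is from the source; the uniform sampling scheme and function name are assumptions for illustration) is to sample the per-channel retention $a^c$ in the interval and invert the sigmoid:

```python
import numpy as np

def init_lambda(D, c=8.0, low=0.9, high=0.999, seed=0):
    """Initialize Lambda so that a^c = sigmoid(Lambda)^c lies in [low, high].

    Sample the per-channel retention a^c uniformly in [low, high]
    (sampling distribution is an assumption), invert the power to
    recover a, then apply the logit to recover Lambda.
    """
    rng = np.random.default_rng(seed)
    a_pow_c = rng.uniform(low, high, size=D)   # target value of a^c
    a = a_pow_c ** (1.0 / c)                   # a = (a^c)^(1/c)
    return np.log(a) - np.log1p(-a)            # logit(a) = log(a / (1 - a))
```

With $c = 8$, this places $a$ itself in roughly $[0.987, 0.9999]$, i.e. channels that forget slowly by default.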

The Griffin model further combines RG-LRU with local attention modules, yielding a hybrid capable of matching Llama-2 performance with significantly fewer training tokens (De et al., 2024).

4. Computational Complexity and Efficiency

RG-LRU provides substantial computational and memory advantages over both transformer-based attention and classical recurrent models, summarized as follows:

Component           Forward/Backward FLOPs (length $T$)   Memory Traffic
Global Attention    $O(T^2 D + T D^2)$                     $O(T^2 D)$
Local Attention     $O(T W D + T D^2)$                     $O(T W D)$
RG-LRU + Conv1D     $O(T D)$                               $O(T D)$

In inference, RG-LRU maintains a state of size $O(BD)$ per batch for sequence generation, avoiding the $O(BTD)$ KV-cache overhead of transformers. As the sequence length $T$ grows and $T \gg D$, RG-LRU models become increasingly advantageous, achieving lower decode latency and higher throughput because they are parameter-bound rather than cache-bound.
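A back-of-envelope calculation makes the gap concrete. The hypothetical helper below (it deliberately ignores attention heads, GQA, and Conv1D buffers) contrasts a transformer KV cache with a fixed-size recurrent state:

```python
def decode_state_bytes(batch, seq_len, d_model, n_layers, bytes_per_elem=2):
    """Illustrative decode-time state sizes (bfloat16 = 2 bytes per element).

    A transformer layer caches keys and values for every past token:
    O(B * T * D) per layer. A diagonal recurrent layer keeps one
    fixed-size state vector per sequence: O(B * D) per layer.
    """
    kv_cache = 2 * batch * seq_len * d_model * n_layers * bytes_per_elem
    rnn_state = batch * d_model * n_layers * bytes_per_elem
    return kv_cache, rnn_state
```

Under these simplified assumptions, at batch 1 with an 8192-token context, $D = 4096$, and 32 layers, the KV cache is $2T = 16384$ times larger than the recurrent state (4 GiB vs. 256 KiB in bfloat16), and it keeps growing with every generated token while the recurrent state does not.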

5. Empirical Performance Characteristics

RG-LRU, instantiated in Hawk and Griffin, demonstrates favorable scaling and downstream task performance:

  • Scaling Laws: Griffin matches or slightly surpasses Llama-2 13B in held-out loss at a fraction of the training-token budget (300B vs. 2T tokens). All model families show power-law scaling with respect to training FLOPs, but Griffin attains the lowest perplexity under matched compute.
  • Downstream Benchmarks: RG-LRU-based models (Hawk, Griffin) outperform or match Mamba, Transformer, and Llama-2 counterparts of similar parameter counts on MMLU, HellaSwag, PIQA, WinoGrande, ARC-E, and ARC-C. For example, Griffin-14B achieves 49.5 on MMLU and 81.4 on HellaSwag, comparable to or slightly higher than Llama-2 13B.
  • Long-Context Extrapolation: RG-LRU-equipped models (Griffin, Hawk) trained on 2048-token contexts continue to improve in loss on sequences 4–16× longer, whereas transformers with RoPE saturate beyond the training context.
  • Inference Efficiency: In 1B-parameter settings, decode latency remains flat with increasing sequence length, in contrast to the sharply increasing latency of transformers, yielding up to 2× faster inference and 2–3× the throughput at long contexts.

6. Implementation Best Practices

RG-LRU modules are strongly memory-bound. Optimal deployment leverages the following practices:

  • Parallelism: Large dense layers are sharded as in Megatron architectures. For the RG-LRU gates ($W_a$, $W_x$), block-diagonal parameter matrices (e.g., 16 blocks) enable device-level partitioning without cross-device communication. Separable Conv1D pre-filters are trivially sharded by channel.
  • Optimizer and Activation Precision: ZeRO stage-2/3 is used to shard optimizer states and gradients across data-parallel replicas. Using bfloat16 for weights/activations effectively halves memory traffic.
  • Hardware Kernels: Efficient RG-LRU computation requires fused kernels (Pallas or CUDA) that perform a linear scan over $t = 1 \ldots T$, keeping $h_t$ in fast on-chip memory and fusing gate computations with the state update. This yields up to a 3× speedup over naive JAX scan implementations and reduces wall-clock training time by 10–20%.
  • Attention Integration: For hybrid Griffin models, fixed local attention windows (e.g., 1024 when training with 2048 context) achieve nearly equivalent loss to global attention at lower cost.
  • Framework Integration: JAX implementations benefit from XLA custom calls; on PyTorch, analogous CUDA kernels are recommended. Pre-normalization with RMSNorm or LayerNorm is essential before RG-LRU blocks. Weight tying between embeddings and LLM head is standard for memory efficiency.
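The block-diagonal gate parameterization mentioned under Parallelism can be illustrated in a few lines: a block-diagonal matrix-vector product decomposes exactly into independent per-block products, so each device needs only its own blocks (a sketch with illustrative names, not the paper's code):

```python
import numpy as np

def block_diagonal_matmul(x, blocks):
    """Apply a block-diagonal weight matrix as independent per-block matmuls.

    If W_a (or W_x) is constrained to be block-diagonal with, say, 16
    blocks, each device can hold only its own blocks and compute its
    slice of the gate pre-activation with no cross-device communication.
    """
    chunk = x.shape[-1] // len(blocks)
    outs = [x[..., i * chunk:(i + 1) * chunk] @ b.T
            for i, b in enumerate(blocks)]
    return np.concatenate(outs, axis=-1)
```

The result is bit-identical to multiplying by the assembled full block-diagonal matrix, which is what makes the sharding communication-free.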

7. Significance and Context

RG-LRU advances the scalability of sequence modeling by making efficient, norm-stable, and parallelizable recurrence accessible at the scale of modern LLMs. By eliminating hidden-state-dependent gating, using strictly real-valued and diagonal recurrence, and emphasizing hardware alignment, RG-LRU achieves competitive or superior results to transformer models in downstream performance, scaling, and hardware efficiency. Integration in Hawk and Griffin illustrates its capacity for robust long-context extrapolation and throughput, establishing it as a practical alternative or complement to attention-based methods in efficient large-scale language modeling (De et al., 2024).
