Linear Recurrent Units (LRUs)
- Linear Recurrent Units (LRUs) are defined by a linear hidden-state update using matrices A and B, enabling closed-form unrolling and efficient parallel-scan implementations.
- Variants like diagonal, block-diagonal, and higher-order LRUs incorporate gating and nonlinearity to boost expressivity and stability across diverse sequence tasks.
- Empirical studies demonstrate that LRUs can outperform traditional RNNs and SSMs in long-range modeling while reducing computational costs in time-series, language, and video applications.
Linear Recurrent Units (LRUs) are a foundational class of neural architectures that implement hidden-state evolution via a linear recurrence, optionally combined with parallelization-enabling parameterizations and lightweight gating mechanisms. They unify properties of classical recurrent neural networks (RNNs) and state-space models (SSMs), and underpin algorithmic, efficiency, and expressivity advances across time-series, language, video, RL, and recommender tasks.
1. Mathematical Definition and Core Variants
At their core, LRUs evolve a hidden state using a linear transformation of the previous state and the current input:

$$h_t = A h_{t-1} + B x_t,$$

where $x_t \in \mathbb{R}^{d}$ is the input at time $t$, $h_t \in \mathbb{R}^{n}$ or $\mathbb{C}^{n}$ is the hidden state, $A \in \mathbb{C}^{n \times n}$ (often complex, diagonal, or block-diagonal), and $B \in \mathbb{C}^{n \times d}$, with $A$ and $B$ learned or partially fixed. The output is typically $y_t = C h_t$ with $C$ linear.
Variants exist:
- Diagonal/Diagonalizable LRU: $A$ is diagonalizable, e.g., $A = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$, enabling parallel scan and stable long-sequence computation. The polar representation $\lambda_j = \nu_j e^{i\theta_j}$ is standard (Ling et al., 2 Feb 2026, Yue et al., 2023, Liu et al., 11 Apr 2025).
- Block-Diagonal (BD-LRU): $A$ is block-diagonal, allowing dense mixing within subspaces and increased expressivity at moderate computational cost (Dubinin et al., 12 Feb 2026).
- Higher-Order LRU (H-LRU): Recurrence leverages the last $k$ hidden states, generalizing classic lag-one updating to higher-order autoregression (Dubinin et al., 12 Feb 2026).
- Gated LRU/Behavior-Dependent LRU: Per-timestep gates modulate state- and input-mixing coefficients. Input and recurrence gates are computed as affine or MLP projections of $x_t$, then mapped via sigmoid/exp (Pătrăucean et al., 2024, Liu et al., 2024).
- Lattice Recurrent Unit: Two-way (depth/time) coupled LRUs with decoupled flows, allowing preservation of information along both axes in multilayer stacks (Ahuja et al., 2017).
Closed-form unrolling is always possible:

$$h_t = A^{t} h_0 + \sum_{k=1}^{t} A^{t-k} B x_k.$$

This formula provides the mathematical avenue for parallel (e.g., scan-based) implementation.
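To make the recurrence and its closed-form unrolling concrete, here is a minimal NumPy sketch of a diagonal LRU. All dimensions, the stable polar parameterization, and the random weights are illustrative assumptions, not taken from any of the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n = 12, 4, 8                      # sequence length, input dim, state dim (illustrative)

# Stable diagonal A: eigenvalues inside the unit circle, in polar form.
nu = rng.uniform(0.1, 2.0, n)
theta = rng.uniform(0, 2 * np.pi, n)
lam = np.exp(-nu + 1j * theta)          # |lam| = exp(-nu) < 1
B = rng.standard_normal((n, d)) + 0j
x = rng.standard_normal((T, d))

# Sequential recurrence: h_t = lam * h_{t-1} + B x_t (elementwise, since A is diagonal).
h = np.zeros(n, dtype=complex)
hs = []
for t in range(T):
    h = lam * h + B @ x[t]
    hs.append(h)
hs = np.stack(hs)

# Closed-form unrolling with h_0 = 0: h_t = sum_{k<=t} lam^(t-k) * (B x_k).
t_idx = T - 1
closed = sum(lam ** (t_idx - k) * (B @ x[k]) for k in range(t_idx + 1))
assert np.allclose(hs[t_idx], closed)
```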
2. Diagonalization and Parallelization
For general $A$, sequential computation is required. If $A$ is diagonalizable ($A = P \Lambda P^{-1}$), then $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ is diagonal and the recurrence decomposes componentwise in the eigenbasis $\tilde{h}_t = P^{-1} h_t$:

$$\tilde{h}_{t,j} = \lambda_j \, \tilde{h}_{t-1,j} + (P^{-1} B \, x_t)_j.$$
This structure allows:
- $O(Tn)$ sequential per-sequence cost, or $O(\log T)$ parallel depth via recursive parallelization (Blelloch scan, FFT-style convolution) (Yue et al., 2023, Liu et al., 11 Apr 2025).
- Linear state evolution with stable eigenvalue constraints ($|\lambda_j| \leq 1$ for all $j$), directly controlling signal decay versus memory (Ling et al., 2 Feb 2026).
Parallel scan over a binary tree yields $O(\log T)$ step-wise depth for length-$T$ sequences, critical for hardware-accelerated training (Liu et al., 2024).
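The parallelization argument rests on the fact that affine maps $h \mapsto a h + b$ compose associatively, so prefix composition can be computed in logarithmically many vectorized steps. Below is a minimal sketch using a Hillis–Steele-style inclusive scan in NumPy (an $O(T \log T)$-work variant chosen for clarity; work-efficient Blelloch scans follow the same combine rule). Sizes and weights are illustrative:

```python
import numpy as np

def combine(e1, e2):
    # Compose two affine maps h -> a*h + b, applying e1 first, then e2:
    # h -> a2*(a1*h + b1) + b2 = (a1*a2)*h + (a2*b1 + b2).
    a1, b1 = e1
    a2, b2 = e2
    return a1 * a2, a2 * b1 + b2

def parallel_scan(a, b):
    # Hillis-Steele inclusive scan: O(log T) steps of vectorized work.
    a, b = a.copy(), b.copy()
    T = a.shape[0]
    shift = 1
    while shift < T:
        na, nb = combine((a[:-shift], b[:-shift]), (a[shift:], b[shift:]))
        a[shift:], b[shift:] = na, nb
        shift *= 2
    return b   # with h_0 = 0, h_t equals the accumulated offset b_t

rng = np.random.default_rng(1)
T, n = 16, 8
lam = 0.9 * np.exp(1j * rng.uniform(0, 2 * np.pi, n))   # stable eigenvalues
u = rng.standard_normal((T, n)) + 0j                     # stand-in for B x_t terms

h_scan = parallel_scan(np.broadcast_to(lam, (T, n)).copy(), u)

# Reference: sequential recurrence h_t = lam * h_{t-1} + u_t.
h, ref = np.zeros(n, dtype=complex), []
for t in range(T):
    h = lam * h + u[t]
    ref.append(h)
assert np.allclose(h_scan, np.stack(ref))
```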
3. Integration of Nonlinearity and Gating
Purely linear LRUs are theoretically universal on time-series but lack practical expressivity for complex tasks. Recent advances incorporate lightweight nonlinearity:
- Layerwise/Blockwise Nonlinearities: LayerNorm, BatchNorm, Feed-Forward MLPs (PFFN), and activation functions (GELU, SiLU) are applied before or after the LRU block (Yue et al., 2023, Pătrăucean et al., 2024, Liu et al., 2024).
- Gated Linear Units (GLU): Nonlinear gating of intermediate representations via $\mathrm{GLU}(z) = (W_1 z) \odot \sigma(W_2 z)$.
- Behavior-Dependent Gating: Gates depend solely on $x_t$ (not $h_{t-1}$), preserving trainability and allowing for time-parallel computation. Elementwise gating schedules per-channel memory retention and input injection (Pătrăucean et al., 2024, Liu et al., 2024).
- Residual Connections: Output is $y_t = x_t + f(\mathrm{LRU}(x_t))$ or similar, enhancing depthwise signal flow (Ling et al., 2 Feb 2026).
Gated and nonlinearly-augmented LRUs (e.g., TRecViT, RecBLR) outperform purely linear or classic RNN architectures on complex sequence modeling and exhibit superior convergence (Pătrăucean et al., 2024, Liu et al., 2024).
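As a concrete illustration of how these pieces combine, the following sketch implements a gated diagonal LRU block with $x_t$-dependent gates, GLU-style output gating, and a residual connection. The weight names (`W_r`, `W_i`, `W_in`, `W_g`, `W_out`), the sizes, and the specific gated-decay form `lam ** r_t` are assumptions for illustration, not the exact parameterization of any cited model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
T, d, n = 16, 6, 8                          # illustrative sizes
W_r = 0.1 * rng.standard_normal((n, d))     # recurrence-gate projection
W_i = 0.1 * rng.standard_normal((n, d))     # input-gate projection
W_in = 0.1 * rng.standard_normal((n, d))    # input projection (plays the role of B)
W_g = 0.1 * rng.standard_normal((n, d))     # GLU gate projection
W_out = 0.1 * rng.standard_normal((d, n))   # output projection back to model width
lam = 0.95 * np.ones(n)                     # base per-channel decay, |lam| < 1
x = rng.standard_normal((T, d))

h, ys = np.zeros(n), []
for t in range(T):
    r_t = sigmoid(W_r @ x[t])               # gates depend on x_t only, not on h:
    i_t = sigmoid(W_i @ x[t])               # the recurrence stays linear in h,
    a_t = lam ** r_t                        # so parallel-scan execution still applies
    h = a_t * h + i_t * (W_in @ x[t])
    glu = h * sigmoid(W_g @ x[t])           # GLU-style nonlinear output gating
    ys.append(x[t] + W_out @ glu)           # residual connection around the block
ys = np.stack(ys)
```

Because every nonlinearity sits outside the state update, the hidden trajectory can still be computed with the scan from Section 2; only the output path is nonlinear.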
4. Empirical Behavior and Expressivity
LRUs offer time and memory complexity linear in sequence length for long-sequence tasks, with performance competitive with or superior to Transformer and LSTM baselines across domains:
- Long-Range Modeling: Proven capacity for long-term dependencies due to explicit eigenvalue control and stabilized recurrence (Ling et al., 2 Feb 2026, Liu et al., 11 Apr 2025).
- Hierarchical Temporal Filtering: Deep LRU stacks exhibit emergent frequency filtering; early layers retain high-frequency content, while deeper layers focus on slow oscillations (Gallicchio et al., 2017). This is a direct consequence of repeated block-lower-triangular or banded recurrence.
- Synthetic and Real Datasets: LRUs outperform SSMs (e.g., S4), LSTMs, and Transformers in time-series forecasting, sequence prediction, and sequential recommendation with lower parameter and compute budgets (Yue et al., 2023, Liu et al., 2024, Liu et al., 11 Apr 2025, Dubinin et al., 12 Feb 2026).
- Block-Diagonal and Higher-Order Enhancements: BD-LRU and H-LRU models extend expressivity to permutation composition and complex dynamic tasks, with performance matching or exceeding dense and nonlinear models, especially for block size/order between 3 and 5 (Dubinin et al., 12 Feb 2026).
The table below summarizes select empirical results (from (Ling et al., 2 Feb 2026, Yue et al., 2023, Liu et al., 2024, Dubinin et al., 12 Feb 2026)):
| Task | LRU Variant | SOTA Baseline | Notable Metrics |
|---|---|---|---|
| Rotated online handwriting recognition | Stacked LRU w/ gating | S4, LSTM | accuracy (digits), angular error (rad.) |
| Long recommendation (XLong, ML-1M) | LRU, BD-LRU, RecBLR | SASRec, BERT4Rec | +3–9% Recall@10, 2x speed |
| Permutation composition | BD-LRU | LSTM, SSM | $1.00$ vs. $0.21$ accuracy |
| Sequence copy/compress/recall | BD-LRU, H-LRU | LSTM, Mamba, DeltaNet | BD-LRU outperforms at moderate block sizes |
5. Theoretical Properties and Training Protocols
LRUs, by virtue of their linearity and spectral parameterization, admit several salient theoretical and optimization advantages:
- Closed-Form Solutions: In supervised time-series tasks, output weights can be trained exactly by linear regression, with spectral pruning reducing network size for minimal function representations (Stolzenburg et al., 2018).
- Stability and Memory: Recurrence eigenvalues (magnitude $|\lambda_j| \leq 1$) directly regulate memory timescales, with phase encoding oscillatory patterns—crucial for tasks like superimposed oscillator prediction (Stolzenburg et al., 2018, Gallicchio et al., 2017).
- Efficient RTRL: For RL and continual learning, diagonal or block-diagonal LRUs make per-step real-time recurrent learning (RTRL) tractable, supporting fully online updates at cost linear in the state size (Elelimy et al., 2024); see the sketch after this list.
- Parallelization: Hardware-aware scan and tree-decomposition enable $O(\log T)$ sequence-processing depth for both training and inference (Liu et al., 2024, Yue et al., 2023).
- Loss and Optimization: Cross-entropy and MSE losses are typical; weight decay, dropout (0.1-0.3), Adam with learning rate decay, and layer/batch normalization are widely adopted (Ling et al., 2 Feb 2026, Yue et al., 2023, Liu et al., 11 Apr 2025).
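To illustrate why diagonal structure makes RTRL tractable, the sketch below carries the per-channel sensitivity $s_t = \partial h_t / \partial \lambda$ alongside the state and checks the resulting gradient against finite differences. The real-valued diagonal recurrence and per-step MSE loss are simplifying assumptions for this sketch, not the exact algorithm of (Elelimy et al., 2024):

```python
import numpy as np

rng = np.random.default_rng(3)
T, n = 10, 5
lam = rng.uniform(0.5, 0.95, n)            # real diagonal recurrence (illustrative)
u = rng.standard_normal((T, n))            # stand-in for the B x_t input terms
target = rng.standard_normal((T, n))

def run(lam):
    # Total squared-error loss of the linear recurrence against `target`.
    h, loss = np.zeros(n), 0.0
    for t in range(T):
        h = lam * h + u[t]
        loss += 0.5 * np.sum((h - target[t]) ** 2)
    return loss

# RTRL: for a diagonal recurrence the Jacobian dh_t/dlam is itself diagonal,
# so the sensitivity is one vector updated in O(n) per step, fully online.
h, s, grad = np.zeros(n), np.zeros(n), np.zeros(n)
for t in range(T):
    s = h + lam * s                        # s_t = h_{t-1} + lam * s_{t-1}
    h = lam * h + u[t]
    grad += (h - target[t]) * s            # accumulate dL_t/dlam online

# Check against central finite differences.
eps, fd = 1e-6, np.zeros(n)
for j in range(n):
    e = np.zeros(n)
    e[j] = eps
    fd[j] = (run(lam + e) - run(lam - e)) / (2 * eps)
assert np.allclose(grad, fd, atol=1e-5)
```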
6. Advanced Extensions and Hybrid Architectures
Recent works explore directions to further close the efficiency-expressivity gap:
- Block-Diagonal and Higher-Order State Mixing: BD-LRU and H-LRU architectures allow intra-block dense mixing and higher-order autoregression, respectively, with normalization-stabilized gating (Dubinin et al., 12 Feb 2026). BD-LRU with moderate block size (roughly 3–5) achieves expressivity competitive with LSTM and parameter-matched SSMs; a minimal sketch follows this list.
- Lattice LRUs: By decoupling temporal and depth flow via explicit dual-stream gates and projections, Lattice LRUs (UG-LRU, RG-LRU, PS-LRU) achieve faster convergence and stronger statistical efficiency, particularly in low-resource contexts (Ahuja et al., 2017).
- Recurrent Trace Units (RTU): For online RL with RTRL, RTUs extend LRU to two coupled real-valued channels per state with internal nonlinearities, further improving partial observability performance at negligible additional cost (Elelimy et al., 2024).
- Gated Video Transformers: In TRecViT, gated LRUs provide causal, memory-efficient alternatives to temporal self-attention for large video contexts, with competitive or improved accuracy and significantly reduced compute (Pătrăucean et al., 2024).
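A minimal sketch of the block-diagonal state update referenced above, in the style of BD-LRU but with illustrative sizes and an ad hoc spectral-radius rescaling standing in for a principled stability parameterization:

```python
import numpy as np

rng = np.random.default_rng(4)
T, n, k = 12, 8, 4                 # state dim n split into n // k blocks of size k
n_blocks = n // k

# One dense k-by-k recurrence matrix per block, rescaled so the spectral
# radius stays below 1 (a simple stability heuristic for this sketch).
blocks = []
for _ in range(n_blocks):
    M = rng.standard_normal((k, k))
    M *= 0.9 / np.abs(np.linalg.eigvals(M)).max()
    blocks.append(M)
A_blocks = np.stack(blocks)        # shape (n_blocks, k, k)

u = rng.standard_normal((T, n_blocks, k))   # stand-in for the B x_t input terms

# Blockwise recurrence: dense mixing inside each block, none across blocks.
# einsum batches the k-by-k matmuls; cost is O(n k) per step vs O(n^2) dense.
h = np.zeros((n_blocks, k))
states = []
for t in range(T):
    h = np.einsum('bij,bj->bi', A_blocks, h) + u[t]
    states.append(h.reshape(-1))   # flatten back to the full n-dim state
states = np.stack(states)
```

Because the blockwise affine maps still compose associatively (with $k \times k$ matrix products as the combine operation), the scan machinery of Section 2 carries over at $O(nk)$ cost per step.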
7. Limitations and Open Directions
While LRUs advance sequence modeling efficiency, several limitations persist:
- Expressivity Limit: Purely diagonal or linear LRUs, though theoretically universal for time-series, may require deeper stacking or additional gating to match highly nonlinear or high-frequency input-output mappings—hence the turn to BD-LRU or hybrid methods (Dubinin et al., 12 Feb 2026).
- Complex-Valued Computation: Implementation in real-valued frameworks may require careful handling or doubling state dimensions to accommodate complex arithmetic (Ling et al., 2 Feb 2026, Elelimy et al., 2024).
- Nonlinear Dynamics: Highly nonlinear sequence dynamics or multimodal distributions challenge shallow or ungated LRUs. Deep stacking and additional nonlinear projection/gating layers are standard remedies (Yue et al., 2023, Ling et al., 2 Feb 2026).
- Task-Specific Adaptation: Tuning of block size/order versus total width, normalization regimes, and residual connections remains an active research question, particularly as model sizes grow and task requirements vary (Dubinin et al., 12 Feb 2026).
Continued exploration of hybrid architectures, hardware-efficient scan algorithms, and more expressive structured mixing is expected to broaden LRU applicability and close the remaining performance gap to dense or attention-based models.