LRURec for Sequential Recommendation
- The paper introduces a novel linear recurrent architecture that achieves competitive accuracy while drastically reducing training time and hardware overhead.
- Linear recurrent units replace nonlinear dependencies with a closed-form, parallelizable recurrence; eigendecomposition diagonalizes the transition matrix, and constraints on the eigenvalue magnitudes keep the recurrence stable.
- Behavior-dependent gating and hardware-aware parallel scan strategies enable real-time, scalable recommendations on heterogeneous, long user interaction sequences.
Linear Recurrent Units for Sequential Recommendation (LRURec) is a class of sequential recommender architectures that model user behavior as a linear, time-evolving process. By relying on purely linear recurrence relations and recursive parallelization, LRURec and its variants achieve high training efficiency, low-latency inference, and recommendation accuracy competitive with self-attention and traditional gated RNN models. Recent advances, particularly behavior-dependent gating and hardware-aware parallel scan strategies, extend LRURec to heterogeneous interaction histories and improve hardware utilization on real-world datasets (Yue et al., 2023, Liu et al., 2024).
1. Mathematical Foundations of Linear Recurrent Units
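The core operation in LRURec replaces typical non-linear RNN or self-attention modules with a purely linear recurrence:

$$h_k = A h_{k-1} + B x_k,$$

where $x_k$ is the input, $h_k$ is the hidden state, and $A$ and $B$ are learnable matrices. To optimize computational efficiency, $A$ is diagonalized as $A = P \Lambda P^{-1}$ with $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$, and the dynamics are computed in the transformed basis (with $P^{-1}$ and $P$ absorbed into the learnable input and output projections):

$$h_k = \Lambda h_{k-1} + B x_k,$$

with element-wise constraints on eigenvalues to guarantee recurrent stability ($|\lambda_i| < 1$). This structure enables closed-form prefix computations and recursion-based parallelization for both training and inference (Yue et al., 2023).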
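The diagonal recurrence can be illustrated with a minimal sketch in plain Python. The exponential eigenvalue parametrization used here, $\lambda = \exp(-\exp(\nu) + i \exp(\theta))$, is an assumption borrowed from common linear recurrent unit designs; the text above only states that eigenvalue magnitudes are constrained below one. The names `stable_eigenvalue` and `diagonal_recurrence` are hypothetical.

```python
import cmath
import math


def stable_eigenvalue(nu_log: float, theta_log: float) -> complex:
    """Map two unconstrained reals to a complex eigenvalue with |lambda| < 1.

    Assumed parametrization: lambda = exp(-exp(nu_log) + i * exp(theta_log)).
    The magnitude is exp(-exp(nu_log)), which stays below 1 as long as
    exp(nu_log) does not underflow to zero.
    """
    return cmath.exp(complex(-math.exp(nu_log), math.exp(theta_log)))


def diagonal_recurrence(lams, drives):
    """Run h_k = lam * h_{k-1} + u_k independently for each hidden dimension.

    lams:   list of d complex eigenvalues (the diagonal of Lambda).
    drives: list of T steps, each a list of d values u_k = B x_k, i.e. the
            inputs already projected into the diagonal basis.
    Returns the list of T hidden states, starting from h_0 = 0.
    """
    h = [0j] * len(lams)
    states = []
    for u in drives:
        h = [lam * h_i + u_i for lam, h_i, u_i in zip(lams, h, u)]
        states.append(h)
    return states


# Hypothetical toy values: two hidden dimensions, three time steps.
lams = [stable_eigenvalue(-1.0, 0.0), stable_eigenvalue(0.5, -2.0)]
drives = [[1.0, 1.0], [0.0, 2.0], [3.0, 0.0]]
states = diagonal_recurrence(lams, drives)
```

Because the transition is diagonal, each hidden dimension evolves independently, so one step costs time linear in the hidden size rather than quadratic.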
2. Recursive Parallelization and Hardware Acceleration
A principal advantage of LRURec is the recursive parallelization achievable via the scan (prefix sum) operator. With $h_0 = 0$, the hidden state after $k$ steps can be decomposed as:

$$h_k = \sum_{i=1}^{k} \Lambda^{k-i} B x_i,$$

where $\Lambda$ is the diagonal matrix of eigenvalues and $B x_i$ is the projected input at step $i$. Using associative scan identities, the sequence is partitioned and computed hierarchically in $O(\log T)$ time depth for a sequence of length $T$, which is amenable to GPU acceleration. Each round uses batched kernels; after left-padding the sequence length to a power of two for maximal hardware utilization, this approach reduces wall-clock training time by an order of magnitude relative to serial recurrence (Yue et al., 2023, Liu et al., 2024).
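The following sketch shows why a linear recurrence admits a logarithmic-depth scan. It uses recursive doubling (a Hillis-Steele style schedule), which is one standard way to realize an associative scan and not necessarily the partitioning used in the cited implementations. A scalar decay `lam` stands in for one diagonal entry of $\Lambda$, and the function names are hypothetical.

```python
def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first.

    (a1, b1) followed by (a2, b2) gives h -> a2*(a1*h + b1) + b2.
    The operator is associative, which is what a prefix scan requires.
    """
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_states(lam, drives):
    """Reference: serial recurrence h_k = lam*h_{k-1} + u_k with h_0 = 0."""
    h, out = 0.0, []
    for u in drives:
        h = lam * h + u
        out.append(h)
    return out


def scan_states(lam, drives):
    """Inclusive scan by recursive doubling.

    Each round combines element i with element i - offset. All updates in a
    round are independent, so a round could run as one batched kernel.
    The number of rounds is ceil(log2(T)).
    """
    elems = [(lam, u) for u in drives]
    offset, rounds = 1, 0
    while offset < len(elems):
        elems = [
            combine(elems[i - offset], elems[i]) if i >= offset else elems[i]
            for i in range(len(elems))
        ]
        offset *= 2
        rounds += 1
    # With h_0 = 0, the hidden state is the offset term b of each prefix map.
    return [b for _, b in elems], rounds


lam = 0.9
drives = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, 1.5, 2.5]
serial = sequential_states(lam, drives)
parallel, rounds = scan_states(lam, drives)
```

For the eight-step toy sequence the scan finishes in three rounds and agrees with the serial loop up to floating-point rounding. Recursive doubling performs more total work than a work-efficient up/down sweep, so it is shown here only for clarity.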
3. Architectural Enhancements and Nonlinear Augmentation
To circumvent the limited expressivity of purely linear dynamics, LRURec integrates nonlinearity through transformer-inspired modules:
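- Layer Normalization: $\mathrm{LayerNorm}(x) = \gamma \odot \dfrac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$, where $\mu$ and $\sigma^2$ are the mean and variance over the feature dimension and $\gamma$, $\beta$ are learnable.
- Position-wise Feed-Forward Network (PFFN): $\mathrm{PFFN}(x) = \mathrm{GELU}(x W_1 + b_1) W_2 + b_2$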
- Residual Learning: Post-PFFN output is merged via residual connections and re-normalization.
These enhancements are stacked across multiple blocks, substantially improving training dynamics and model capacity with minimal impact on computational or memory complexity (Yue et al., 2023).
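As a rough illustration of the nonlinear augmentation, the sketch below applies a position-wise feed-forward network, a residual connection, and layer normalization to a single feature vector. The ordering (residual, then normalization), the GELU activation, and the omission of the learnable scale and shift are assumptions made for brevity; the weights are hypothetical toy values.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance.

    The learnable scale and shift are omitted (equivalent to gamma=1, beta=0).
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gelu(v):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Compute x @ weight + bias for a vector x and a row-major matrix."""
    return [
        sum(x_i * weight[i][j] for i, x_i in enumerate(x)) + bias[j]
        for j in range(len(bias))
    ]


def pffn_block(x, w1, b1, w2, b2):
    """Position-wise feed-forward, residual connection, then LayerNorm."""
    hidden = [gelu(v) for v in linear(x, w1, b1)]
    out = linear(hidden, w2, b2)
    return layer_norm([x_i + o_i for x_i, o_i in zip(x, out)])


# Hypothetical toy weights: model width 2, inner width 3.
w1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
b1 = [0.0, 0.1, -0.1]
w2 = [[0.2, -0.4], [0.7, 0.1], [-0.3, 0.6]]
b2 = [0.05, -0.05]
y = pffn_block([1.0, -2.0], w1, b1, w2, b2)
```

The block acts on each sequence position separately, which is why it adds nonlinearity without introducing any dependence between time steps.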
4. Behavior-Dependent Linear Recurrent Units: The RecBLR Model
RecBLR introduces the Behavior-Dependent LRU (BD-LRU), advancing the static LRU design by:
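- Replacing fixed dynamics with input-dependent gating:
$$h_t = a_t \odot h_{t-1} + \sqrt{1 - a_t^2} \odot (i_t \odot x_t),$$
where $a_t \in (0, 1)^d$ is the per-dimension decay and $i_t$ is the input gate, with per-dimension gates computed from the current input $x_t$.
- Gating Mechanism: $r_t = \sigma(W_r x_t + b_r)$ (recurrence gate) and $i_t = \sigma(W_i x_t + b_i)$ (input gate).
- Gate-to-Scale Mapping: Per-dimension rates (softplus-parametrized, stabilized) yield $a_t = \exp\left(-c \cdot \mathrm{softplus}(\Lambda) \odot r_t\right)$, where $\Lambda$ is a learnable vector and $c$ is a fixed positive scalar.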
This design ensures dynamic memory/input blending and numerical robustness, replacing overparameterized complex-valued dynamics with streamlined, real-valued, input-responsive recurrence (Liu et al., 2024).
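A minimal sketch of one behavior-dependent update follows, assuming a gated form in which a sigmoid recurrence gate scales a softplus-parametrized decay rate and the input is rescaled by $\sqrt{1 - a_t^2}$. The gate projections are simplified to per-dimension scalars, the constant `c = 8.0` is an assumed default, and every name in `params` is hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def softplus(v):
    return math.log1p(math.exp(v))


def bd_lru_step(h_prev, x, params, c=8.0):
    """One gated update, applied independently to each hidden dimension.

    params holds per-dimension scalars (a diagonal simplification of the
    gate projections): w_r, b_r for the recurrence gate, w_i, b_i for the
    input gate, and lam for the softplus-parametrized decay rate.
    """
    h_next = []
    for d, (h, x_d) in enumerate(zip(h_prev, x)):
        r = sigmoid(params["w_r"][d] * x_d + params["b_r"][d])  # recurrence gate
        i = sigmoid(params["w_i"][d] * x_d + params["b_i"][d])  # input gate
        # Decay in (0, 1), computed in log space for numerical robustness.
        a = math.exp(-c * softplus(params["lam"][d]) * r)
        # sqrt(1 - a^2) rescales the gated input to balance it against memory.
        h_next.append(a * h + math.sqrt(1.0 - a * a) * (i * x_d))
    return h_next


# Hypothetical toy parameters for a 2-dimensional hidden state.
params = {
    "w_r": [0.5, -0.3], "b_r": [0.0, 0.1],
    "w_i": [1.0, 0.7], "b_i": [0.2, -0.2],
    "lam": [0.0, -1.0],
}
h = [0.0, 0.0]
for x in [[1.0, -0.5], [0.2, 0.3], [-1.0, 2.0]]:
    h = bd_lru_step(h, x, params)
```

Because the decay depends on the current input, an informative interaction can shorten or lengthen the effective memory of each dimension, while the update stays linear in the previous state and therefore remains compatible with a parallel scan.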
RecBLR's architecture includes:
- Embedding initialization,
- Multi-layer behavior modeling with BD-LRU and causal convolutions,
- Dropout, residual, and LayerNorm throughout,
- Final softmax scoring over the item vocabulary.
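The final scoring step can be sketched as a dot product between the last hidden state and each item embedding, followed by a softmax. Reusing the item embedding table as the output layer is an assumption here, and the embeddings and hidden state are hypothetical toy values.

```python
import math


def score_items(hidden, item_embeddings):
    """Dot-product logits between the final hidden state and every item embedding."""
    return [sum(h * e for h, e in zip(hidden, emb)) for emb in item_embeddings]


def softmax(logits):
    """Numerically stable softmax (subtract the max before exponentiating)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def top_k(probs, k):
    """Indices of the k most probable items, best first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


# Hypothetical toy data: 4 items with 3-dimensional embeddings.
item_embeddings = [
    [0.1, 0.3, -0.2],
    [0.9, -0.1, 0.2],
    [-0.5, 0.2, 0.1],
    [0.3, 0.3, 0.3],
]
hidden = [1.0, 0.5, -1.0]
probs = softmax(score_items(hidden, item_embeddings))
recommended = top_k(probs, 2)
```

At serving time only the ranking matters, so the softmax can be skipped and the top items taken directly from the logits.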
5. Complexity, Scalability, and Inference Properties
LRURec and RecBLR offer theoretical and empirical advantages in time and space complexity compared to conventional RNN and Transformer-based recommenders:
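| Model | Training Time (per user) | Inference Time | Memory Footprint |
|---|---|---|---|
| LRURec/RecBLR | $O(\log T)$ parallel depth | $O(1)$ per new interaction | $O(d)$ recurrent state |
| RNN | $O(T)$ sequential steps | $O(1)$ per new interaction | $O(d)$ recurrent state |
| Transformer | $O(T^2)$ pairwise attention | $O(T)$ per new interaction with cached keys and values | $O(T^2)$ attention map |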
RecBLR employs hardware-aware padding and parallel scan acceleration, minimizing compute overhead and using modern GPU kernels (Triton/CUDA) for the up-sweep and down-sweep tree operations on hidden states. Embedding-only padding avoids memory blow-up: the power-of-two length adjustment is applied only at the entry point of each BD-LRU layer, and the output is truncated back to the original length afterwards (Liu et al., 2024).
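The pad-then-truncate idea can be sketched as follows. The helper names are hypothetical, and `decay_recurrence` is only a stand-in for a recurrent layer; the point is that zero left-padding with a zero initial state leaves the real positions unchanged.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def left_pad(seq, pad_value):
    """Left-pad a sequence to a power-of-two length; also return the pad count."""
    pad = next_power_of_two(len(seq)) - len(seq)
    return [pad_value] * pad + list(seq), pad


def run_padded(seq, recurrent_fn, pad_value=0.0):
    """Pad only at the recurrent layer's entry point, then truncate its output."""
    padded, pad = left_pad(seq, pad_value)
    return recurrent_fn(padded)[pad:]


def decay_recurrence(values, lam=0.5):
    """Stand-in recurrent layer: h_k = lam*h_{k-1} + u_k with h_0 = 0."""
    h, out = 0.0, []
    for u in values:
        h = lam * h + u
        out.append(h)
    return out


seq = [1.0, 2.0, 3.0, 4.0, 5.0]
out = run_padded(seq, decay_recurrence)
```

Here a length-5 sequence is padded to length 8 inside the recurrent call only, so any surrounding layers would still see five positions.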
6. Empirical Results and Ablation Findings
LRURec and RecBLR have been validated on public benchmarks including ML-1M, Amazon, Steam, Gowalla, and XLong (Alibaba) (Yue et al., 2023, Liu et al., 2024). Key results include:
- On ML-1M, RecBLR achieves HR@10=0.3285 and NDCG@10=0.1901, surpassing the best baseline LRURec by 7.5% and 7.3% relative, respectively.
- Across five datasets, RecBLR yields 1.7–9.1% (HR) and 1.3–9.0% (NDCG) relative improvement over competitive methods (FPMC, Caser, GRU4Rec, SASRec, BERT4Rec, FMLP-Rec, LRURec).
- Training time per epoch on long-range datasets (XLong, average length ≈800): RecBLR (parallel) takes 263 s versus 595 s for LRURec (serial), while SASRec (quadratic attention) runs out of memory (OOM) at T ≈ 800.
- In batched online inference, LRURec attains over 7x throughput advantage relative to SASRec at typical sequence lengths.
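For readers unfamiliar with the reported metrics, the sketch below computes HR@K and NDCG@K under the usual next-item protocol with one held-out item per user, along with the relative improvement used in the comparisons above. The ranks are hypothetical toy values, not results from the cited papers.

```python
import math


def hit_rate_at_k(ranks, k=10):
    """Fraction of test cases whose held-out item is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def ndcg_at_k(ranks, k=10):
    """Mean NDCG@k with one relevant item per case.

    With a single relevant item the ideal DCG is 1, so each case contributes
    1 / log2(rank + 1) when the item is in the top k and 0 otherwise.
    """
    return sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / len(ranks)


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Hypothetical 1-based ranks of the held-out item for five users.
ranks = [1, 3, 12, 7, 25]
hr = hit_rate_at_k(ranks)
ndcg = ndcg_at_k(ranks)
```

HR@K only counts whether the target appears in the top K, whereas NDCG@K also rewards placing it nearer the top, which is why the two metrics can move by different amounts.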
Ablation studies demonstrate:
- Using a single recurrent layer reduces HR@10 by ≈5%.
- Omitting either BD-LRU gating structures or temporal convolution degrades HR/NDCG by 2–8%.
- Larger dropout (0.4–0.5) is beneficial for high-sparsity datasets.
7. Practical Considerations and Deployment
LRURec and RecBLR bridge the "impossible triangle" for sequential recommendation: simultaneous high accuracy, training efficiency, and low-latency inference (Yue et al., 2023, Liu et al., 2024). The closed-form and prefix-scan design enables real-time streaming recommendations, while the low memory profile and decoupling from quadratic sequence dependencies support industrial-scale deployment. Custom initialization (eigenvalue rates) and implementation nuances (left-padding, kernel parallelism, embedding truncation) are essential for stability and efficiency.
Empirical evidence indicates that LRURec and RecBLR scale robustly to long interaction sequences and sparser data domains, with parameter and architectural choices generalizing well across real-world e-commerce, social, and entertainment datasets. A plausible implication is broad applicability to session-based and history-intensive recommender systems that operate under stringent latency and hardware constraints.