Linear Sequence Modeling: Scalable Methods

Updated 15 August 2025
  • Linear sequence modeling is a class of methods that process sequential data efficiently using linear complexity techniques like structured recurrences and kernelized attention.
  • Many of these models fit a unified three-stage framework (Expand, Oscillation, Shrink) that enables scalable training and inference across diverse domains such as language, vision, and audio.
  • Architectures including state space models, linear attention, and mixture-of-memories deliver improved memory management, parallelism, and performance over traditional quadratic models.

Linear sequence modeling encompasses a class of methods designed to process and model sequential data efficiently using architectures with linear complexity in sequence length. In contrast to traditional attention-based models—such as the Transformer—which have quadratic complexity due to full pairwise interactions, linear sequence models employ structured recurrences, convolutions, low-rank representations, or linearized attention mechanisms to achieve scalability on extremely long sequences. This area unifies techniques drawn from linear dynamical systems, convolutional signal processing, control theory, recent advances in neural network design (e.g., state space models, gated recurrence), and hardware-aware parallelism. The field is motivated by both theoretical insights into recurrent/compositional function spaces and by pragmatic demands for efficient large-scale sequence modeling in language, vision, audio, and scientific applications.

1. Unified Theoretical Frameworks

Recent research has established unifying perspectives that accommodate a broad array of linear sequence models under a single compositional structure. The Linear Complexity Sequence Model (LCSM) framework (Qin et al., 27 May 2024) provides a canonical representation in which sequence modeling is subdivided into three stages: Expand (mapping input to a high-dimensional memory state), Oscillation (recursive state update via element-wise or matrix operations), and Shrink (projection to output space). In this formulation, models such as linear attention, state space models (SSMs), long convolution, and linear RNNs are defined by their specific instantiations of the EOS stages and choice of data-dependent or hand-crafted parameterization.

Mathematically, the recurrent update is often written as:

$$m_t = g_\psi(o_t, m_{t-1}) + e_t i_t^\top$$

$$y_t = m_t^\top s_t$$

where $g_\psi$ is a binary operator (elementwise or matrix multiplication), $e_t$, $i_t$, $o_t$, and $s_t$ are the memory-related states, and the specifics vary per architecture. This unification reveals that the principal distinction between classes of linear sequence models is the mechanism for state expansion, state update, and output projection, with further nuance introduced by whether these mechanisms are data-driven or hand-crafted.
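
As a rough illustration, the sketch below instantiates the EOS recurrence with an elementwise gate; the array names, shapes, and the choice of gate are illustrative rather than taken from any particular model.

```python
import numpy as np

def eos_recurrence(E, I, O, S, gate_fn=np.multiply):
    """Minimal sketch of the Expand-Oscillation-Shrink recurrence
    m_t = g(o_t, m_{t-1}) + e_t i_t^T,  y_t = m_t^T s_t.

    E, I, O, S: (T, d) arrays of expand, input, oscillation, shrink states.
    gate_fn:    the binary operator g (elementwise product here).
    Shapes are illustrative; real models use per-head expanded memories.
    """
    T, d = E.shape
    m = np.zeros((d, d))          # memory state (d x d)
    ys = []
    for t in range(T):
        # Oscillation: decay the previous memory with o_t (broadcast over columns)
        m = gate_fn(O[t][:, None], m)
        # Expand: write the outer product e_t i_t^T into memory
        m = m + np.outer(E[t], I[t])
        # Shrink: read out y_t = m_t^T s_t
        ys.append(m.T @ S[t])
    return np.stack(ys)           # (T, d)

# Toy usage: random states for a length-8 sequence with d = 4
rng = np.random.default_rng(0)
E, I, O, S = (rng.standard_normal((8, 4)) for _ in range(4))
print(eos_recurrence(E, I, O, S).shape)   # (8, 4)
```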

2. Core Architectures and Mechanisms

Linear sequence modeling encompasses multiple architectural paradigms:

  • Linear (Kernelized) Attention: Models such as linear Transformers (Afzal et al., 22 Feb 2025) employ a kernel trick to factorize the attention computation as $O = Q(K^\top V)$, which is amenable to recurrent or parallel scan computation. Training is performed with parallelizable full attention, while inference is performed as a linear-time recurrent update (a numerical check of this equivalence appears after this list).
  • State Space Models: SSMs and their deep learning instantiations (e.g., S4/S5 (Smith et al., 2022), Mamba (Gu et al., 2023)) exploit parameterized linear dynamical systems with data-dependent or fixed dynamics, using recurrences of the form $x_{t+1} = A x_t + B u_t$ with outputs $y_t = C x_t + D u_t$ (in continuous or discretized time). Selective SSMs (Mamba) use input-dependent modification of transition parameters for content-based information flow.
  • Toeplitz Neural Networks and Structured Convolutions: The TNN (Qin et al., 2023) uses a Toeplitz matrix for token mixing, parameterizing the mixing kernel as a function of relative positions and leveraging $O(n \log n)$ FFT-based computation, which decouples parameter count from sequence length (see the FFT-based mixing sketch after this list).
  • Gated and Hierarchical Linear RNNs: Variants such as HGRN (Qin et al., 2023) use data-dependent gating, often with complex-valued recurrences and hierarchical lower bounds on forget gates, enabling control over the temporal receptive field at each layer.
  • Bidirectional Linear Recurrence: Frameworks like LION (Afzal et al., 22 Feb 2025) and BLUR (Liu et al., 11 Apr 2025) derive bidirectional recurrent formulations mathematically equivalent to full (non-causal) linear attention, enabling parallel training and efficient bidirectional inference.
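
As a numerical check of the kernelized-attention equivalence mentioned in the first bullet, the following NumPy sketch (the positive feature map phi and all function names are illustrative choices, not drawn from a specific paper) computes causal linear attention in the masked parallel form and as a linear-time recurrence, and verifies that the two agree.

```python
import numpy as np

def phi(x):
    # A simple positive feature map (one common choice); no softmax is used.
    return np.maximum(x, 0.0) + 1e-6

def causal_linear_attention_parallel(Q, K, V):
    """Quadratic-looking 'parallel' form: row-normalized, masked (phi(Q) phi(K)^T) V."""
    A = phi(Q) @ phi(K).T                      # (T, T) kernel scores
    A = np.tril(A)                             # causal mask
    A = A / A.sum(axis=1, keepdims=True)       # row normalization
    return A @ V

def causal_linear_attention_recurrent(Q, K, V):
    """Linear-time recurrent form: carry S_t = sum phi(k_s) v_s^T and z_t = sum phi(k_s)."""
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))
    z = np.zeros(d)
    out = np.zeros_like(V)
    for t in range(T):
        k, q = phi(K[t]), phi(Q[t])
        S += np.outer(k, V[t])                 # state update, O(d^2) per step
        z += k
        out[t] = (q @ S) / (q @ z)             # normalized readout
    return out

rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
assert np.allclose(causal_linear_attention_parallel(Q, K, V),
                   causal_linear_attention_recurrent(Q, K, V))
```

The recurrent form is what permits constant-memory, linear-time decoding at inference, while the parallel form supports efficient training.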
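
The Toeplitz mixing bullet can be illustrated the same way. The sketch below (the kernel values are random placeholders; TNN itself generates the kernel from relative positions, as noted above) applies causal Toeplitz token mixing via FFT in O(n log n) and checks it against the explicit Toeplitz matrix product.

```python
import numpy as np

def toeplitz_mix_fft(x, kernel):
    """Causal Toeplitz token mixing in O(n log n) via FFT.

    x:      (T, d) token features
    kernel: (T,)   mixing weights indexed by relative position 0 .. T-1
    Equivalent to y_t = sum_{s <= t} kernel[t - s] * x_s for every channel.
    """
    T, d = x.shape
    L = 2 * T                                   # zero-pad so circular conv = linear conv
    Kf = np.fft.rfft(kernel, n=L)               # (L/2 + 1,)
    Xf = np.fft.rfft(x, n=L, axis=0)            # (L/2 + 1, d)
    y = np.fft.irfft(Xf * Kf[:, None], n=L, axis=0)[:T]
    return y

# Check against the naive O(n^2) Toeplitz matrix multiply
rng = np.random.default_rng(2)
x, kernel = rng.standard_normal((8, 3)), rng.standard_normal(8)
Tmat = np.array([[kernel[i - j] if i >= j else 0.0 for j in range(8)] for i in range(8)])
assert np.allclose(toeplitz_mix_fft(x, kernel), Tmat @ x)
```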

Common to these architectures are techniques such as kernelized recurrence, explicit gating (additive or multiplicative), moment-based bias correction in learning algorithms (as in the MES approach (Lin et al., 2017)), and careful engineering for SIMD-friendly GPU execution (kernel fusion, parallel scan, etc.).
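
To make the parallel-scan remark concrete, the following sketch shows that a gated linear recurrence h_t = a_t * h_{t-1} + b_t can be evaluated with a log-depth associative scan that matches the sequential loop; this is an illustrative NumPy version of what hardware-efficient implementations fuse into a single GPU kernel, typically over matrix-valued states.

```python
import numpy as np

def scan_sequential(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t with h_0 = b_0 (elementwise, per channel)."""
    h = np.zeros_like(b)
    h[0] = b[0]
    for t in range(1, len(b)):
        h[t] = a[t] * h[t - 1] + b[t]
    return h

def scan_parallel(a, b):
    """Same recurrence via a log-depth (Hillis-Steele style) inclusive scan.
    The combine rule (a2, b2) o (a1, b1) = (a2*a1, a2*b1 + b2) is associative,
    which is the property hardware-efficient SSM/GLA kernels exploit."""
    a, b = a.copy(), b.copy()
    T = len(b)
    step = 1
    while step < T:
        # combine element t with element t - step, reading only previous-iteration values
        a_prev, b_prev = a[:-step].copy(), b[:-step].copy()
        b[step:] = a[step:] * b_prev + b[step:]
        a[step:] = a[step:] * a_prev
        step *= 2
    return b

rng = np.random.default_rng(3)
a, b = rng.uniform(0.5, 1.0, (16, 4)), rng.standard_normal((16, 4))
assert np.allclose(scan_sequential(a, b), scan_parallel(a, b))
```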

3. Memory Management and Capacity

A foundational limitation of earlier linear models was the compression of sequence information into a single memory state, leading to poor performance on recall-intensive tasks. Novel memory architectures have emerged:

  • Mixture-of-Memories (MoM): MoM (Du et al., 19 Feb 2025) introduces multiple independent memory slots per layer, with a router network directing each input token to a subset of memories (via top-k selection and normalization). Memory states are updated selectively and mixed via importance scores, mitigating interference and augmenting effective memory capacity while maintaining linear complexity (a routing sketch appears after this list).
  • Memory-Augmented Transformers: Memformer (Wu et al., 2020) and Gated Slot Attention (GSA) (Zhang et al., 11 Sep 2024) employ external, fixed-size memory slots. GSA uses a two-pass Gated Linear Attention (GLA) mechanism to recursively update slot content, followed by softmax-based context retrieval. This explicit separation of storage and retrieval, combined with data-dependent gating, improves recall and efficiency, especially in low-memory or real-time settings.
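
The following sketch illustrates the routing idea behind MoM for a single token; the router, the outer-product memory write, and all weight names are simplified placeholders rather than the published architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mom_step(x_t, memories, W_router, W_k, W_v, W_q, top_k=2):
    """One token step of a mixture-of-memories layer (illustrative sketch).

    memories: (M, d, d) independent memory slots.
    The router picks top_k memories, updates only those with the outer-product
    write k v^T, and mixes their read-outs with the normalized router scores.
    """
    scores = W_router @ x_t                      # (M,) router logits
    chosen = np.argsort(scores)[-top_k:]         # indices of the top-k memories
    weights = softmax(scores[chosen])            # normalized importance scores

    k, v, q = W_k @ x_t, W_v @ x_t, W_q @ x_t
    y_t = np.zeros_like(q)
    for w, m_idx in zip(weights, chosen):
        memories[m_idx] += np.outer(k, v)        # selective (sparse) memory update
        y_t += w * (memories[m_idx].T @ q)       # mix read-outs by router weight
    return y_t, memories

# Toy usage: 4 memory slots, model dimension 8
rng = np.random.default_rng(4)
d, M = 8, 4
mems = np.zeros((M, d, d))
W_router = rng.standard_normal((M, d))
W_k, W_v, W_q = (rng.standard_normal((d, d)) for _ in range(3))
y, mems = mom_step(rng.standard_normal(d), mems, W_router, W_k, W_v, W_q)
print(y.shape)   # (8,)
```

Only the chosen slots are written per token, so the per-token cost stays constant in sequence length while the total memory capacity scales with the number of slots.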

The performance of such architectures on recall-dependent benchmarks (e.g., multi-query associative recall, extractive QA) exceeds that of standard linear models, demonstrating the necessity of architectural innovations in memory routing and management.

4. Efficiency and Parallelism

Linear sequence models are designed to minimize both time and memory complexity:

  • Hardware-Aware Execution: Many linear models, especially recent SSMs and GLA/GLA-like variants (Liao et al., 28 May 2024), are implemented with single-kernel forward/backward scans, fusing multiple computational steps to maximize SRAM bandwidth on GPUs and reduce memory I/O. This design is critical to realizing theoretical complexity gains in practice.
  • Sequence Parallelism: Advances such as Linear Attention Sequence Parallelism (LASP) (Sun et al., 3 Apr 2024) capitalize on the "right-product-first" property of linear attention, enabling efficient partitioning of long sequences across devices with only $O(d^2)$ communication per head per sub-sequence (ring-based batched transmission of $K^\top V$ states), independent of sequence length (a chunked-computation sketch follows this list).
  • Scalability to Extreme Sequence Lengths: Models like TNN and LASP demonstrate the ability to generalize far beyond their training lengths (e.g., from 512 to 14,000+ tokens, or up to 4 million tokens distributed across 128 GPUs), with constant or sublinear overhead in terms of hardware requirements.
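
The right-product-first property that LASP exploits can be illustrated by a chunked computation in which only the accumulated K^T V state, a d-by-d matrix, is handed from one chunk (device) to the next. The sketch below simulates this sequentially for unnormalized causal linear attention and omits real communication collectives, normalization, and gating.

```python
import numpy as np

def chunked_linear_attention(Q, K, V, num_chunks=4):
    """Simulate sequence-parallel causal linear attention over chunks.

    Each chunk's output combines (i) the accumulated K^T V state from all
    previous chunks, a d x d matrix independent of sequence length, and
    (ii) its local intra-chunk causal attention. Only the state would need
    to be communicated between devices.
    """
    T, d = Q.shape
    Qs, Ks, Vs = (np.array_split(X, num_chunks) for X in (Q, K, V))
    state = np.zeros((d, V.shape[1]))            # accumulated K^T V from earlier chunks
    outputs = []
    for Qc, Kc, Vc in zip(Qs, Ks, Vs):
        inter = Qc @ state                       # contribution of all previous chunks
        intra = np.tril(Qc @ Kc.T) @ Vc          # causal attention within the chunk
        outputs.append(inter + intra)
        state = state + Kc.T @ Vc                # O(d^2) state handed to the next chunk
    return np.concatenate(outputs)

rng = np.random.default_rng(5)
Q, K, V = (rng.standard_normal((16, 4)) for _ in range(3))
reference = np.tril(Q @ K.T) @ V                 # monolithic causal computation
assert np.allclose(chunked_linear_attention(Q, K, V), reference)
```

Because the communicated state has a fixed d-by-d size, the per-device communication volume does not grow with sequence length, which is what makes extreme-length partitioning practical.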

These advances have facilitated scalable pretraining and inference for billion-parameter models on corpora and tasks requiring memory and reasoning over very long sequences.

5. Empirical Results and Benchmarks

The empirical superiority and efficiency of modern linear sequence models have been established across diverse benchmarks:

| Model | WikiText-103 Test PPL | LRA Avg (%) / Accuracy | Recall-Intensive Tasks | Throughput / Memory |
|---|---|---|---|---|
| Mamba-3B | Matches or surpasses Transformers twice its size | State-of-the-art on language/audio | Strong on induction/selective copy | 4–5× faster than Transformer |
| S5 (Smith et al., 2022) | | 87.4 (LRA), 98.5 (Path-X) | | Matches S4; parallel scan |
| TNN (Qin et al., 2023) | 24.67 | Competitive or superior | Extrapolates to 14K tokens | O(n log n) time; parameter count independent of sequence length |
| MoM (Du et al., 19 Feb 2025) | | | 27.6 (FDA, SQuAD, NQ, ...) | O(1) inference, near-Transformer recall |
| Linear-MoE (Sun et al., 7 Mar 2025) | Competitive | | | Maintains constant memory at high context |
| LASP (Sun et al., 3 Apr 2024) | | | | 4096K-token sequences on 128 GPUs, +38% throughput |
| ViG (Liao et al., 28 May 2024) | | >20% higher accuracy on 1024×1024 images | | 20%–27% of DeiT-B FLOPs; 4.8× speedup, 90% memory saved |

Performance analyses consistently highlight that, with careful design (e.g., gating, mixture-of-memories, hardware-efficient parallelism), linear sequence modeling can match or exceed Transformer baselines across language, time-series, vision, and scientific data modalities, especially when sequence lengths are extreme or memory/latency constraints are tight.

6. Theoretical and Practical Implications

Foundational work (e.g., (Afzal et al., 22 Feb 2025, Gu et al., 2023, Katsch, 2023)) has established equivalences between linear attention, bidirectional recurrence, and selective gating:

  • Theoretical Unification: The equivalence between bidirectional linear attention and a pair of directional RNNs (LION), and the identification of selective state transitions as content-dependent relative positional encodings (GateLoop), blur the classical boundaries separating convolution, recurrence, and attention (a numerical check of the bidirectional equivalence appears after this list).
  • Control Theory Integration: New selection mechanisms (Casti et al., 23 May 2025) leverage residual generation techniques from LTI system theory as externally gated feedback for dynamic attention, preserving LTI properties while achieving selection comparable to LTV methods (e.g., Mamba) but with enhanced scalability and stability.
  • Learning Challenges and Sample Complexity: Analysis of higher-order moment bias (e.g., skewness/kurtosis), as in the MES method for the Second Order Linear Model (Lin et al., 2017), clarifies that naive gradient descent is often insufficient for high-order sequence interactions; gradient-free or moment-corrected approaches are essential for tractable, consistent parameter estimation.
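
The bidirectional equivalence can be checked numerically in the decay-free case: full non-causal linear attention equals the sum of a forward and a backward linear recurrence with the doubly counted diagonal term subtracted once. The sketch below is illustrative and omits the gating and masking that LION additionally handles.

```python
import numpy as np

def full_linear_attention(Q, K, V):
    """Non-causal (bidirectional) linear attention, unnormalized: O = Q (K^T V)."""
    return Q @ (K.T @ V)

def bidirectional_recurrent(Q, K, V):
    """Same output from one forward and one backward linear recurrence.
    Each direction carries a running d x d state of k_s v_s^T outer products;
    the diagonal term k_t v_t^T appears in both passes, so subtract it once."""
    T, d = Q.shape
    S_fwd = np.zeros((d, V.shape[1]))
    S_bwd = np.zeros((d, V.shape[1]))
    fwd_states, bwd_states = [None] * T, [None] * T
    for t in range(T):                            # forward recurrence
        S_fwd = S_fwd + np.outer(K[t], V[t])
        fwd_states[t] = S_fwd
    for t in reversed(range(T)):                  # backward recurrence
        S_bwd = S_bwd + np.outer(K[t], V[t])
        bwd_states[t] = S_bwd
    out = np.zeros_like(V)
    for t in range(T):
        out[t] = Q[t] @ (fwd_states[t] + bwd_states[t] - np.outer(K[t], V[t]))
    return out

rng = np.random.default_rng(6)
Q, K, V = (rng.standard_normal((10, 4)) for _ in range(3))
assert np.allclose(full_linear_attention(Q, K, V), bidirectional_recurrent(Q, K, V))
```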

The field's rapid development has direct implications for the design of foundation models optimized for linear scaling, hybrid mixture-of-experts integration, memory-constrained inference, and explicit handling of long-term dependencies in demanding applications.

7. Applications and Future Directions

Linear sequence modeling architectures now underpin production-scale foundation models, high-throughput streaming inference systems, and long-sequence scientific or image/vision models. Notable use cases include:

  • Language Modeling: Foundation models incorporating Mamba, GateLoop, MoM, or Linear-MoE architectures demonstrate competitive or state-of-the-art perplexity and generalization on extensive natural language benchmarks.
  • Vision and Multimodal Tasks: Bidirectional gated linear attention (ViG), TNN-based token mixing, and HGRN mixers outperform quadratic transformers and CNN backbones in both accuracy and computational cost on ImageNet, segmentation, and object detection tasks.
  • Time-Series and Forecasting: BLUR (Liu et al., 11 Apr 2025) excels in forecasting with linear complexity and provable stability/approximation bounds, critical for domains like weather prediction and energy load monitoring.
  • Memory and Recall: Mixture-of-memories, slot attention, and external memory architectures deliver improvements for tasks where recall of distant information is essential (e.g., QA, dialogue, code completion).

Ongoing directions include further integration of control theory (for interpretable, provably robust selection), scaling hybrid models to ultra-long contexts, optimizing hardware-software co-design (kernel fusions, efficient memory pipelines), and characterizing the interplay between data-driven and hand-crafted recurrent/oscillation modules for different classes of sequence tasks.


In totality, linear sequence modeling draws together deep learning, signal processing, dynamical systems, and distributed systems engineering. Progress in this area continues to redefine the practical and theoretical frontiers of scalable sequential data modeling in scientific and industrial domains.