State Space Sequence Models (SSMs)

Updated 29 August 2025
  • State Space Sequence Models are a family of architectures based on control theory that model sequential data through state-space equations.
  • Structured SSMs use advanced parameterizations like HiPPO and DPLR to efficiently capture long-range dependencies and reduce computational complexity.
  • Empirical benchmarks show SSMs achieving competitive accuracy with linear or near-linear complexity, making them effective in NLP, vision, audio, and time-series applications.

State Space Sequence Models (SSMs) comprise a family of architectures and algorithms grounded in classical control and systems theory, generalized and adapted for modern machine learning tasks that involve processing long, structured, or irregularly sampled sequential data. Recent advances have established SSMs as a highly competitive alternative or complement to recurrent neural networks and attention-based Transformers, particularly when modeling long-range dependencies or achieving computational and memory efficiency is paramount.

1. Mathematical Foundations and Model Structure

SSMs formulate the relationship between inputs, latent states, and outputs using state-space equations, typically either in continuous or discrete time. The classical continuous-time SSM is defined by a system of ordinary differential equations:

$$\frac{dx(t)}{dt} = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t)$$

where $x(t)$ is the latent state, $u(t)$ is the input, $y(t)$ is the output, and $A, B, C, D$ are system matrices.

Discretization, such as via zero-order hold or the bilinear transform, yields the discrete-time analogue:

$$x_{k} = \bar{A}\, x_{k-1} + \bar{B}\, u_{k}, \qquad y_{k} = C\, x_{k}$$

This formulation admits both classical filters (e.g., the Kalman filter) and recent deep-learning-inspired parameterizations.
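
To make the discretization and recurrence concrete, here is a minimal NumPy sketch (illustrative code, not drawn from any of the cited papers) that discretizes a single-input SSM with the bilinear transform and runs the resulting linear recurrence; the step size `dt` and the toy matrices in the usage snippet are arbitrary choices.

```python
import numpy as np

def discretize_bilinear(A, B, dt):
    """Bilinear (Tustin) discretization: returns (A_bar, B_bar) for step size dt."""
    N = A.shape[0]
    I = np.eye(N)
    inv = np.linalg.inv(I - (dt / 2.0) * A)
    A_bar = inv @ (I + (dt / 2.0) * A)
    B_bar = inv @ (dt * B)
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, u):
    """x_k = A_bar x_{k-1} + B_bar u_k,  y_k = C x_k,  with x_0 = 0."""
    x = np.zeros(A_bar.shape[0])
    ys = []
    for u_k in u:
        x = A_bar @ x + B_bar[:, 0] * u_k
        ys.append(C @ x)
    return np.array(ys)

# Toy usage: 4 state dimensions, stable random dynamics, length-100 input.
rng = np.random.default_rng(0)
A = -np.eye(4) + 0.1 * rng.standard_normal((4, 4))
B = rng.standard_normal((4, 1))
C = rng.standard_normal(4)
A_bar, B_bar = discretize_bilinear(A, B, dt=0.1)
y = ssm_recurrence(A_bar, B_bar, C, rng.standard_normal(100))
```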

SSMs can also be unrolled, converting the recurrence into a convolution of the input with a kernel that summarizes the influence of past inputs:

$$y = \bar{K} * u, \qquad \bar{K} = \left[\, C\bar{B},\; C\bar{A}\bar{B},\; C\bar{A}^{2}\bar{B},\; \ldots \,\right]$$
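
The equivalence between the recurrent and convolutional views can be checked directly. The sketch below (again illustrative, following the conventions of the previous snippet) materializes the kernel $\bar{K}$ and applies it as a causal FFT-based convolution, which should agree with the step-by-step recurrence up to floating-point error.

```python
import numpy as np

def ssm_kernel(A_bar, B_bar, C, L):
    """K_bar = [C B_bar, C A_bar B_bar, C A_bar^2 B_bar, ...] truncated to length L."""
    K = np.empty(L)
    v = B_bar[:, 0]
    for k in range(L):
        K[k] = C @ v
        v = A_bar @ v
    return K

def ssm_convolution(K, u):
    """Causal convolution y_k = sum_{j <= k} K_j u_{k-j}, computed via the FFT."""
    L = len(u)
    n = 1 << (2 * L - 1).bit_length()   # pad to avoid circular wrap-around
    y = np.fft.irfft(np.fft.rfft(K, n) * np.fft.rfft(u, n), n)[:L]
    return y

# ssm_convolution(ssm_kernel(A_bar, B_bar, C, len(u)), u) should match
# ssm_recurrence(A_bar, B_bar, C, u) from the previous sketch.
```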

2. Structured SSMs: Parameterization and Efficient Computation

Recent SSMs, such as S4 and its variants, increase modeling power and efficiency through principled parameterization of the state dynamics:

  • HiPPO Framework: High-Order Polynomial Projection Operators (HiPPO) yield state matrices (e.g., Legendre-based) that encode the history of the inputs as projections onto orthogonal polynomials, optimizing memory retention for long-range dependencies. For example, the HiPPO–LegS matrix captures inputs as a combination of exponentially-warped Legendre polynomials (Gu et al., 2022).
  • Diagonal Plus Low-Rank (DPLR) Parameterization: S4 and related models represent the state transition as $A = \Lambda - P Q^{*}$, enabling diagonalization for fast frequency-domain computation. This supports efficient evaluation of the convolution kernel via the Woodbury identity and FFT-based methods, reducing complexity to nearly $O((N+L)\log N)$ in state size $N$ and sequence length $L$ (Gu et al., 2021); the underlying structure is sketched numerically after this list.
  • Companion Matrix and AR Processes: SpaceTime parameterizes the state matrix as a companion matrix to efficiently represent autoregressive and exponential smoothing processes, with efficient evaluation of the convolutional kernel via shift-plus-rank-one decompositions and fast Fourier transform acceleration (Zhang et al., 2023).
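
As a concrete illustration of these parameterizations, the sketch below constructs the HiPPO-LegS matrix using the standard formulas from the HiPPO/S4 literature and numerically checks the normal-plus-low-rank structure (a skew-symmetric matrix, shifted by $-\tfrac{1}{2}I$, minus a rank-1 term) that the DPLR parameterization exploits; treat it as a reference sketch rather than a production initializer.

```python
import numpy as np

def hippo_legs(N):
    """HiPPO-LegS state matrix A (N x N) and input vector B, as usually written."""
    n = np.arange(N)
    A = -np.sqrt(np.outer(2 * n + 1, 2 * n + 1))   # candidate entries for n > k
    A = np.tril(A, -1) + np.diag(-(n + 1.0))       # keep strict lower triangle, set diagonal
    B = np.sqrt(2 * n + 1)
    return A, B

A, B = hippo_legs(8)
P = np.sqrt(np.arange(8) + 0.5)                    # rank-1 correction vector
S = A + np.outer(P, P) + 0.5 * np.eye(8)
assert np.allclose(S, -S.T)                        # A + P P^T is normal (skew-symmetric - I/2)
```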

These structural innovations ensure that SSMs can scale to long sequences and high-dimensional data while preserving both the memory of inputs and efficient computation.

3. Selective and Modular SSMs

To further enhance the flexibility and expressivity of SSMs, several recent lines of work have introduced selectivity, context gating, and modular state partitioning:

  • Selective SSMs (Mamba, SeRpEnt): These models make the state parameters (e.g., $\bar{A}_k, \bar{B}_k, C_k$) input- or token-dependent, enabling dynamic selection of what to store, update, or forget. SeRpEnt uses a learned sequence-resampling mechanism based on the information content of each token (quantified by $\Delta_\ell$), compressing long inputs without sacrificing essential information (Rando et al., 20 Jan 2025). A simplified selective-scan sketch follows this list.
  • SlotSSMs (Modular SSMs): SlotSSMs partition the hidden state into multiple 'slots', each responsible for tracking distinct entities or mechanisms (e.g., objects in video), updating them independently and only allowing sparse interactions via self-attention. This modularization supports efficient and scalable modeling of systems with underlying compositional structure (Jiang et al., 18 Jun 2024).
  • Gated SSMs: Selective gating functions (often based on sigmoidal units) dynamically filter the hidden state update, controlling the trade-off between information retention and compactness. Theoretical results from rate-distortion and mutual information analysis quantify the efficiency-accuracy frontier in these models (Bhat, 4 Oct 2024).
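
To make the selectivity mechanism concrete, the following is a deliberately simplified sketch of an input-dependent scan in the spirit of Mamba; the projection matrices (`W_delta`, `W_B`, `W_C`), the per-channel diagonal dynamics `A`, and the softplus/ZOH discretization choices are illustrative assumptions, not the published architecture.

```python
import numpy as np

def selective_scan(u, W_delta, W_B, W_C, A):
    """u: (L, D) inputs; A: (D, N) diagonal dynamics per channel; returns (L, D) outputs."""
    L, D = u.shape
    N = A.shape[1]
    x = np.zeros((D, N))                          # one length-N state per channel
    ys = np.empty((L, D))
    for t in range(L):
        delta = np.log1p(np.exp(u[t] @ W_delta))  # softplus step size, shape (D,)
        B_t = u[t] @ W_B                          # input-dependent write projection, (N,)
        C_t = u[t] @ W_C                          # input-dependent read projection, (N,)
        A_bar = np.exp(delta[:, None] * A)        # ZOH-style discretization of diag(A)
        x = A_bar * x + (delta[:, None] * B_t) * u[t][:, None]
        ys[t] = x @ C_t                           # read out each channel's state
    return ys

# Toy usage with made-up shapes: 32 tokens, 8 channels, state size 16.
rng = np.random.default_rng(0)
L, D, N = 32, 8, 16
u = rng.standard_normal((L, D))
out = selective_scan(u,
                     W_delta=rng.standard_normal((D, D)) * 0.1,
                     W_B=rng.standard_normal((D, N)) * 0.1,
                     W_C=rng.standard_normal((D, N)) * 0.1,
                     A=-np.exp(rng.standard_normal((D, N))))
```

Because $\bar{A}_k$, $\bar{B}_k$, and $C_k$ now vary per token, the model is no longer a time-invariant convolution; implementations therefore rely on recurrent or parallel scans rather than the FFT-based kernel view.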

4. Empirical Performance and Benchmark Results

SSMs, particularly in their structured and selective forms, consistently demonstrate strong empirical performance across standard sequence modeling tasks:

| Model/Variant | Benchmark | Accuracy/Metric | Computational Benefit |
|---|---|---|---|
| S4 (HiPPO/LegT) | LRA (Path-X) | 91–96% | Linear (FFT-based) |
| S5 (MIMO SSM) | LRA (Path-X) | 98.5% | Parallel scan, time-domain recurrence |
| SpaceTime (Companion SSM) | Informer benchmarks | Best MSE on 14/16 tasks | $O(d+\ell)$ FFT-based convolution |
| W4S4 (Wavelet SSM) | Delay tasks, classification | 5.3× reduction in log-MSE vs. HiPPO | Stable diagonalization, truncation |
| SeRpEnt+Mamba | WikiText-103 | 1.2% ↑ Top-1 accuracy | Information-aware compression |

These results are robust across tasks such as long-context language modeling, sequential image recognition, time-series forecasting, speech/audio processing, video understanding, and object-centric learning.

5. Scalability, Efficiency, and Software

The linear or near-linear computational complexity of SSM layers (owing to convolutional unrolling, parallel scans, and fast frequency-domain algorithms) makes them particularly well suited to industrial-scale applications. Dedicated software frameworks, such as SSMProblems.jl and GeneralisedFilters.jl in the Julia/Turing.jl ecosystem, facilitate compositionality and modular modeling, with GPU-optimized implementations of both Kalman and particle filters for large-scale inference (Hargreaves et al., 29 May 2025).
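
As one illustration of the parallel-scan idea (not tied to any particular library or framework), the recurrence $x_k = a_k x_{k-1} + b_k$ with diagonal dynamics can be evaluated with an associative scan; the NumPy sketch below uses simple recursive doubling, whereas production implementations fuse the same associative operator into GPU/TPU kernels.

```python
import numpy as np

def associative_scan(a, b):
    """All prefix states x_1..x_L of x_k = a_k * x_{k-1} + b_k, with x_0 = 0."""
    a, b = a.astype(float), b.astype(float)
    L, shift = len(a), 1
    while shift < L:
        # Compose element k with element k - shift (earlier element applied first):
        # (a', b') then (a, b)  ->  (a * a', a * b' + b)
        new_b = a[shift:] * b[:-shift] + b[shift:]
        new_a = a[shift:] * a[:-shift]
        b[shift:], a[shift:] = new_b, new_a
        shift *= 2
    return b                                      # b_k now holds x_k

# Sanity check against the sequential recurrence.
rng = np.random.default_rng(0)
a, b = rng.uniform(0.5, 1.0, 16), rng.standard_normal(16)
x, seq = 0.0, []
for a_k, b_k in zip(a, b):
    x = a_k * x + b_k
    seq.append(x)
assert np.allclose(associative_scan(a, b), seq)
```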

On the hardware side, dedicated accelerators based on systolic array architectures have demonstrated remarkable speedups (e.g., EpochCore achieves ~2,000× improvement in inference latency on LRA datasets compared to GPU kernels) and significant energy efficiency gains for SSM-based models (Raja et al., 29 Jul 2025).

6. Applications, Limitations, and Future Directions

SSMs are widely applicable to:

  • Natural language processing: long-context modeling, document retrieval, abstractive summarization
  • Vision and video: long-horizon action recognition, video reasoning, object-centric segmentation
  • Audio/speech: raw audio generation, speech recognition, enhancement, separation
  • Time series and multi-modal sequence data: forecasting, nowcasting in economic indicators, medical and genomics sequence analysis, scientific data streams

While SSMs are less expressive than attention-based models at context-dependent manipulation of sequence content, ongoing work explores hybrid architectures and modular extensions to address this gap (Gupta et al., 2022, Patro et al., 24 Apr 2024).

Theoretical work increasingly provides generalization guarantees, data-dependent initialization schemes (e.g., scaling rules for robustness across temporal statistics), and novel regularization strategies (Liu et al., 4 May 2024). Mathematical advances in structured state initialization (e.g., via wavelet frames (Babaei et al., 9 Jun 2025)) and dynamic online-learning interpretations (e.g., Longhorn as amortized online regression (Liu et al., 19 Jul 2024)) continue to expand the flexibility and applicability of SSMs.

7. Conclusion

State Space Sequence Models synthesize ideas from dynamical systems theory, stochastic processes, and modern machine learning, achieving efficient and effective sequence modeling for long or complex data. The field is rapidly evolving, with current research focusing on enhancing modularity, expressivity, theoretical understanding, and scalable implementation for a widening range of scientific and industrial applications.
