
SparseSSM: Sparse State Space Models

Updated 4 February 2026
  • SparseSSM is a class of structured state space models that leverages sparsity in parameters and transitions to enhance efficiency, controllability, and interpretability.
  • It utilizes diverse methodologies including control-theoretic parameterization, training-free pruning, and Bayesian estimation to reduce computational complexity while maintaining accuracy.
  • Empirical results highlight improvements such as reduced training time, inference speedups of up to 80%, and superior performance in language modeling and dynamical-systems benchmarks.

SparseSSM refers to a class of structured state space models (SSMs) and associated methodologies that explicitly encode or discover sparsity in the state transition structure. By enforcing or leveraging sparsity in either the model parameters, hidden state recurrences, transition graph, or computational paths, SparseSSMs achieve regimes of improved efficiency, interpretability, or expressivity relative to dense SSMs. The literature encompasses control-theoretic parameterizations for deep sequence models, token- or weight-pruning approaches for efficient inference, Bayesian and proximal identification algorithms for sparse dynamical systems, structured transition matrix designs for automata emulation, and biologically inspired sparse spiking models.

1. Canonical Sparse Parameterization: Control-Theoretic Forms

SparseSSM architectures rooted in control theory employ canonical forms that guarantee minimal, structured parameterization of the state transition matrix while ensuring critical system properties.

In the Sparse-Mamba (S-Mamba) family, the state transition matrix $A \in \mathbb{R}^{n \times n}$ is parameterized using only $n$ free parameters in companion or transpose-companion (observable canonical) form:

$$A = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ -a_0 & -a_1 & -a_2 & \cdots & -a_{n-1} \end{bmatrix},$$

where $\{a_0, \ldots, a_{n-1}\}$ are the only learnable parameters. This structure ensures $O(n)$ nonzero entries, provable controllability for SSMs with $B = [0, \ldots, 0, 1]^T$, and full observability in the transpose form with $C = [0, \ldots, 0, 1]$ (Hamdan et al., 2024).
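
A minimal sketch of this companion-form parameterization, following the matrix shown above; the function name and the NumPy construction are illustrative, not the paper's implementation:

```python
import numpy as np

def companion_matrix(a: np.ndarray) -> np.ndarray:
    """Build the n x n companion-form transition matrix from n free parameters.

    The super-diagonal carries ones (a pure shift), and the last row holds the
    negated coefficients -a_0, ..., -a_{n-1}, so only O(n) entries are nonzero.
    """
    n = a.shape[0]
    A = np.zeros((n, n))
    A[np.arange(n - 1), np.arange(1, n)] = 1.0   # super-diagonal of ones
    A[-1, :] = -a                                # last row: -a_0 ... -a_{n-1}
    return A

# Example: n = 4 learnable parameters fully determine a 4 x 4 transition matrix.
a = np.array([0.5, -0.2, 0.1, 0.3])
A = companion_matrix(a)
B = np.zeros((4, 1)); B[-1, 0] = 1.0             # B = [0, ..., 0, 1]^T as in the text
```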

In Stable-Mamba2 (ST-Mamba2), $A$ is diagonal with all entries negative or slightly negative (any entry $a_i \geq 0$ is clamped to $-1 \times 10^{-5}$), ensuring asymptotic stability of the state sequence.
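
A hedged illustration of that stability clamp; the exact ST-Mamba2 parameterization may differ, and the threshold simply mirrors the $-1 \times 10^{-5}$ value quoted above:

```python
import torch

def stable_diagonal(a_raw: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Clamp raw diagonal entries so every eigenvalue of A is strictly negative.

    Entries that would be >= 0 are pushed to -eps, so the continuous-time
    dynamics dx_i/dt = a_i x_i + ... decay rather than grow.
    """
    return torch.clamp(a_raw, max=-eps)

a_raw = torch.tensor([0.3, -0.7, 0.0, -2.1])
a_stable = stable_diagonal(a_raw)   # entries >= 0 become -1e-5; negative entries pass through
```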

This class of SparseSSMs delivers:

  • Guaranteed controllability, observability, and stability by construction.
  • Linear parameterization and computational complexity in state dimension $n$ and sequence length $T$ (i.e., $O(Tn)$, as opposed to $O(Tn^2)$ for dense $A$; see the sketch after this list).
  • Consistent empirical improvements in perplexity (up to $5\%$) and training-time reductions ($3\%$) relative to dense baselines (Hamdan et al., 2024).
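
The $O(Tn)$ claim follows because a companion matrix never needs to be materialized for the recurrence: applying $A$ is a shift plus a single inner product. A minimal sketch with illustrative names:

```python
import numpy as np

def companion_scan(a: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Run x_{t+1} = A x_t + B u_t with companion A and B = e_n in O(T n) time.

    A x_t is computed without forming A: the first n-1 output entries are a
    shift of x_t, and the last entry is the inner product -a . x_t.
    """
    n, T = a.shape[0], u.shape[0]
    x = np.zeros(n)
    xs = np.empty((T, n))
    for t in range(T):
        shifted = np.empty(n)
        shifted[:-1] = x[1:]          # rows 0..n-2 of A select x_1..x_{n-1}
        shifted[-1] = -a @ x          # last row: -(a_0 x_0 + ... + a_{n-1} x_{n-1})
        x = shifted
        x[-1] += u[t]                 # B = [0, ..., 0, 1]^T injects the input
        xs[t] = x
    return xs

states = companion_scan(np.array([0.1, -0.2, 0.05]), np.random.randn(16))
```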

2. Training-Free Sparse Pruning: OBS-Derived and Token Pruning Methods

SparseSSM also designates one-shot, training-free pruning algorithms for deep SSM-based networks. The SparseSSM framework (Tuo et al., 11 Jun 2025) develops a saliency-based pruning protocol derived from the Optimal Brain Surgeon (OBS) perspective.

The main elements are:

  • Computation of a diagonal saliency score for each SSM weight, based on the local diagonal Hessian of the output loss, yielding $I_{d,n}^{\log} \propto (A^{\log}_{d,n})^2 \cdot \sum_{b,i} h_{b,i-1,d,n}^2$, with the sum taken over all mini-batches and time steps (see the sketch after this list).
  • Layer-wise pruning by aggregating importance across time and selecting weights for removal based on saliency frequency (Algorithm 1 in (Tuo et al., 11 Jun 2025)).
  • Sensitivity-adjusted pruning budgets for FFN blocks using Hessian-trace information, ensuring critical layers are pruned less aggressively.
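
A schematic sketch of the diagonal-Hessian saliency score and mask selection described above; the tensor shapes, the global-percentile threshold, and all names are illustrative assumptions rather than the exact SparseSSM procedure:

```python
import torch

def ssm_saliency(A_log: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
    """Diagonal-Hessian saliency: (A_log)^2 times the sum of squared hidden states.

    A_log:  (D, N)        log-parameterized SSM transition weights
    hidden: (B, T, D, N)  hidden states h_{b, t-1, d, n} from calibration data
    """
    return A_log.pow(2) * hidden.pow(2).sum(dim=(0, 1))

def prune_mask(saliency: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Keep the top-(1 - sparsity) fraction of weights by saliency (unstructured)."""
    k = int(saliency.numel() * sparsity)
    threshold = saliency.flatten().kthvalue(k).values
    return saliency > threshold

# Toy usage on random tensors with assumed shapes.
A_log = torch.randn(8, 16)
hidden = torch.randn(4, 32, 8, 16)
mask = prune_mask(ssm_saliency(A_log, hidden), sparsity=0.5)
A_log_pruned = A_log * mask
```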

Pruning 50% of SSM weights in Mamba models (at scales up to 1.4B parameters) yields no degradation in zero-shot accuracy, outperforming magnitude- and SparseGPT-based methods. Structured (column) pruning accelerates inference by $1.72\times$ without loss in language modeling or zero-shot task performance. Analysis reveals that redundancy is highly concentrated in a small subset of channels (Tuo et al., 11 Jun 2025).

The Simba methodology extends sparsification to the token dimension, utilizing a closed-form measure of token influence in sequence models. Upper layers are aggressively pruned, forming a hierarchical "highway" that maintains global information flow while reducing redundant computation (2505.20698). Simba achieves up to $80\%$ real speedup and improved accuracy under matched FLOPs budgets.
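
Simba's closed-form influence measure is not reproduced here; the sketch below substitutes a simple norm-based proxy and keeps the top-$k$ tokens per layer, only to make the depth-wise "highway" idea concrete:

```python
import torch

def prune_tokens(x: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the highest-influence tokens in a (B, T, D) sequence.

    Influence is approximated here by each token's hidden-state L2 norm;
    Simba instead derives a closed-form score from the SSM recurrence.
    """
    B, T, D = x.shape
    k = max(1, int(T * keep_ratio))
    scores = x.norm(dim=-1)                                  # (B, T) proxy influence
    idx = scores.topk(k, dim=1).indices.sort(dim=1).values   # preserve temporal order
    return torch.gather(x, 1, idx.unsqueeze(-1).expand(B, k, D))

# Upper layers prune more aggressively, forming a hierarchical token "highway".
x = torch.randn(2, 128, 64)
for keep_ratio in [1.0, 0.75, 0.5, 0.25]:   # illustrative per-layer schedule
    x = prune_tokens(x, keep_ratio)
```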

3. Graph-Sparse and Bayesian SSM Estimation

For time-series parameter estimation, Bayesian and optimization-based SparseSSM methodologies incorporate explicit sparsity constraints or priors on the transition structure.

The SpaRJ framework (Cox et al., 2023) leverages a spike-and-slab prior for the transition matrix $A$ in linear-Gaussian SSMs. The model augments parameter sampling with binary indicators $z_{ij} = 1$ if $A_{ij} \neq 0$, and deploys reversible-jump MCMC across model spaces to explore different sparsity patterns (a schematic sketch follows the list below). The approach yields:

  • Ergodic posterior consistency under standard conditions,
  • Recovery of sparse, interpretable graph structures with uncertainty quantification,
  • Empirical superiority in RMSE and F1-score relative to EM-based and graphical approaches for high-dimensional systems.
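
A heavily simplified schematic of the spike-and-slab exploration: one birth/death move flips a single indicator and is accepted with a Metropolis-style ratio. The fully observed Gaussian likelihood, the omission of the proposal-density correction, and all hyperparameters are assumptions for brevity; SpaRJ's actual reversible-jump scheme targets the partially observed case via Kalman filtering.

```python
import numpy as np

rng = np.random.default_rng(0)

def loglik(A: np.ndarray, X: np.ndarray, q: float = 0.1) -> float:
    """Gaussian log-likelihood of x_{t+1} = A x_t + N(0, q I), states fully observed."""
    resid = X[1:] - X[:-1] @ A.T
    return -0.5 * np.sum(resid**2) / q

def rj_step(A, z, X, slab_sd=0.5, prior_on=0.2):
    """One birth/death move: flip one indicator z_ij and accept/reject the new model."""
    i, j = rng.integers(A.shape[0]), rng.integers(A.shape[1])
    A_new, z_new = A.copy(), z.copy()
    z_new[i, j] = 1 - z[i, j]
    A_new[i, j] = rng.normal(0.0, slab_sd) if z_new[i, j] else 0.0  # birth / death
    # Bernoulli prior on z; the slab proposal-density correction is omitted here.
    log_prior_ratio = (np.log(prior_on) - np.log(1 - prior_on)) * (1 if z_new[i, j] else -1)
    log_alpha = loglik(A_new, X) - loglik(A, X) + log_prior_ratio
    return (A_new, z_new) if np.log(rng.uniform()) < log_alpha else (A, z)

# Toy data from a sparse 3-state system, then a short chain over sparsity patterns.
A_true = np.array([[0.9, 0.0, 0.0], [0.5, 0.8, 0.0], [0.0, 0.0, 0.7]])
X = np.zeros((200, 3))
for t in range(199):
    X[t + 1] = A_true @ X[t] + rng.normal(0, 0.1, 3)
A, z = np.zeros((3, 3)), np.zeros((3, 3), dtype=int)
for _ in range(2000):
    A, z = rj_step(A, z, X)
```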

For nonlinear systems, GraphGrad (Cox et al., 2024) combines differentiable particle filtering with $\ell_1$-sparse polynomial parameterization of the SSM transition function. By integrating stochastic proximal-gradient updates and a degree-elimination mechanism, GraphGrad consistently recovers the true interaction structure in benchmark systems (e.g., Lorenz-63/96) and scales to thousands of unknowns.
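
A minimal sketch of an $\ell_1$ proximal (soft-thresholding) update on polynomial transition coefficients, in the spirit of GraphGrad; the feature library, step sizes, and the one-step prediction loss standing in for the differentiable-particle-filter likelihood are all illustrative assumptions:

```python
import numpy as np

def poly_features(x: np.ndarray) -> np.ndarray:
    """Degree-2 polynomial library [1, x_i, x_i x_j] for a state vector x."""
    n = x.shape[0]
    cross = np.array([x[i] * x[j] for i in range(n) for j in range(i, n)])
    return np.concatenate(([1.0], x, cross))

def soft_threshold(W: np.ndarray, lam: float) -> np.ndarray:
    """Proximal operator of lam * ||W||_1."""
    return np.sign(W) * np.maximum(np.abs(W) - lam, 0.0)

def proximal_gradient_step(W, X, lr=1e-3, lam=1e-2):
    """One proximal-gradient update for the model x_{t+1} ~ W phi(x_t)."""
    Phi = np.array([poly_features(x) for x in X[:-1]])   # (T-1, F)
    resid = X[1:] - Phi @ W.T                            # (T-1, n)
    grad = -resid.T @ Phi / len(Phi)                     # gradient of 0.5 * MSE
    return soft_threshold(W - lr * grad, lr * lam)

# Toy usage: coefficients shrink toward a sparse interaction structure.
X = np.random.randn(500, 3)
W = np.zeros((3, len(poly_features(X[0]))))
for _ in range(200):
    W = proximal_gradient_step(W, X)
```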

4. Structured Sparse Transition Matrices for Expressivity

A distinct class of SparseSSM models achieves both expressivity and efficiency by exploiting structured sparsity in the transition matrix.

The PD-SSM architecture (Terzić et al., 26 Sep 2025) replaces a dense $A(u_t)$ by a product $P(u_t) D(u_t)$, where $P$ is a column one-hot selection matrix and $D$ is diagonal (possibly complex); a minimal recurrence sketch follows the list below. PD-SSM attains:

  • Parallel scan complexity $O(NL)$ (matching diagonal SSMs),
  • Universal emulation of any $N$-state finite-state automaton (FSA) with an $N$-dimensional state and a linear $N \times N$ readout,
  • BIBO (bounded-input, bounded-output) stability under mild parameter constraints.
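
A minimal sketch of the $P(u_t) D(u_t)$ recurrence, assuming a real-valued $D$, a hard argmax for the column one-hot $P$, and random per-step parameters standing in for input-conditioned networks; the actual PD-SSM parameterization and its parallel scan are more involved:

```python
import torch

def pd_ssm_step(x: torch.Tensor, logits_p: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """One recurrence step x_t = P(u_t) D(u_t) x_{t-1} with structured A.

    logits_p: (N, N) column scores; each column of P gets a single 1 (one-hot).
    d:        (N,)   diagonal of D.
    Since P is column-one-hot and D diagonal, the product costs O(N), not O(N^2).
    """
    rows = logits_p.argmax(dim=0)          # chosen row index for each column j
    # (P D x)_i = sum_j P[i, j] * d[j] * x[j]  -> scatter-add d[j] * x[j] into row[j]
    out = torch.zeros_like(x)
    out.index_add_(0, rows, d * x)
    return out

# Toy usage: an input-dependent, permutation-like transition over an N = 5 state.
N = 5
x = torch.randn(N)
for _ in range(8):                          # unrolled sequential scan for clarity
    logits_p = torch.randn(N, N)            # would come from a network over u_t
    d = torch.tanh(torch.randn(N))          # bounded diagonal keeps the state bounded
    x = pd_ssm_step(x, logits_p, d) + 0.1 * torch.randn(N)   # plus an input term
```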

Empirically, PD-SSM establishes new generalization records on FSA state-tracking and multivariate time-series benchmarks, and integrates seamlessly with hybrid Transformer-SSM architectures.

5. Sparse Spiking State Space Models

SparseSSMs extend to energy-efficient, biologically inspired sequence models through the use of spiking neurons, imposing event-driven sparsity at the activation or routing level.

SpikingSSM (Shen et al., 2024) couples standard SSM recurrences with leaky integrate-and-fire (LIF) nonlinearity. To circumvent the inherent sequential nature of update and spike generation, a surrogate dynamic network (SDN) predicts spike times in parallel, admitting $O(T)$ GPU-efficient training. Across Long Range Arena (LRA) and WikiText-103, SpikingSSM surpasses prior spiking LLMs in perplexity and maintains $\sim 90\%$ sparsity.
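
A minimal LIF layer applied to SSM pre-activations, to make the event-driven sparsity concrete; the surrogate dynamic network that SpikingSSM uses to parallelize spike generation is not reproduced, and all constants are illustrative:

```python
import torch

def lif_spikes(currents: torch.Tensor, tau: float = 0.9, v_th: float = 1.0) -> torch.Tensor:
    """Sequential leaky integrate-and-fire over pre-activations of shape (T, D).

    The membrane potential leaks with factor tau, integrates the input current,
    emits a binary spike when it crosses v_th, and is reset by subtraction.
    """
    T, D = currents.shape
    v = torch.zeros(D)
    spikes = torch.zeros(T, D)
    for t in range(T):
        v = tau * v + currents[t]
        spike = (v >= v_th).float()
        spikes[t] = spike
        v = v - spike * v_th               # soft reset
    return spikes

out = lif_spikes(torch.randn(64, 32))
sparsity = 1.0 - out.mean().item()         # fraction of zero (non-spiking) activations
```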

SPikE-SSM (Zhong et al., 2024) introduces an FFT-based boundary compression (PMBC) algorithm to further accelerate spiking SSM inference. By integrating a refractory mechanism and trainable thresholds into the neuron model, SPikE-SSM achieves firing rates as low as $3$–$12\%$ on LRA with accuracy comparable to dense SSMs, and attains $20\times$ lower theoretical energy costs on WikiText-103.

6. Sparse Model Reduction and Data-Driven SSMs

In dynamical systems applications, SparseSSM methodologies focus on exact or approximate model reduction exploiting spectral structure and data-driven sparsity.

Axås–Cenedese–Haller's fastSSM algorithms (Axås et al., 2022) automatically reduce a high-dimensional nonlinear system to an invariant slow spectral submanifold (SSM), fitting the lowest-dimensional sparse polynomial normal form compatible with the data. The approach couples tangent-plane SVD, explicit polynomial regression, and (optionally) $\ell_1$-penalized sparse identification. Empirical benchmarks report a $400\times$ speedup over implicit optimization approaches and sub-$2\%$ model error on various mechanical and experimental testbeds.
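
A schematic of a related pipeline: project trajectory data onto a low-dimensional tangent plane via SVD and fit sparse polynomial reduced dynamics with a sequentially thresholded regression. The specific normal-form construction in fastSSM is not reproduced; the monomial library, finite-difference derivatives, and thresholds are illustrative assumptions.

```python
import numpy as np

def fit_sparse_reduced_model(Y: np.ndarray, r: int = 2, degree: int = 3, thresh: float = 0.05):
    """Reduce (T, D) trajectory data Y to r coordinates and fit sparse polynomial dynamics.

    1. Tangent-plane reduction via the r leading right singular vectors of the data.
    2. Polynomial regression of the reduced velocity (finite differences here).
    3. Sequential thresholding to drop small coefficients (sparse identification).
    """
    Yc = Y - Y.mean(0)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    q = Yc @ Vt[:r].T                                   # reduced coordinates (T, r)
    dq = np.gradient(q, axis=0)                         # crude time derivative
    # Monomial library up to the requested degree (univariate powers only, for brevity).
    Phi = np.concatenate([q**k for k in range(1, degree + 1)], axis=1)
    W, *_ = np.linalg.lstsq(Phi, dq, rcond=None)
    for _ in range(5):                                  # sequentially thresholded least squares
        W[np.abs(W) < thresh] = 0.0
        for j in range(r):
            active = np.abs(W[:, j]) > 0
            if active.any():
                W[active, j], *_ = np.linalg.lstsq(Phi[:, active], dq[:, j], rcond=None)
    return W, Vt[:r]

W, basis = fit_sparse_reduced_model(np.random.randn(400, 10))
```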

7. Outlook and Extensions

SparseSSM research continues to expand across both theoretical and applied dimensions. Key open directions include:

  • Unifying controllable, observable, and stable SSM constructions in large-scale pretraining (as in Mamba3).
  • Extending sparse parameterizations to multi-input, multi-output, or time-varying SSMs.
  • Bridging between discrete-time, continuous-time, and hybrid models with provable Lyapunov stability.
  • Exploring structured sparsity beyond token and weight dimensions, including group-wise, block-structured, and graph-based schemes.

Through diverse technical mechanisms—canonical parameterization, pruning, Bayesian identification, algorithmic structuring, and bio-inspired quantization—SparseSSM defines a paradigm for efficient, interpretable, and highly scalable sequence modeling and dynamical system analysis (Hamdan et al., 2024, Tuo et al., 11 Jun 2025, 2505.20698, Cox et al., 2023, Cox et al., 2024, Terzić et al., 26 Sep 2025, Zhong et al., 2024, Shen et al., 2024, Axås et al., 2022).
