
Sparse-Mamba: Efficient Sparse SSMs

Updated 23 February 2026
  • Sparse-Mamba is a suite of state-space model innovations that integrates adaptive sparsification, canonical low-rank parameterizations, and control-theoretic guarantees.
  • It leverages techniques such as spatio-temporal token selection and gradient-aware pruning to reduce computational complexity while maintaining robust performance.
  • Empirical results demonstrate substantial FLOPs reduction and preserved accuracy across NLP, time series, computer vision, and hyperspectral data analysis.

Sparse-Mamba (S-Mamba) encompasses a suite of architectural innovations, sparsification strategies, and resource efficiency techniques for state-space models (SSMs) built atop the Mamba framework. These methods integrate adaptive sparsity, canonical low-rank parameterizations, control-theoretic structure, and task-specific sparse token selection to achieve state-of-the-art performance across natural language processing, time series forecasting, computer vision, and hyperspectral data analysis, with reduced computational requirements and improved robustness.

1. Foundations of Sparse-Mamba: Model Classes and Architectures

Sparse-Mamba originated as an evolution of the Mamba SSM (“Linear-Time Sequence Modeling with Selective State Spaces”), which replaces attention-based mixing with learned continuous-time SSMs carrying O(L·n) complexity for a sequence of length L and state size n. The canonical S-Mamba variants introduce control-theoretic structural constraints and sparse parameterizations to the Mamba block (Hamdan et al., 2024), while other S-Mamba instantiations apply adaptive input-dependent or learned sparsification mechanisms at the token, channel, or window level (Yang et al., 21 Jan 2025, Shihab et al., 13 May 2025, Xu et al., 13 Apr 2025, Dewis et al., 29 Jul 2025, Liu et al., 31 Mar 2025).

Architectural variants include:

  • SC-Mamba: Controllable canonical form for A, B, C ensuring the controllability matrix is full-rank.
  • SO-Mamba: Observable canonical form, guaranteeing observability.
  • S-Mamba2 (ST-Mamba2): Stability-enforced diagonal parameterization for A.
  • Sparse-activation and deformable-sequencing Mamba: Modules that select sparse token subsets for SSM or Mamba mixing, used for event-based vision, HSI, and large-scale time series.

S-Mamba architectures consistently remove residual attention and MLP blocks, operating with strictly SSM-based mixing and, where relevant, sparsification imposed at the SSM block or token preselection stages (Hamdan et al., 2024, Yang et al., 21 Jan 2025).

2. Canonical Forms, Sparsity, and Control-Theoretic Guarantees

Structural sparsity in S-Mamba is achieved through canonical matrix forms. The controllable canonical form (used in SC-Mamba) for the n×n state matrix A is:

$$A = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 0 & 1 \\ -a_{n-1} & -a_{n-2} & \cdots & -a_1 & -a_0 \end{bmatrix}$$

Here, only the last row comprises trainable parameters $(a_0, \dots, a_{n-1})$; the remaining entries are fixed, yielding exactly $n$ learned parameters and $2n-1$ non-zeros per $A$ (Hamdan et al., 2024). The observable canonical form (SO-Mamba) shifts the sparsity to the last column. The input and output maps ($B$ and $C$) are chosen to enforce rank constraints for controllability or observability, making the structural property inherent to the architecture.
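The canonical construction above is easy to state concretely. A minimal NumPy sketch (the coefficient values in `a` are arbitrary placeholders, and taking $B = e_n$ is an illustrative choice, not necessarily the paper's exact parameterization) that also verifies the $2n-1$ non-zero count and the full-rank controllability matrix:

```python
import numpy as np

def controllable_canonical_A(coeffs):
    """Build the n x n controllable canonical state matrix from n
    trainable coefficients (a_0, ..., a_{n-1}); all other entries fixed."""
    n = len(coeffs)
    A = np.zeros((n, n))
    A[:-1, 1:] = np.eye(n - 1)             # fixed superdiagonal of ones
    A[-1, :] = -np.asarray(coeffs)[::-1]   # last row: -a_{n-1}, ..., -a_0
    return A

a = [2.0, 3.0, 5.0]                        # placeholder coefficients a_0..a_2
A = controllable_canonical_A(a)
assert np.count_nonzero(A) == 2 * len(a) - 1   # exactly 2n-1 non-zeros

# Controllability check with B = e_n (illustrative input map):
B = np.zeros((len(a), 1)); B[-1, 0] = 1.0
ctrb = np.hstack([np.linalg.matrix_power(A, k) @ B for k in range(len(a))])
assert np.linalg.matrix_rank(ctrb) == len(a)   # full rank: controllable
```

Because the superdiagonal and zeros are fixed, only the last row participates in training, which is exactly where the parameter savings in the table below come from.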

Stability is enforced (ST-Mamba2) by clamping any non-negative diagonal entry of $A$ to a small negative constant, guaranteeing that all eigenvalues of the discretized system lie strictly inside the unit circle (Hamdan et al., 2024).
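A minimal sketch of the clamping step, assuming a diagonal $A$ and a zero-order-hold-style discretization $\exp(\Delta t \cdot a)$; the clamp constant and step size `dt` are placeholder values:

```python
import numpy as np

def clamp_stable(a_diag, eps=-1e-4):
    """Clamp any non-negative diagonal entry of A to a small negative
    constant so the continuous-time system is strictly stable."""
    a = np.asarray(a_diag, dtype=float).copy()
    a[a >= 0] = eps
    return a

a = clamp_stable([-0.5, 0.0, 0.3])         # 0.0 and 0.3 get clamped
dt = 0.1                                   # placeholder step size
eig_discrete = np.exp(dt * a)              # discretized diagonal eigenvalues
assert np.all(eig_discrete < 1.0)          # strictly inside the unit circle
```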

Sparsity statistics:

| Model | Parameters |
|----------|------------|
| Mamba | 64,475,648 |
| SO-Mamba | 64,352,904 |
| SC-Mamba | 64,344,840 |

Parameter reduction is realized by exploiting canonical sparsity while maintaining empirical performance or improving it.

3. Adaptive Input-Level and Layerwise Sparsification

Modern S-Mamba architectures employ adaptive, input-driven or learnable sparsification mechanisms as follows:

  • Spatio-Temporal Continuity Assessment (STCA) (Yang et al., 21 Jan 2025): Computes the information content of spatial-temporal tokens from events in NEU datasets, applying a Gaussian-weighted continuity score, and uses an adaptive threshold to generate a binary keep/discard mask D. Only tokens with sufficient event information are forwarded to subsequent stages.
  • Information-Prioritized Local Scan (IPL-Scan) (Yang et al., 21 Jan 2025): Within each spatial window, tokens are reordered so those scoring highest on Sˢᵗ (spatiotemporal continuity) interact first during the SSM scan, enhancing the propagation of salient information.
  • Sparse Deformable Sequencing (SDS / SDMS) (Xu et al., 13 Apr 2025, Dewis et al., 29 Jul 2025): In image or time-series inputs, tokens receive a cosine-similarity–based or attention-based relevance score, and only the top-λ fraction is processed by Mamba blocks, with the rest omitted. This can be done independently for spatial, spectral, and temporal tokens.
  • Sparse Spatial Activation (Liu et al., 31 Mar 2025): Visual Mamba with a sparsity module selects top-scoring spatial windows (via the $L_2$ norm of patches) per layer, masking the remainder and running SSM blocks only on the retained set.

Efficiency arises because these methods reduce the effective token count per layer from $N$ to $rN$ with $r \ll 1$, substantially decreasing computation and memory without loss of global context, thanks to linear-time mixing (Yang et al., 21 Jan 2025, Xu et al., 13 Apr 2025, Dewis et al., 29 Jul 2025).
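The top-fraction selection shared by these mechanisms can be sketched in a few lines. This toy version scores tokens by their $L_2$ norm (one of the several scoring choices the papers describe; cosine-similarity and attention-based scores work analogously) and keeps the top-λ fraction:

```python
import numpy as np

def sparse_token_select(tokens, lam=0.25):
    """Keep only the top-lam fraction of tokens by a relevance score.
    Toy scoring: L2 norm of each token vector."""
    N = tokens.shape[0]
    k = max(1, int(lam * N))
    scores = np.linalg.norm(tokens, axis=1)
    keep = np.sort(np.argsort(scores)[-k:])  # top-k indices, original order
    return tokens[keep], keep

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))                # N=64 tokens, d=16 each
kept, idx = sparse_token_select(x, lam=0.25)
assert kept.shape == (16, 16)                # rN tokens forwarded, r = 0.25
```

Only the retained `kept` tokens would be passed to the downstream Mamba blocks; the discarded ones are simply omitted from the scan.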

4. Weight Pruning and Resource Optimization

Sparse-Mamba admits unstructured parameter pruning for further resource savings, targeting deployment in constrained environments. The pruning procedure is as follows (Shihab et al., 13 May 2025):

  • Gradient-Aware Magnitude Scoring: Each parameter’s importance $S(w_i)$ is computed as the product of its magnitude and the corresponding gradient magnitude raised to an exponent $\alpha$:

$$S(w_i) = |w_i| \cdot \left|\frac{\partial \mathcal{L}}{\partial w_i}\right|^{\alpha}$$

$\alpha \approx 1$ is optimal for language modeling; smaller values yield greater accuracy degradation.

  • Iterative Cubic Pruning Schedule: Sparsity is increased progressively per iteration $t$,

$$s_t = s_f + (s_0 - s_f)\left(1 - \frac{t - t_0}{T - t_0}\right)^3$$

enabling model stability and finer capacity control, starting after 25% of training steps.

  • Global Thresholding: A single threshold $\tau_t$ is applied across all layers such that the total number of parameters with $S(w_i) \leq \tau_t$ matches the target sparsity $s_t$, automatically allocating higher densities to SSM-critical blocks than to linear projections.
  • Stability Preservation: For every SSM-block matrix $A$, eigenvalues are controlled so that $|\lambda| < 1$, rolling back pruning in the rare event this is violated.

Under this regime, up to 70% sparsity is achieved with <10% perplexity increase, and performance retention above 95% on WikiText-103, Long Range Arena, and ETT time-series (Shihab et al., 13 May 2025).
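The score, schedule, and global threshold above can be combined in a short sketch. The layer shapes, schedule values, and random "gradients" here are illustrative stand-ins, not values from the paper:

```python
import numpy as np

def importance(w, grad, alpha=1.0):
    """Gradient-aware score S(w_i) = |w_i| * |dL/dw_i|^alpha."""
    return np.abs(w) * np.abs(grad) ** alpha

def cubic_sparsity(t, t0, T, s0=0.0, sf=0.7):
    """Cubic schedule ramping sparsity from s0 to sf between steps t0 and T."""
    return sf + (s0 - sf) * (1 - (t - t0) / (T - t0)) ** 3

def global_mask(scores, sparsity):
    """One threshold across all layers: prune the lowest-scoring fraction."""
    flat = np.concatenate([s.ravel() for s in scores])
    tau = np.quantile(flat, sparsity)
    return [s > tau for s in scores]

rng = np.random.default_rng(1)
w = [rng.normal(size=100), rng.normal(size=50)]   # two toy "layers"
g = [rng.normal(size=100), rng.normal(size=50)]   # stand-in gradients
s_t = cubic_sparsity(t=500, t0=250, T=1000)       # schedule partway through
scores = [importance(wi, gi) for wi, gi in zip(w, g)]
masks = global_mask(scores, s_t)
kept = sum(m.sum() for m in masks) / 150
assert abs(kept - (1 - s_t)) < 0.05   # retained fraction tracks the schedule
```

Because the threshold is shared globally, layers whose parameters score highly (e.g. SSM-critical blocks) automatically retain more density than low-scoring linear projections.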

5. Specialized Modules for Channel and Multidimensional Mixing

Sparse-Mamba adopts modules for efficient cross-channel and cross-mode interactions:

  • Global Channel Interaction (GCI) (Yang et al., 21 Jan 2025): Aggregates spatially sparse features via a bidirectional S6 scan over channels, plus a $1 \times 1$ convolution for local-global channel fusion, at $O(C \cdot H \cdot W)$ complexity.
  • Multi-branch sparse Mamba: Applied in hyperspectral and MODIS data classification, using SDMS to process sparse, deformably-ordered tokens in the spectral, spatial, and temporal domains through dedicated Mamba blocks before attention-based or MLP fusion (Xu et al., 13 Apr 2025, Dewis et al., 29 Jul 2025).

Each module preserves global connectivity but incurs substantially reduced FLOPs by focusing channel and token mixing only on structurally or information-theoretically salient subsets.
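The bidirectional channel-scan idea can be illustrated with a toy stand-in, substituting a fixed decaying linear recurrence for the learned, input-dependent S6 dynamics (decay value and shapes are placeholders):

```python
import numpy as np

def bidirectional_channel_scan(x, decay=0.9):
    """Toy stand-in for a bidirectional scan over the channel axis:
    forward and backward linear recurrences h_c = decay*h_{c±1} + x_c,
    summed, so every channel sees every other in O(C*H*W) time."""
    C = x.shape[0]
    fwd = np.zeros_like(x)
    bwd = np.zeros_like(x)
    for c in range(C):
        fwd[c] = x[c] + (decay * fwd[c - 1] if c > 0 else 0)
    for c in reversed(range(C)):
        bwd[c] = x[c] + (decay * bwd[c + 1] if c < C - 1 else 0)
    return fwd + bwd - x                 # subtract x to avoid double-counting

x = np.ones((4, 2, 2))                   # C=4 channels over a 2x2 spatial grid
y = bidirectional_channel_scan(x)
assert y.shape == x.shape
```

A single pass in each direction visits each channel once per spatial location, which is where the linear $O(C \cdot H \cdot W)$ cost comes from; the real module replaces the fixed decay with selective SSM parameters and adds the $1 \times 1$ fusion convolution.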

6. Empirical Performance and Efficiency

S-Mamba models consistently match or exceed the accuracy of dense Mamba or transformer-based approaches while achieving pronounced computational savings:

  • On Gen1, 1Mpx, and eTram event data, S-Mamba achieves mAP of 50.4%, 49.3%, and 32.6% with 20–31% FLOPs reduction compared to SSM baselines (Yang et al., 21 Jan 2025).
  • On hyperspectral data (Indian Pines, Pavia University), SDMamba reaches OA/AA/Kappa scores above 99.1% with ~60% reduced FLOPs (Xu et al., 13 Apr 2025).
  • In MODIS time-series, STSMamba improves OA on out-of-domain Alberta data by 7% relative over MambaHSI and halves parameter count and runtime (Dewis et al., 29 Jul 2025).
  • In micro-expression recognition, Sparse Mamba yields improvements of up to 10 points in both F1 and UAR over non-sparse SSM vision models while reducing runtime by ~28%; combining the sparse and motion-magnification modules confers the greatest benefit (Liu et al., 31 Mar 2025).
  • In language modeling and time series, gradient-aware pruning allows 50–70% parameter reduction at <2% average drop in perplexity or MSE, frequently with improved robustness to input perturbation (Shihab et al., 13 May 2025).

Representative empirical results for event-based detection:

| Method | Gen1 mAP | 1Mpx mAP | FLOPs (Gen1/1Mpx) | Parameters | Runtime (Gen1) |
|--------|----------|----------|-------------------|------------|----------------|
| SAST-CB | 48.2 | 48.7 | 2.4G / 6.4G | 18.9M | 22.7 ms |
| SSM+RNN baseline | 50.0 | 48.8 | 3.1G / 9.5G | 16.1–16.7M | 25.2 ms |
| S-Mamba | 50.4 | 49.3 | 2.4G / 7.4G | 16.1–16.7M | 24.0 ms |

(Yang et al., 21 Jan 2025)

7. Interpretability, Robustness, and Theoretical Implications

S-Mamba’s explicit imposition of controllability, observability, and stability guarantees (via canonical matrix forms and spectral clamping) yields interpretable and robust sequence models. Experiments indicate:

  • SSM weights are not uniformly important; 20% of weights account for ~80% of importance as measured by the gradient-aware score (Shihab et al., 13 May 2025).
  • Pruned S-Mamba models exhibit increased robustness to adversarial or noisy input perturbations, attributed to the regularization effect of gradient-aware sparsity selection.
  • Ablations confirm the necessity of non-uniform (global) importance thresholding and controlled pruning rates for optimal accuracy preservation (Shihab et al., 13 May 2025).

Adopting control-theoretic designs obviates the need for runtime regularizers or heuristic stabilization, potentially simplifying scaling to larger or more structured SSM families ("Mamba3") (Hamdan et al., 2024).
