
Sparse-Mamba: Efficient Sparse SSMs

Updated 23 February 2026
  • Sparse-Mamba is a suite of state-space model innovations that integrates adaptive sparsification, canonical low-rank parameterizations, and control-theoretic guarantees.
  • It leverages techniques such as spatio-temporal token selection and gradient-aware pruning to reduce computational complexity while maintaining robust performance.
  • Empirical results demonstrate substantial FLOPs reduction and preserved accuracy across NLP, time series, computer vision, and hyperspectral data analysis.

Sparse-Mamba (S-Mamba) encompasses a suite of architectural innovations, sparsification strategies, and resource efficiency techniques for state-space models (SSMs) built atop the Mamba framework. These methods integrate adaptive sparsity, canonical low-rank parameterizations, control-theoretic structure, and task-specific sparse token selection to achieve state-of-the-art performance across natural language processing, time series forecasting, computer vision, and hyperspectral data analysis, with reduced computational requirements and improved robustness.

1. Foundations of Sparse-Mamba: Model Classes and Architectures

Sparse-Mamba originated as an evolution of the Mamba SSM (“Linear-Time Sequence Modeling with Selective State Spaces”), which replaces attention-based mixing with learned continuous-time SSMs carrying O(L·n) complexity for a sequence of length L and state size n. The canonical S-Mamba variants introduce control-theoretic structural constraints and sparse parameterizations to the Mamba block (Hamdan et al., 2024), while other S-Mamba instantiations apply adaptive input-dependent or learned sparsification mechanisms at the token, channel, or window level (Yang et al., 21 Jan 2025, Shihab et al., 13 May 2025, Xu et al., 13 Apr 2025, Dewis et al., 29 Jul 2025, Liu et al., 31 Mar 2025).

Architectural variants include:

  • SC-Mamba: Controllable canonical form for A, B, C ensuring the controllability matrix is full-rank.
  • SO-Mamba: Observable canonical form, guaranteeing observability.
  • S-Mamba2 (ST-Mamba2): Stability-enforced diagonal parameterization for A.
  • Sparse-activation and deformable-sequencing Mamba: Modules that select sparse token subsets for SSM or Mamba mixing, used for event-based vision, HSI, and large-scale time series.

S-Mamba architectures consistently remove residual attention and MLP blocks, operating with strictly SSM-based mixing and, where relevant, sparsification imposed at the SSM block or token preselection stages (Hamdan et al., 2024, Yang et al., 21 Jan 2025).

2. Canonical Forms, Sparsity, and Control-Theoretic Guarantees

Structural sparsity in S-Mamba is achieved through canonical matrix forms. The controllable canonical form (used in SC-Mamba) for the n×n state matrix A is:

$$A = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 0 & 1 \\ -a_{n-1} & -a_{n-2} & \cdots & -a_1 & -a_0 \end{bmatrix}$$

Here, only the last row comprises trainable parameters $(a_0, \dots, a_{n-1})$; the remaining entries are fixed, yielding exactly $n$ learned parameters and $2n-1$ non-zeros per $A$ (Hamdan et al., 2024). The observable canonical form (SO-Mamba) shifts the sparsity to the last column. The input and output maps ($B$ and $C$) are chosen to enforce rank constraints for controllability or observability, making the structural property inherent to the architecture.
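The canonical construction above is easy to state concretely. A minimal NumPy sketch (the coefficient values in `a` are arbitrary placeholders, and taking $B = e_n$ is an illustrative choice, not necessarily the paper's exact parameterization) that also verifies the $2n-1$ non-zero count and the full-rank controllability matrix:

```python
import numpy as np

def controllable_canonical_A(coeffs):
    """Build the n x n controllable canonical state matrix from n
    trainable coefficients (a_0, ..., a_{n-1}); all other entries fixed."""
    n = len(coeffs)
    A = np.zeros((n, n))
    A[:-1, 1:] = np.eye(n - 1)             # fixed superdiagonal of ones
    A[-1, :] = -np.asarray(coeffs)[::-1]   # last row: -a_{n-1}, ..., -a_0
    return A

a = [2.0, 3.0, 5.0]                        # placeholder coefficients a_0..a_2
A = controllable_canonical_A(a)
assert np.count_nonzero(A) == 2 * len(a) - 1   # exactly 2n-1 non-zeros

# Controllability check with B = e_n (illustrative input map):
B = np.zeros((len(a), 1)); B[-1, 0] = 1.0
ctrb = np.hstack([np.linalg.matrix_power(A, k) @ B for k in range(len(a))])
assert np.linalg.matrix_rank(ctrb) == len(a)   # full rank: controllable
```

Because the superdiagonal and zeros are fixed, only the last row participates in training, which is exactly where the parameter savings in the table below come from.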

Stability is enforced (ST-Mamba2) by clamping any non-negative diagonal entry of $A$ to a small negative constant, guaranteeing that all eigenvalues of the discretized system lie strictly inside the unit circle (Hamdan et al., 2024).
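A minimal sketch of the clamping step, assuming a diagonal $A$ and a zero-order-hold-style discretization $\exp(\Delta t \cdot a)$; the clamp constant and step size `dt` are placeholder values:

```python
import numpy as np

def clamp_stable(a_diag, eps=-1e-4):
    """Clamp any non-negative diagonal entry of A to a small negative
    constant so the continuous-time system is strictly stable."""
    a = np.asarray(a_diag, dtype=float).copy()
    a[a >= 0] = eps
    return a

a = clamp_stable([-0.5, 0.0, 0.3])         # 0.0 and 0.3 get clamped
dt = 0.1                                   # placeholder step size
eig_discrete = np.exp(dt * a)              # discretized diagonal eigenvalues
assert np.all(eig_discrete < 1.0)          # strictly inside the unit circle
```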

Sparsity statistics:

| Model | Parameters |
|----------|------------|
| Mamba | 64,475,648 |
| SO-Mamba | 64,352,904 |
| SC-Mamba | 64,344,840 |

Parameter reduction is realized by exploiting canonical sparsity while maintaining empirical performance or improving it.

3. Adaptive Input-Level and Layerwise Sparsification

Modern S-Mamba architectures employ adaptive, input-driven or learnable sparsification mechanisms as follows:

  • Spatio-Temporal Continuity Assessment (STCA) (Yang et al., 21 Jan 2025): Computes the information content of spatial-temporal tokens from events in NEU datasets, applying a Gaussian-weighted continuity score, and uses an adaptive threshold to generate a binary keep/discard mask D. Only tokens with sufficient event information are forwarded to subsequent stages.
  • Information-Prioritized Local Scan (IPL-Scan) (Yang et al., 21 Jan 2025): Within each spatial window, tokens are reordered so those scoring highest on Sˢᵗ (spatiotemporal continuity) interact first during the SSM scan, enhancing the propagation of salient information.
  • Sparse Deformable Sequencing (SDS / SDMS) (Xu et al., 13 Apr 2025, Dewis et al., 29 Jul 2025): In image or time-series inputs, tokens receive a cosine-similarity–based or attention-based relevance score, and only the top-λ fraction is processed by Mamba blocks, with the rest omitted. This can be done independently for spatial, spectral, and temporal tokens.
  • Sparse Spatial Activation (Liu et al., 31 Mar 2025): Visual Mamba with a sparsity module selects top-scoring spatial windows (via the $L_2$ norm of patches) per layer, masking the remainder and running SSM blocks only on the retained set.

Efficiency arises because these methods reduce the effective token count per layer from $N$ to $rN$ with $r \ll 1$, substantially decreasing computation and memory without loss of global context, thanks to linear-time mixing (Yang et al., 21 Jan 2025, Xu et al., 13 Apr 2025, Dewis et al., 29 Jul 2025).
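The top-fraction selection shared by these mechanisms can be sketched in a few lines. This toy version scores tokens by their $L_2$ norm (one of the several scoring choices the papers describe; cosine-similarity and attention-based scores work analogously) and keeps the top-λ fraction:

```python
import numpy as np

def sparse_token_select(tokens, lam=0.25):
    """Keep only the top-lam fraction of tokens by a relevance score.
    Toy scoring: L2 norm of each token vector."""
    N = tokens.shape[0]
    k = max(1, int(lam * N))
    scores = np.linalg.norm(tokens, axis=1)
    keep = np.sort(np.argsort(scores)[-k:])  # top-k indices, original order
    return tokens[keep], keep

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))                # N=64 tokens, d=16 each
kept, idx = sparse_token_select(x, lam=0.25)
assert kept.shape == (16, 16)                # rN tokens forwarded, r = 0.25
```

Only the retained `kept` tokens would be passed to the downstream Mamba blocks; the discarded ones are simply omitted from the scan.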

4. Weight Pruning and Resource Optimization

Sparse-Mamba admits unstructured parameter pruning for further resource savings, targeting deployment in constrained environments. The pruning procedure is as follows (Shihab et al., 13 May 2025):

  • Gradient-Aware Magnitude Scoring: Each parameter’s importance $S(w_i)$ is computed as the product of its magnitude and the corresponding gradient magnitude raised to an exponent $\alpha$:

$$S(w_i) = |w_i| \cdot \left|\frac{\partial \mathcal{L}}{\partial w_i}\right|^{\alpha}$$

$\alpha \approx 1$ is optimal for language modeling; smaller values yield greater accuracy degradation.

  • Iterative Cubic Pruning Schedule: Sparsity is increased progressively per iteration $t$,

$$s_t = s_f + (s_0 - s_f)\left(1 - \frac{t - t_0}{T - t_0}\right)^3$$

enabling model stability and finer capacity control, starting after 25% of training steps.

  • Global Thresholding: A single threshold $\tau_t$ is applied across all layers such that the total number of parameters with $S(w_i) \leq \tau_t$ matches the target sparsity $s_t$, automatically allocating higher densities to SSM-critical blocks than to linear projections.
  • Stability Preservation: For every SSM-block matrix $A$, eigenvalues are controlled so that $|\lambda| < 1$, rolling back pruning in the rare event this is violated.

Under this regime, up to 70% sparsity is achieved with <10% perplexity increase, and performance retention above 95% on WikiText-103, Long Range Arena, and ETT time-series (Shihab et al., 13 May 2025).
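The score, schedule, and global threshold above can be combined in a short sketch. The layer shapes, schedule values, and random "gradients" here are illustrative stand-ins, not values from the paper:

```python
import numpy as np

def importance(w, grad, alpha=1.0):
    """Gradient-aware score S(w_i) = |w_i| * |dL/dw_i|^alpha."""
    return np.abs(w) * np.abs(grad) ** alpha

def cubic_sparsity(t, t0, T, s0=0.0, sf=0.7):
    """Cubic schedule ramping sparsity from s0 to sf between steps t0 and T."""
    return sf + (s0 - sf) * (1 - (t - t0) / (T - t0)) ** 3

def global_mask(scores, sparsity):
    """One threshold across all layers: prune the lowest-scoring fraction."""
    flat = np.concatenate([s.ravel() for s in scores])
    tau = np.quantile(flat, sparsity)
    return [s > tau for s in scores]

rng = np.random.default_rng(1)
w = [rng.normal(size=100), rng.normal(size=50)]   # two toy "layers"
g = [rng.normal(size=100), rng.normal(size=50)]   # stand-in gradients
s_t = cubic_sparsity(t=500, t0=250, T=1000)       # schedule partway through
scores = [importance(wi, gi) for wi, gi in zip(w, g)]
masks = global_mask(scores, s_t)
kept = sum(m.sum() for m in masks) / 150
assert abs(kept - (1 - s_t)) < 0.05   # retained fraction tracks the schedule
```

Because the threshold is shared globally, layers whose parameters score highly (e.g. SSM-critical blocks) automatically retain more density than low-scoring linear projections.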

5. Specialized Modules for Channel and Multidimensional Mixing

Sparse-Mamba adopts modules for efficient cross-channel and cross-mode interactions:

  • Global Channel Interaction (GCI) (Yang et al., 21 Jan 2025): Aggregates spatially sparse features via a bidirectional S6 scan over channels, plus a $1 \times 1$ convolution for local-global channel fusion, at $O(C \cdot H \cdot W)$ complexity.
  • Multi-branch sparse Mamba: Applied in hyperspectral and MODIS data classification, using SDMS to process sparse, deformably-ordered tokens in the spectral, spatial, and temporal domains through dedicated Mamba blocks before attention-based or MLP fusion (Xu et al., 13 Apr 2025, Dewis et al., 29 Jul 2025).

Each module preserves global connectivity but incurs substantially reduced FLOPs by focusing channel and token mixing only on structurally or information-theoretically salient subsets.
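The bidirectional channel-scan idea can be illustrated with a toy stand-in, substituting a fixed decaying linear recurrence for the learned, input-dependent S6 dynamics (decay value and shapes are placeholders):

```python
import numpy as np

def bidirectional_channel_scan(x, decay=0.9):
    """Toy stand-in for a bidirectional scan over the channel axis:
    forward and backward linear recurrences h_c = decay*h_{c±1} + x_c,
    summed, so every channel sees every other in O(C*H*W) time."""
    C = x.shape[0]
    fwd = np.zeros_like(x)
    bwd = np.zeros_like(x)
    for c in range(C):
        fwd[c] = x[c] + (decay * fwd[c - 1] if c > 0 else 0)
    for c in reversed(range(C)):
        bwd[c] = x[c] + (decay * bwd[c + 1] if c < C - 1 else 0)
    return fwd + bwd - x                 # subtract x to avoid double-counting

x = np.ones((4, 2, 2))                   # C=4 channels over a 2x2 spatial grid
y = bidirectional_channel_scan(x)
assert y.shape == x.shape
```

A single pass in each direction visits each channel once per spatial location, which is where the linear $O(C \cdot H \cdot W)$ cost comes from; the real module replaces the fixed decay with selective SSM parameters and adds the $1 \times 1$ fusion convolution.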

6. Empirical Performance and Efficiency

S-Mamba models consistently match or exceed the accuracy of dense Mamba or transformer-based approaches while achieving pronounced computational savings:

  • On Gen1, 1Mpx, and eTram event data, S-Mamba achieves mAP of 50.4%, 49.3%, and 32.6% with 20–31% FLOPs reduction compared to SSM baselines (Yang et al., 21 Jan 2025).
  • On hyperspectral data (Indian Pines, Pavia University), SDMamba reaches OA/AA/Kappa scores above 99.1% with ~60% reduced FLOPs (Xu et al., 13 Apr 2025).
  • In MODIS time-series, STSMamba improves OA on out-of-domain Alberta data by 7% relative over MambaHSI and halves parameter count and runtime (Dewis et al., 29 Jul 2025).
  • In micro-expression recognition, Sparse Mamba yields improvements of up to 10 points in both F1 and UAR over non-sparse SSM vision models while reducing runtime by ~28%; combining the sparse and motion-magnification modules confers the greatest benefit (Liu et al., 31 Mar 2025).
  • In language modeling and time series, gradient-aware pruning allows 50–70% parameter reduction at <2% average drop in perplexity or MSE, frequently with improved robustness to input perturbation (Shihab et al., 13 May 2025).

Representative empirical results for event-based detection:

| Method | Gen1 mAP | 1Mpx mAP | FLOPs (Gen1/1Mpx) | Parameters | Runtime (Gen1) |
|--------|----------|----------|-------------------|------------|----------------|
| SAST-CB | 48.2 | 48.7 | 2.4G / 6.4G | 18.9M | 22.7 ms |
| SSM+RNN baseline | 50.0 | 48.8 | 3.1G / 9.5G | 16.1–16.7M | 25.2 ms |
| S-Mamba | 50.4 | 49.3 | 2.4G / 7.4G | 16.1–16.7M | 24.0 ms |

(Yang et al., 21 Jan 2025)

7. Interpretability, Robustness, and Theoretical Implications

S-Mamba’s explicit imposition of controllability, observability, and stability guarantees (via canonical matrix forms and spectral clamping) yields interpretable and robust sequence models. Experiments indicate:

  • SSM weights are not uniformly important; 20% of weights account for ~80% of importance as measured by the gradient-aware score (Shihab et al., 13 May 2025).
  • Pruned S-Mamba models exhibit increased robustness to adversarial or noisy input perturbations, attributed to the regularization effect of gradient-aware sparsity selection.
  • Ablations confirm the necessity of non-uniform (global) importance thresholding and controlled pruning rates for optimal accuracy preservation (Shihab et al., 13 May 2025).

Adopting control-theoretic designs obviates the need for runtime regularizers or heuristic stabilization, potentially simplifying scaling to larger or more structured SSM families ("Mamba3") (Hamdan et al., 2024).
