Mamba Selective SSM Architecture

Updated 3 February 2026
  • Mamba Selective SSM is a dynamic state-space model that uses input-dependent matrices and gating to tailor state transitions for each token.
  • It employs parameter generation networks and dual selective projectors to enable adaptive memory and robust processing across diverse applications.
  • The architecture supports aggressive compression, structured pruning, and hardware-efficient computations while maintaining constant model size.

The Mamba Selective State-Space Architecture (Selective SSM) is a parameterized, input-dependent state-space model that achieves expressive, content-aware sequence processing with strict linear-time complexity. It represents a major evolution beyond classical linear SSMs (e.g., S4) by dynamically adjusting both state transitions and input gates based on the current sample or token, fundamentally enabling adaptive memory and context propagation with a fixed architecture. This design permits efficient processing of long sequences, satisfies hardware constraints, and supports a wide range of domains—from class-incremental learning to compression and pruning, multimodal reasoning, and graph processing.

1. Mathematical Formulation and Core Principles

A traditional linear time-invariant (LTI) SSM is defined for a sequence of inputs $x_t \in \mathbb{R}^d$ and hidden state $h_t \in \mathbb{R}^D$ by

$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t$$

where $\bar{A}, \bar{B}, C$ are fixed, learned matrices (with $\bar{A}, \bar{B}$ typically derived from the zero-order hold discretization of a continuous-time system).

The Mamba Selective SSM generalizes this by making the matrices input-dependent:

$$\bar{A}_t = f_A(x_t), \qquad \bar{B}_t = f_B(x_t), \qquad C_t = f_C(x_t)$$

As a result, the state transition and update for each token are functions of the current content, allowing dynamic selection of what information to propagate, forget, or inject. The selective mechanism is further enhanced by learned gating functions (e.g., $g_t = \sigma(W_g x_t + U_g h_{t-1} + b_g)$ in Mamba-Shedder (Muñoz et al., 28 Jan 2025)), so updates can be partially or fully inhibited for any part of the state.
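The input-dependent recurrence can be sketched in a few lines of NumPy. The shapes, the linear parameter-generation maps, and the sigmoid-bounded decay below are illustrative choices, not Mamba's exact parameterization (which uses a discretized diagonal $A$ with a learned step size $\Delta_t$):

```python
# Minimal sketch of a selective SSM recurrence: the transition A_t, input
# map B_t, and readout C_t are all generated from the current token x_t,
# so every step of the scan is content-dependent. Shapes and generator
# networks are illustrative placeholders, not the fused Mamba kernel.
import numpy as np

rng = np.random.default_rng(0)
d, D, L = 4, 8, 6  # input dim, state dim, sequence length

# Illustrative parameter-generation networks f_A, f_B, f_C as linear maps.
W_A = rng.normal(scale=0.1, size=(d, D))     # produces per-token decay
W_B = rng.normal(scale=0.1, size=(d, D, d))  # produces per-token input matrix
W_C = rng.normal(scale=0.1, size=(d, d, D))  # produces per-token readout matrix

def selective_scan(x):
    """Run the input-dependent recurrence h_t = A_t * h_{t-1} + B_t x_t."""
    h = np.zeros(D)
    ys = []
    for x_t in x:
        A_t = 1.0 / (1.0 + np.exp(-(x_t @ W_A)))  # diagonal decay in (0, 1)
        B_t = np.tensordot(x_t, W_B, axes=1)       # (D, d) input matrix
        C_t = np.tensordot(x_t, W_C, axes=1)       # (d, D) readout matrix
        h = A_t * h + B_t @ x_t                    # selective state update
        ys.append(C_t @ h)
    return np.stack(ys)

y = selective_scan(rng.normal(size=(L, d)))
print(y.shape)  # (6, 4): one output per token
```

Because $A_t$ sits in $(0, 1)$ per channel, each token can choose, channel by channel, how much history to retain versus overwrite, which is the "selection" mechanism in miniature.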

For images or 2D data, the SS2D extension applies the S6 dynamics along multiple scan directions (top-down, bottom-up, left-to-right, right-to-left) and sums the representations, preserving angular isotropy and global receptive field coverage (Li et al., 2024).
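The multi-directional scan can be sketched as follows; the scalar decay recurrence stands in for the full S6 dynamics, and all parameters are illustrative:

```python
# Illustrative SS2D-style multi-directional scan (simplified): the 2D feature
# map is flattened along four traversal orders, a 1D recurrence runs over each
# order, and the four results are summed back on the grid. The scalar decay is
# a hypothetical stand-in for the full per-token S6 parameterization.
import numpy as np

def scan_1d(seq, decay=0.9):
    """Cumulative recurrence h_t = decay * h_{t-1} + seq_t along axis 0."""
    h = np.zeros_like(seq[0])
    out = np.empty_like(seq)
    for t, s in enumerate(seq):
        h = decay * h + s
        out[t] = h
    return out

def ss2d(feat):
    """feat: (H, W, C). Sum of scans over four traversal directions."""
    H, W, C = feat.shape
    rm = feat.reshape(H * W, C)                     # row-major flattening
    cm = feat.transpose(1, 0, 2).reshape(H * W, C)  # column-major flattening
    out = scan_1d(rm) + scan_1d(rm[::-1])[::-1]     # left-to-right and right-to-left
    cm_out = scan_1d(cm) + scan_1d(cm[::-1])[::-1]  # top-down and bottom-up
    out += cm_out.reshape(W, H, C).transpose(1, 0, 2).reshape(H * W, C)
    return out.reshape(H, W, C)

feat = np.random.default_rng(0).normal(size=(4, 5, 3))
y2d = ss2d(feat)
print(y2d.shape)  # (4, 5, 3): same grid, mixed along all four directions
```

Summing the four directional scans gives every position a causal path to every other position, which is what yields the global receptive field on a 2D grid.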

2. Dual Selective SSM Projector and Class-Sensitive Mechanisms

Mamba-FSCIL (Li et al., 2024) introduces a dual selective SSM projector with three structurally decoupled branches:

  • Identity branch (frozen after base training)
  • Base-class SSM branch $g^\mathrm{base}$ (frozen after base training)
  • Incremental-class SSM branch $g^\mathrm{inc}$ (learned only during incremental sessions)

The processing pipeline for input $x \in \mathbb{R}^{N \times D \times H \times W}$ comprises reshaping, linear projection with learned positional encoding, splitting into scan and gate streams, and computation of per-sample, per-direction SSM parameters for each spatial position:

$$A_{n,k,:,\ell} = f_A(\hat{X}), \qquad C_{n,k,:,\ell} = f_C(\hat{X}), \qquad \Delta_{n,k,:,\ell} = f_\Delta(\hat{X})$$

Depthwise convolution, gating (e.g., SiLU activation), scanning over the SS2D directions, and average pooling then yield per-branch representations.

The class-sensitive scan mechanism:

  • Suppression loss forces the $g^\mathrm{inc}$ output to vanish on base classes while maximally adapting for novel classes: $L_\mathrm{supp} = \sum_{c \in \mathcal{B}} \|\mu^\mathrm{inc}_c\|^2 - \sum_{i \in \mathcal{N}} \|\mu^\mathrm{inc}_i\|^2$
  • Separation loss enforces decorrelation (orthogonality) of the parameter subspaces for base versus novel classes: $L_\mathrm{sep} = \cos(\bar{A}_\mathcal{B}, \bar{A}_\mathcal{N}) + \cos(\bar{C}_\mathcal{B}, \bar{C}_\mathcal{N}) + \cos(\bar{\Delta}_\mathcal{B}, \bar{\Delta}_\mathcal{N})$

The overall objective in incremental sessions combines a dot-regression loss (with a fixed ETF classifier) with the suppression and separation losses:

$$L_\mathrm{total} = L_\mathrm{cls} + \alpha L_\mathrm{supp} + \beta L_\mathrm{sep}$$

Hyperparameters are tuned (e.g., $\alpha \in [50, 200]$, $\beta \in [0.05, 0.5]$), and all adaptation proceeds within a fixed parameter budget.
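The suppression and separation terms can be sketched numerically. The per-class means and pooled parameter summaries below are simplified NumPy stand-ins, not the exact Mamba-FSCIL tensors:

```python
# Hedged sketch of the suppression and separation losses. Class means and
# A/C/Delta parameter summaries are illustrative placeholder arrays; the
# weighting values are picked from the hyperparameter ranges quoted above.
import numpy as np

def cos(u, v):
    """Cosine similarity between two flattened parameter summaries."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def suppression_loss(mu_inc, base_ids, novel_ids):
    """Shrink incremental-branch outputs on base classes, grow them on novel ones."""
    base = sum(np.sum(mu_inc[c] ** 2) for c in base_ids)
    novel = sum(np.sum(mu_inc[i] ** 2) for i in novel_ids)
    return base - novel

def separation_loss(params_base, params_novel):
    """Sum of cosine similarities between base and novel A, C, Delta summaries."""
    return sum(cos(pb, pn) for pb, pn in zip(params_base, params_novel))

rng = np.random.default_rng(1)
mu_inc = rng.normal(size=(10, 16))  # per-class mean features, 10 classes
L_supp = suppression_loss(mu_inc, base_ids=range(6), novel_ids=range(6, 10))
L_sep = separation_loss([rng.normal(size=32) for _ in range(3)],
                        [rng.normal(size=32) for _ in range(3)])
L_total_extra = 100.0 * L_supp + 0.2 * L_sep  # alpha, beta within quoted ranges
print(L_total_extra)
```

Minimizing $L_\mathrm{sep}$ drives the three cosine terms toward zero, i.e., toward orthogonal base and novel parameter subspaces.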

3. Fixed-Architecture Adaptation and Computational Complexity

Selective SSMs operate within a fixed model size, never expanding the parameter count even as new distributional shifts or classes appear (Li et al., 2024, Gu et al., 2023):

  • Parameter generation networks $f_A, f_B, f_C$ (e.g., 1×1 convolutions or small MLPs) transform each input into SSM parameters "on the fly".
  • At inference, only a static set of networks is used, but their outputs and forward dynamics are fully content-adaptive, distinguished by learned branches and class-sensitive losses.
  • The scanning and recurrence cost is strictly $O(LD)$ per selective SSM layer, versus $O(L^2 D)$ for attention, so selective SSMs overtake attention as sequences grow long. Shared SSM kernels and pooled operations ensure linear scaling in sequence length.
  • Fused parallel scan implementations (SRAM-local, see (Gu et al., 2023)) further optimize hardware utilization, with constant per-token inference speed.
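The per-token recurrence $h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t$ is an affine update, and composing affine updates is associative; this associativity is what fused parallel-scan kernels exploit. A minimal sketch for a single scalar channel (real kernels apply this blockwise in SRAM):

```python
# The linear recurrence h_t = a_t * h_{t-1} + b_t is associative under the
# affine-composition operator below, so it can be computed as a prefix scan
# (O(log L) depth in parallel). Here both forms run sequentially just to
# verify they agree; the scalar channel is an illustrative simplification.
import numpy as np

def combine(e1, e2):
    """Compose two affine updates h -> a*h + b: first e1, then e2."""
    a1, b1 = e1
    a2, b2 = e2
    return (a1 * a2, a2 * b1 + b2)

def scan_sequential(a, b):
    """Direct recurrence h_t = a_t * h_{t-1} + b_t with h_0 = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def scan_prefix(a, b):
    """Inclusive prefix scan with the affine combine operator."""
    acc, out = (1.0, 0.0), []
    for e in zip(a, b):
        acc = combine(acc, e)
        out.append(acc[1])  # applying (a, b) to h_0 = 0 yields b
    return out

a = [0.9, 0.5, 0.8, 0.7]
b = [1.0, -2.0, 0.5, 3.0]
assert np.allclose(scan_sequential(a, b), scan_prefix(a, b))
```

Because `combine` is associative, the prefix-scan form can be evaluated with any bracketing, which lets hardware kernels split the sequence across parallel units and merge partial results.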

4. Compression, Pruning, and Structured Sparsity

Selective SSMs support aggressive compression and pruning operations:

  • Structured pruning (Mamba-Shedder, PerfMamba, SparseSSM) (Muñoz et al., 28 Jan 2025, Asif et al., 28 Nov 2025, Tuo et al., 11 Jun 2025): Importance scores (e.g., increase in perplexity when a block/module/channel is zeroed) identify components yielding minimal loss under pruning. Block-level, module-level, and width-wise sparsity schemes are introduced.
  • Theoretical scaling: Pruning a fraction $p$ of blocks and a fraction $q$ of SSM modules yields adjusted FLOPs

$$\mathrm{FLOPs}_{\mathrm{pruned}} \approx (1-p)(1-q)\, L c\, T D$$

Users observe up to $1.14\times$ speedup and $11.5\%$ memory reduction before fine-tuning, with negligible accuracy degradation under moderate pruning regimes.

  • OBS-inspired sensitivity (SparseSSM): Pruning 50% of SSM weights using second-order saliency scores (from the Hessian trace) incurs no zero-shot accuracy loss, outperforming post-training pruning methods designed for attention models.
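The pruned-FLOPs scaling in this section can be checked with a worked instance; the layer count, per-layer constant, sequence length, and width below are illustrative placeholders, not figures from the cited papers:

```python
# Worked instance of FLOPs_pruned ~ (1-p)(1-q) * L * c * T * D. All model
# dimensions here are hypothetical; only the scaling law comes from the text.
def pruned_flops(p, q, L, c, T, D):
    """FLOPs after pruning fraction p of blocks and q of SSM modules."""
    return (1 - p) * (1 - q) * L * c * T * D

base = pruned_flops(0.0, 0.0, L=48, c=6, T=2048, D=2560)
pruned = pruned_flops(0.125, 0.10, L=48, c=6, T=2048, D=2560)
print(f"retained fraction: {pruned / base:.4f}")  # (1-0.125)*(1-0.10) = 0.7875
```

Because the two pruning fractions multiply, moderate block-level and module-level sparsity compound: 12.5% of blocks plus 10% of modules already removes over a fifth of the compute.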

5. Theoretical Implications and Token Dynamics

Recent work demystifies the token-level dynamics of selective SSMs (Vo et al., 2024):

  • Discrete-time S6 blocks admit explicit regimes in the continuous limit where either all token states converge (collapse to zero) or diverge at different rates (heterogeneous update contributions). The convergence regime is deleterious for representation fidelity and predictive power.
  • Practical refinements include imposing positive-definite input-output mappings at initialization and token reordering by divergence speed (learned "importance score" via SoftSort), boosting generalization and convergence.
  • Input selectivity enhances function approximation (e.g., Haar wavelet bases) and can counteract memory decay beyond the limitations of diagonal SSMs (Huang et al., 13 Jun 2025).
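The two regimes can be illustrated with a toy scalar state and hypothetical decay values (this is a cartoon of the dynamics, not the S6 analysis itself):

```python
# Toy scalar illustration of the two token-dynamics regimes: a contractive
# transition (|a| < 1) collapses the state toward zero, while an expansive
# one (|a| > 1) diverges. Decay values are illustrative placeholders.
def iterate(a, steps=50, h0=1.0):
    """Apply h <- a * h for `steps` iterations starting from h0."""
    h = h0
    for _ in range(steps):
        h = a * h
    return h

collapsed = iterate(0.9)  # 0.9**50 ~ 5.2e-3: state vanishes
diverged = iterate(1.1)   # 1.1**50 ~ 1.2e2: state blows up
print(collapsed, diverged)
```

Input selectivity lets the model move between these regimes per token rather than committing globally to one, which is why uniform collapse (the harmful case above) can be avoided.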

6. Applications Across Domains

The Selective SSM paradigm is broadly instantiated across domains, including class-incremental learning, few-shot recognition, multimodal translation and reasoning, graph processing, and compression-oriented deployment.

7. Limitations, Open Questions, and Ongoing Directions

Key avenues based on current findings:

  • Selectivity mechanism design: Control-theoretic LTI residual schemes can match or surpass Mamba’s selectivity on synthetic benchmarks, with better convolutional structure and stability (Casti et al., 23 May 2025).
  • Robustness and optimality: Information-theoretic regularization (MPS principle) aligns selectivity with predictive sufficiency and minimality, filtering out spurious historical dependencies (Wang et al., 5 Aug 2025).
  • Scaling and hardware efficiency: Structured pruning and fused scan operations are central to low-latency, low-memory deployment, with research into cross-layer state sharing and adaptive routing.
  • Theory and function space: Analytical constructions relate Mamba’s selectivity to wavelet and piecewise basis approximation, associative recall, and long-term memory retention.
  • Application diversity: Variants continue to emerge in class-incremental learning, few-shot recognition, multimodal translation, graph reasoning, and other domains.

In summary, the Mamba Selective State-Space Architecture leverages dynamic, input-conditioned state transitions, channel-wise gating, and modular scan strategies to achieve expressive, adaptive sequence modeling in a fixed, hardware-efficient architecture. Its extensibility, pruning resilience, and theoretical richness have established it as a foundation for contemporary sequence modeling across language, vision, audio, graph, and spatio-temporal data (Li et al., 2024, Muñoz et al., 28 Jan 2025, Asif et al., 28 Nov 2025, Gu et al., 2023, Vo et al., 2024, Wang et al., 2024, Atli et al., 2024, Wang et al., 5 Aug 2025).
