Mamba-SSM: Input-Driven Selective SSMs
- Mamba-SSM is a selective state-space model that uses input-dependent parameters to enable content-aware, linear-time sequence processing.
- It achieves efficient long-range memory retention and dynamic adaptation through a hardware-aware parallel selective scan, improving on the quadratic cost of traditional transformers on long sequences.
- Mamba-SSM's versatility is demonstrated across language, vision, audio, and time-series tasks with robust performance and hardware-aware optimizations.
Mamba-SSM
Mamba-SSM denotes the class of selective state-space models (SSMs) characterized by input-dependent, hardware-aware, linear-time sequence modeling with strong empirical and theoretical properties. As a scalable alternative to transformers, Mamba-SSM can be instantiated as a generic backbone for natural language, vision, audio/speech, time-series forecasting, multimodal fusion, and other long-sequence domains. Its key innovation lies in parameterizing SSM transitions and readouts as functions of the input at each step, endowing the architecture with both content awareness (selectivity) and computational efficiency—typically realized via a parallel “selective scan” algorithm (Gu et al., 2023).
1. Mathematical Foundations and Selective State-Space Formulation
Mamba-SSM extends the classical linear state-space model to permit transition and readout matrices that are functions of the input sequence. The model in continuous time is specified by

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),$$

where $h(t) \in \mathbb{R}^N$ is the latent state, $x(t)$ the input, $y(t)$ the output, and $A, B, C$ are the system parameters (Gu et al., 2023, Liu et al., 2024).
Discretization via zero-order hold (step size $\Delta$) yields

$$h_t = \bar{A}_t\,h_{t-1} + \bar{B}_t\,x_t, \qquad y_t = C_t\,h_t,$$

with input-dependent, time-varying parameters

$$\bar{A}_t = \exp(\Delta_t A), \qquad \bar{B}_t = (\Delta_t A)^{-1}\bigl(\exp(\Delta_t A) - I\bigr)\,\Delta_t B_t \;\; (\text{in practice often simplified to } \Delta_t B_t).$$

The selection mechanism specifies that $\Delta_t$, $B_t$, and $C_t$ are computed as projections or small neural networks of the input $x_t$, e.g., $\Delta_t = \mathrm{softplus}(W_\Delta x_t)$, $B_t = W_B x_t$, $C_t = W_C x_t$.
For diagonal (channel-wise) parameterization, the computational cost is $O(DN)$ per token, i.e., $O(LDN)$ over a sequence of length $L$ with state (hidden) size $N$ and model dimension $D$ (Gu et al., 2023, Liu et al., 2024).
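To make the discretization concrete, the following NumPy sketch runs the discretized selective recurrence sequentially for a diagonal SSM; the projection names (`W_B`, `W_C`, `w_delta`) and the simplified $\bar{B}_t = \Delta_t B_t$ are illustrative choices, not the reference implementation.

```python
import numpy as np

def selective_ssm(x, A, W_B, W_C, w_delta):
    """Sequential reference for a diagonal selective SSM (simplified; projection names are illustrative).

    x:        (L, D) input sequence            A:       (D, N) negative diagonal state matrix
    W_B, W_C: (D, N) projections for B_t, C_t  w_delta: (D,) projection producing the step size Delta_t
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))
    y = np.zeros((L, D))
    for t in range(L):
        delta_t = np.log1p(np.exp(x[t] @ w_delta))       # softplus -> positive scalar step size
        B_t, C_t = x[t] @ W_B, x[t] @ W_C                # input-dependent (N,) vectors
        A_bar = np.exp(delta_t * A)                      # ZOH transition, (D, N)
        h = A_bar * h + delta_t * np.outer(x[t], B_t)    # simplified B_bar = Delta_t * B_t
        y[t] = h @ C_t                                   # read-out y_t = C_t h_t per channel
    return y

# Shapes line up, e.g.:
# y = selective_ssm(np.random.randn(32, 8), -np.ones((8, 4)),
#                   np.random.randn(8, 4), np.random.randn(8, 4), np.random.randn(8))
```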
This selective mechanism enables the model to dynamically adjust memory and information flow, strictly subsuming earlier fixed, time-invariant SSMs (e.g., S4, S4D); dynamical-systems analysis further shows it to be both content-aware and robust (see Section 3).
2. Architectural and Algorithmic Properties
A canonical Mamba block (termed “S6”) embeds the selective SSM within a neural mixer/fusion setup. The pipeline is as follows (Gu et al., 2023, Huang et al., 13 Jun 2025); a minimal code sketch follows the list:
- Apply normalization (LayerNorm or RMSNorm) to the input $x$.
- Project to two separate streams: one fed directly to the gating path, the other processed through a depthwise (causal) convolution capturing local context.
- Compute the SSM parameters $\Delta_t$, $B_t$, $C_t$ (e.g., via linear projections from the convolution output).
- Run the selective SSM recurrence in parallel per batch, channel, and token position: $h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t$.
- Compute the read-out and gating: $y_t = C_t h_t$, multiplied elementwise by the SiLU-activated gate stream.
- Fuse with additional pointwise (MLP and/or gating) nonlinearities and apply residual connections.
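A minimal PyTorch sketch of this pipeline is given below; the module layout, the scalar per-token $\Delta_t$, and the sequential inner loop are simplifications for exposition (real implementations fuse the loop into the hardware-aware parallel scan described next).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockSketch(nn.Module):
    """Illustrative S6-style block; names and simplifications are ours, not the reference code."""

    def __init__(self, d_model, d_state=16, d_conv=4, expand=2):
        super().__init__()
        d_inner = expand * d_model
        self.d_state = d_state
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_inner)              # two streams: SSM path + gate
        self.conv = nn.Conv1d(d_inner, d_inner, d_conv,
                              groups=d_inner, padding=d_conv - 1)   # depthwise causal conv
        self.param_proj = nn.Linear(d_inner, 2 * d_state + 1)       # per-token B_t, C_t, Delta_t
        # Diagonal A < 0, stored in log space (S4D-style initialization).
        self.A_log = nn.Parameter(
            torch.log(torch.arange(1, d_state + 1).float()).repeat(d_inner, 1))
        self.out_proj = nn.Linear(d_inner, d_model)

    def forward(self, x):                                           # x: (batch, L, d_model)
        bsz, L, _ = x.shape
        res = x
        u, z = self.in_proj(self.norm(x)).chunk(2, dim=-1)          # SSM stream u, gate stream z
        u = self.conv(u.transpose(1, 2))[..., :L].transpose(1, 2)   # local context, causal crop
        u = F.silu(u)
        B_t, C_t, dt = self.param_proj(u).split(
            [self.d_state, self.d_state, 1], dim=-1)                # input-dependent parameters
        dt = F.softplus(dt)                                         # positive step size (one scalar per token here)
        A = -torch.exp(self.A_log)                                  # (d_inner, d_state), negative diagonal
        h = u.new_zeros(bsz, u.shape[-1], self.d_state)
        ys = []
        for t in range(L):                                          # sequential reference; real kernels use a parallel scan
            A_bar = torch.exp(dt[:, t].unsqueeze(-1) * A)           # ZOH transition, (batch, d_inner, d_state)
            h = A_bar * h + (dt[:, t] * u[:, t]).unsqueeze(-1) * B_t[:, t].unsqueeze(1)
            ys.append((h * C_t[:, t].unsqueeze(1)).sum(-1))         # read-out y_t = C_t h_t
        y = torch.stack(ys, dim=1) * F.silu(z)                      # gate and fuse
        return res + self.out_proj(y)                               # residual connection
```

A call such as `MambaBlockSketch(d_model=64)(torch.randn(2, 128, 64))` returns a tensor of the same shape, as expected for a residual token mixer.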
Hardware-aware, linear-time “selective scan” implementations fuse parameter generation, recurrence, and output projection into a single operator, storing only the minimal activations and using prefix-scan parallelism to exploit GPU/TPU architectures (Gu et al., 2023, Baruah et al., 25 Aug 2025). The state dimension $N$ is typically chosen to trade off memory and capacity.
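To illustrate the prefix-scan structure (not the fused GPU kernel itself): the per-token affine updates $h \mapsto \bar{A}_t h + \bar{B}_t x_t$ compose associatively, so the recurrence can be evaluated in $O(\log L)$ combine steps. The NumPy sketch below applies a Hillis-Steele-style inclusive scan to $(\bar{A}_t, \bar{B}_t x_t)$ pairs for a single diagonal channel; function names are illustrative.

```python
import numpy as np

def affine_combine(a1, b1, a2, b2):
    # Compose h -> a1*h + b1 followed by h -> a2*h + b2 (an associative operation).
    return a1 * a2, a2 * b1 + b2

def parallel_linear_recurrence(a, b):
    """Inclusive scan computing h_t = a_t * h_{t-1} + b_t with h_0 = 0, in O(log L) combine steps."""
    a, b = a.astype(float), b.astype(float)
    shift = 1
    while shift < len(a):
        # Each position combines with the prefix ending `shift` steps earlier (identity element = (1, 0)).
        a_prev = np.concatenate([np.ones(shift), a[:-shift]])
        b_prev = np.concatenate([np.zeros(shift), b[:-shift]])
        a, b = affine_combine(a_prev, b_prev, a, b)
        shift *= 2
    return b  # b now holds h_1, ..., h_L

# Matches the sequential recurrence, with e.g. a_t = exp(Delta_t * A_diag), b_t = B_bar_t * x_t.
a = np.exp(-np.random.rand(7))            # decay factors in (0, 1]
b = np.random.randn(7)
h, ref = 0.0, []
for at, bt in zip(a, b):
    h = at * h + bt
    ref.append(h)
assert np.allclose(parallel_linear_recurrence(a, b), ref)
```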
3. Theoretical Properties: Memory, Approximation, Stability
Memory and Long-Term Dependency
Unlike linear time-invariant (LTI) SSMs (e.g., S4D), which are constrained to exponential memory decay, the content-dependent gating in Mamba-SSM allows active suppression of decay, enabling selective retention of information. For the S6 recurrence $h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t$, where $\bar{A}_t = \exp(\Delta_t A)$ with $\mathrm{diag}(A) \le 0$, the “forgetting” rate $\Delta_t$ can be driven to zero over any desired interval, so that $\bar{A}_t = I$ and memory is effectively “frozen”, allowing perfect recall of past tokens, a behavior that strictly subsumes what RNNs and S4D can express (Huang et al., 13 Jun 2025).
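As a toy numerical illustration of this freezing behavior (scalar channel, arbitrary values): driving $\Delta_t$ to zero makes the transition factor $\exp(\Delta_t a)$ exactly one, so earlier content is carried forward unchanged while new inputs are ignored.

```python
import numpy as np

a = -1.0                                  # nonpositive diagonal entry
h = 5.0                                   # state holding earlier information
for delta_t, x_t in [(0.0, 3.0), (0.0, -7.0), (0.0, 2.0)]:
    h = np.exp(delta_t * a) * h + delta_t * x_t   # S6-style update with B_bar = delta_t (toy choice)
print(h)                                  # 5.0: the past content is recalled exactly after the "frozen" interval
```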
Approximation Capacity
Mamba-SSM can approximate discontinuous functions (such as step and Haar wavelet projections) far more efficiently than LTI/diagonal SSMs. The S6 selectivity mechanism provides piecewise function approximation at exponential rates in the number of learned bases, whereas S4D’s rate is at best polynomial (Huang et al., 13 Jun 2025). This underlies the empirically superior performance on tasks requiring piecewise or local memory (e.g., associative recall, selective copying).
Lyapunov Stability and Robustness
The discrete dynamical system defining a Mamba-SSM block is shown to possess nonpositive maximal Lyapunov exponents ($\lambda_{\max} \le 0$), provided the diagonal entries of $A$ are nonpositive and the gating is bounded. This ensures that model outputs remain robust under small perturbations (e.g., mixed-precision quantization or noise) and inhibits exponential divergence in recurrent updates, a property not shared by transformers (Halloran et al., 2024).
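A sketch of the per-channel argument under the stated assumptions (diagonal entry $a \le 0$, step sizes $\Delta_t \ge 0$ and bounded): for the diagonal recurrence $h_t = \exp(\Delta_t a)\,h_{t-1} + \bar{b}_t x_t$,

$$\lambda_{\max} \;=\; \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \log\bigl|\exp(\Delta_t a)\bigr| \;=\; \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \Delta_t a \;\le\; 0,$$

since each term $\Delta_t a \le 0$; hence perturbations of the hidden state cannot grow exponentially along the recurrence.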
4. Empirical Performance and Applications
Language Modeling and Long-Sequence Generation
Mamba-SSM achieves competitive or superior performance to transformers in language modeling (e.g., The Pile, OpenWebText), closing the gap in perplexity, zero-shot, and few-shot metrics even as sequence context grows from thousands to millions of tokens. Notably, the architecture exhibits higher throughput and reduced memory overhead in autoregressive inference due to its stateful, linear-time recurrence (Gu et al., 2023, Halloran et al., 2024).
Vision, Remote Sensing, and Multimodal Fusion
In computer vision, Mamba-SSM-based backbones (e.g., Vim-Tiny, VMamba, LocalVMamba) deliver top-1 ImageNet accuracy comparable to ViT and CNNs, and outperform them on dense prediction tasks (detection, segmentation) where long-range spatial context is critical (Liu et al., 2024, Zhang et al., 2024). In remote sensing, tailored scan strategies and SSM-CNN hybrids extend Mamba’s applicability to hyperspectral classification, semantic segmentation, super-resolution, and change detection, with linear-scaling cost (Bao et al., 1 May 2025). For multi-modal fusion, Mamba-based architectures have been shown to effectively couple cross-modal state evolutions, leading to improved F1 scores, inference speed, and memory efficiency (Li et al., 2024).
Speech and Audio
Mamba-SSM encoders and decoders achieve WER and MOS that match or exceed SOTA Transformer variants (e.g., Conformer, E-Branchformer) in ASR and TTS, and uniquely tolerate very long-form audio inputs with consistent inference accuracy and robust runtime scaling (Miyazaki et al., 2024).
Time-Series Forecasting
ss-Mamba (semantic-spline Mamba) demonstrates efficient, interpretable, and robust foundation modeling for time series: it combines the selective SSM with semantic-aware index embeddings and spline-based temporal encodings, improving generalization and significantly outperforming Transformer and SSM baselines on key forecasting metrics (Ye, 3 Jun 2025).
Medical Imaging and Tracker Tasks
MambaXCTrack leverages SSM cross-correlation modules for ultrasound needle tracking, achieving superior accuracy, robustness to visibility loss, and real-time performance relative to convolutional or transformer-based trackers (Zhang et al., 2024). Analysis on medical imaging reveals natural hierarchical refinement and controllability signatures in Vision Mamba SSMs, with interpretable, spatially-distributed influence maps (Mabrok et al., 16 Nov 2025).
5. Complexity, Memory, and Hardware Implementation
The computational cost of the selective scan in Mamba-SSM is $O(LN)$ per channel for sequence length $L$ and state dimension $N$ (i.e., $O(LDN)$ across $D$ channels), in contrast to the $O(L^2 D)$ cost of transformer self-attention. Kernel fusion and parallelization across GPU memory hierarchies allow SSM kernels to leverage low-latency SRAM and vector pipelines; blockwise materialization trades increased memory for higher I/O throughput (Baruah et al., 25 Aug 2025, Asif et al., 28 Nov 2025).
Ablation studies and kernel profiling reveal that the SSM kernel dominates decoder runtime and resource consumption; pruning up to 30% of low-activity states offers tangible throughput and memory benefits with minimal accuracy loss (Asif et al., 28 Nov 2025). Emerging FPGA-optimized variants demonstrate >2x speedup and >5x energy efficiency over GPU baselines for autoregressive inference (Zhong et al., 24 Sep 2025).
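One plausible realization of such pruning (an illustrative criterion, not necessarily the cited papers' exact procedure) ranks state indices by their average contribution to the read-out on a calibration set and drops the least active fraction:

```python
import torch

def prune_low_activity_states(h_cal, C_cal, prune_ratio=0.3):
    """Select state indices to keep, ranking by mean |C_t[n] * h_t[n]| over a calibration batch.

    h_cal: (tokens, d_state) hidden states collected during calibration
    C_cal: (tokens, d_state) corresponding read-out coefficients
    Returns the indices of states to keep; downstream, the A, B, C parameters are sliced accordingly.
    """
    activity = (C_cal * h_cal).abs().mean(dim=0)               # per-state average read-out contribution
    n_keep = int(h_cal.shape[1] * (1.0 - prune_ratio))
    keep = torch.topk(activity, n_keep).indices.sort().values  # preserve original state ordering
    return keep

# Example with random calibration tensors: 16 states -> keep 11 when pruning ~30%.
keep = prune_low_activity_states(torch.randn(1024, 16), torch.randn(1024, 16))
print(keep.numel())  # 11
```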
6. Architectural Variants and Hybridization
Variants of Mamba-SSM extend the model’s capabilities along several axes:
- Hierarchical spatial context (Hi-Mamba, hierarchical/region SSM blocks and multi-scale alternation) achieves state-of-the-art PSNR for image super-resolution without multi-direction scanning overhead (Qiao et al., 2024).
- Locally bi-directional Mamba (LBMamba) embeds a lightweight backward-scan in the forward pass, securing bi-directional context at near-single-pass cost and dominating the throughput-accuracy Pareto frontier (Zhang et al., 19 Jun 2025).
- Mamba-2 (SSD duality) integrates attention-style quadratic mixers with the SSM recurrence, increasing performance for certain associative recall and memory tasks (Huang et al., 13 Jun 2025).
- Multimodal and remote sensing hybrids couple SSMs across modalities or along tree-structured and windowed paths, enabling cross-domain fusion and domain-specific optimization (Bao et al., 1 May 2025, Li et al., 2024).
- Vision-Mamba and MambaOut establish that SSM mixers are most beneficial for long-sequence or causal vision tasks and superfluous for feedforward, short-sequence settings (e.g., image-classification on ImageNet) (Yu et al., 2024).
7. Limitations, Interpretability, and Future Perspectives
Despite their strengths, Mamba-SSMs have limitations and open questions:
- For non-causal, feedforward tasks with short flattened sequences (e.g., image classification), the additional SSM complexity is unnecessary, with convolutional Gated-CNN models (“MambaOut”) often outperforming full Mamba-SSM backbones (Yu et al., 2024).
- 2D and 3D data processing using sequence flattening breaks spatial isotropy; research in 2D SSMs or mixed scan strategies continues (Bao et al., 1 May 2025, Qiao et al., 2024).
- SSM-selectivity mechanisms lack clear attribution interpretability—although recent work introduces Jacobian- and Gramian-based controllability maps, offering single-pass, fine-grained insight into patch or token influence (Mabrok et al., 16 Nov 2025).
- Quantization and ultra-low-precision variants remain underexplored, though hardware-aware design is ongoing (Baruah et al., 25 Aug 2025, Zhong et al., 24 Sep 2025).
- Scaling SSMs to 100M–1B+ parameters for multi-modal, high-resolution settings is an ongoing challenge, requiring architectural, microarchitectural, and optimization advances (Bao et al., 1 May 2025).
A plausible implication is that Mamba-SSM architectures will continue to form the basis of sequence modeling at a broad array of scales, especially as they integrate domain-adaptive scan strategies, hybridization with attention mechanisms, and memory-aware deployment. Their stability, interpretability, and efficiency properties, together with the breadth of empirical results, anchor selective SSMs as a core paradigm for next-generation foundation models.
References: (Gu et al., 2023, Liu et al., 2024, Huang et al., 13 Jun 2025, Halloran et al., 2024, Asif et al., 28 Nov 2025, Yu et al., 2024, Zhang et al., 2024, Zhang et al., 2024, Qiao et al., 2024, Bao et al., 1 May 2025, Ye, 3 Jun 2025, Baruah et al., 25 Aug 2025, Zhong et al., 24 Sep 2025, Mabrok et al., 16 Nov 2025, Li et al., 2024, Zhang et al., 19 Jun 2025, Miyazaki et al., 2024).