Selective Structured State Space Model (Mamba)

Updated 22 June 2026

The paper introduces Mamba, a selective structured state space model that leverages input-dependent parameters and efficient parallel scan algorithms for long-range dependency modeling.
Mamba achieves state-of-the-art throughput through hardware-aware discretization and a parallel scan, giving over 5× higher generation throughput than Transformers of equivalent size.
Empirical results across language, audio, vision, and time series tasks demonstrate significant improvements in efficiency and performance, with reduced compute and memory overhead.

A Selective Structured State Space Model (commonly referred to as "Mamba") is a neural sequence modeling architecture that generalizes classical state-space models (SSMs) by introducing input-dependent, time-varying parameters and efficient parallel scan algorithms. Mamba layers enable content-aware credit assignment, long-range dependency modeling, and linear time complexity. This class of models has rapidly emerged as a scalable foundation for sequence modeling across diverse domains such as language modeling, audio, vision, graph learning, time series, and multimodal tasks.

1. Foundational Principles and Mathematical Formulation

Mamba extends continuous-time linear SSMs by allowing the system's transition, input, and output matrices to become functions of the input at each timestep—implementing a highly expressive form of input-dependence or "selection." The continuous-time SSM dynamics are

$\frac{d\,h(t)}{dt} = A\big(x(t)\big)\,h(t) + B\big(x(t)\big)\,x(t),\qquad y(t) = C\big(x(t)\big)\,h(t)\,,$

where $A(x)\in\mathbb{R}^{N\times N}$, $B(x)\in\mathbb{R}^{N\times D}$, and $C(x)\in\mathbb{R}^{D\times N}$ are small neural networks or linear projections conditioned on the current token or feature vector $x(t)$ (Gu et al., 2023, Liu et al., 2024). The model is discretized via zero-order hold, producing per-step updates

$h_t = \bar{A}_t\,h_{t-1} + \bar{B}_t\,x_t,\qquad y_t = C_t\,h_t$

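with $\bar{A}_t=\exp(\Delta_t\,A(x_t))$ and $\bar{B}_t=(\Delta_t A(x_t))^{-1}\big(\exp(\Delta_t A(x_t))-I\big)\,\Delta_t B(x_t)$ the zero-order-hold discretization of $B(x_t)$, where $\Delta_t>0$ is the step size. Input-dependent gating or normalization is often applied as well.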

This selectivity enables the model to modulate which state dimensions to update or forget (akin to dynamic per-step gating), conferring content sensitivity and the ability to propagate or suppress information as a function of the current input (Gu et al., 2023).
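The gating effect described above can be illustrated with a minimal sketch. It assumes a diagonal, fixed continuous-time $A$ with negative entries, a single scalar input channel, and a step size `delta` supplied directly rather than computed from the input. The function names `discretize` and `selective_step` are illustrative, not taken from any library.

```python
import math

def discretize(delta, a, b):
    """Zero-order-hold discretization for one diagonal state entry.

    a < 0 is the continuous-time decay rate, b the input weight,
    delta > 0 the step size. Returns (a_bar, b_bar).
    """
    a_bar = math.exp(delta * a)
    # Scalar form of (delta*a)^-1 * (exp(delta*a) - 1) * delta*b
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def selective_step(h, x, delta, a, b, c):
    """One step of h_t = a_bar * h_{t-1} + b_bar * x_t, y_t = sum(c * h_t).

    h, a, b, c are length-N lists (diagonal A); x and delta are scalars.
    """
    new_h = []
    for h_n, a_n, b_n in zip(h, a, b):
        a_bar, b_bar = discretize(delta, a_n, b_n)
        new_h.append(a_bar * h_n + b_bar * x)
    y = sum(c_n * h_n for c_n, h_n in zip(c, new_h))
    return new_h, y

# Toy setting with N = 2 state entries (all values are made up).
a = [-1.0, -0.5]
b = [1.0, 1.0]
c = [1.0, 1.0]
h0 = [2.0, 2.0]

# Tiny step size: a_bar is close to 1 and b_bar close to 0, so the state is kept.
h_keep, _ = selective_step(h0, x=5.0, delta=1e-4, a=a, b=b, c=c)
# Large step size: a_bar is close to 0, so the old state is forgotten.
h_reset, _ = selective_step(h0, x=5.0, delta=50.0, a=a, b=b, c=c)
```

In this toy setting `h_keep` stays within about 0.001 of the previous state `[2.0, 2.0]`, while `h_reset` is approximately `[5.0, 10.0]`, which depends only on the current input. The same single-step function is all that autoregressive inference needs, since the state has fixed size.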

2. Hardware-Aware Implementation and Computational Properties

The input-selective SSM formulation precludes an efficient global convolution for training (as can be done with classical time-invariant SSMs), but the Mamba architecture compensates with a highly optimized, hardware-aware parallel scan algorithm (Gu et al., 2023, Liu et al., 2024). On modern accelerators, the scan is implemented entirely in on-chip memory using fused CUDA kernels, which:

Load all necessary selection projections and discretization parameters for the batch.
Compute all step-wise transition matrices and input projections in parallel.
Apply a segmented, associative scan to efficiently propagate the hidden state throughout the sequence.
Write only the final outputs to main memory; intermediate hidden states are not stored and are recomputed as needed during backpropagation.

This strategy results in strict linear scaling with sequence length $L$: per-batch cost is $O(BLDN)$ for batch size $B$, channel dimension $D$, and state dimension $N$, with independent operations along the batch and channel axes. During autoregressive inference, the SSM recurrence reduces to single-step updates on a fixed-size state, so per-token cost and memory do not grow with context length, unlike the key-value cache of attention-based models. Empirical benchmarks show that Mamba achieves >5× generation throughput over Transformers with equivalent parameter counts (Gu et al., 2023).
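The scan is possible because the recurrence $h_t = a_t h_{t-1} + b_t$ composes associatively: applying $(a_1, b_1)$ and then $(a_2, b_2)$ equals the single map $(a_1 a_2,\; a_2 b_1 + b_2)$. The sketch below checks this with a simple Hillis-Steele scan in plain Python, assuming scalar per-step coefficients and $h_0 = 0$. It is not the fused CUDA kernel, and a production kernel would typically use a work-efficient variant; the function names are illustrative.

```python
def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(a, b):
    """Reference recurrence h_t = a_t * h_{t-1} + b_t with h_0 = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def hillis_steele_scan(a, b):
    """Inclusive scan over (a_t, b_t) pairs using the associative `combine`.

    Every iteration of the inner loop is independent of the others, so each
    pass could run in parallel; about log2(L) passes are needed.
    """
    elems = list(zip(a, b))
    offset = 1
    while offset < len(elems):
        nxt = list(elems)
        for t in range(offset, len(elems)):
            nxt[t] = combine(elems[t - offset], elems[t])
        elems = nxt
        offset *= 2
    # With h_0 = 0, the offset term of each composed map is the state h_t.
    return [b_t for _, b_t in elems]

# Made-up coefficients for a length-5 sequence.
a = [0.5, 0.9, 0.1, 0.7, 0.3]
b = [1.0, 2.0, -1.0, 0.5, 4.0]
seq = sequential_scan(a, b)
par = hillis_steele_scan(a, b)
```

Both functions return the same states up to floating-point rounding. The difference is that the second exposes parallelism across time steps, which is what a GPU implementation exploits.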

3. Selectivity Mechanism and Architectural Variants

The hallmark of Mamba models is explicit token-wise selectivity, instantiated via learned input-dependent gating vectors or selective projections: $\Delta_t = \tau(W_\Delta x_t),\; B_t = W_B x_t,\; C_t = W_C x_t$, where $W_\Delta, W_B, W_C$ are linear projections, and $\tau$ (typically softplus or sigmoid) enforces positivity for stability (Liu et al., 2024, Gu et al., 2023).
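The selection step described above (linear projections of the current token followed by a positivity-enforcing nonlinearity on the step size) can be sketched as follows. The weight names `w_delta`, `b_delta`, `w_b`, `w_c` and the function `select_params` are hypothetical, the bias term is an added assumption, and the weights are hand-picked toy values rather than learned parameters.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z)); positive up to float underflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]

def select_params(x, w_delta, b_delta, w_b, w_c):
    """Map one token x (length D) to its per-step SSM parameters.

    Returns (delta, B_t, C_t): a positive scalar step size and two
    length-N vectors produced by linear projections of x.
    """
    pre = sum(w * x_j for w, x_j in zip(w_delta, x)) + b_delta
    return softplus(pre), matvec(w_b, x), matvec(w_c, x)

# Toy example with D = 3 input features and N = 2 state entries.
x = [1.0, -2.0, 0.5]
w_delta, b_delta = [0.1, 0.2, -0.3], 0.0
w_b = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_c = [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]

delta, B_t, C_t = select_params(x, w_delta, b_delta, w_b, w_c)
```

Because `delta`, `B_t`, and `C_t` are recomputed for every token, two different tokens at the same position produce different state updates, which is the content sensitivity the text refers to.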

Architectural advances have expanded this design into:

Bidirectional Mamba blocks: Simultaneous forward and backward SSM scans combined via fusion (elementwise addition or concatenation), enhancing contextual modeling (e.g., in speech and SELD tasks) (Jiang et al., 2024, Mu et al., 2024).
Dual-path/Hierarchical Mamba: Modeling both intra-chunk (local) and inter-chunk (global) dependencies in streaming or separation tasks (Jiang et al., 2024).
Spatial extensions: 2D/3D scan orderings, structure-aware state fusion via convolutions, or domain-specific geometric path selection (e.g., Serpentine scan for vessels) for vision and medical imaging (Xiao et al., 2024, Wang et al., 2024).
Mixture-of-Experts (MoE) augmentation: Sparse expert routing interleaved with SSM blocks to vastly expand model capacity with near-constant compute per token (Pióro et al., 2024).
Spline, graph, and semantic index integration: Encoding calendar, relational, or semantic structures for tasks such as time-series forecasting and graph representation learning (Ye, 3 Jun 2025, Li et al., 2024, Yuan et al., 2024).
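As a minimal illustration of the bidirectional variant in the list above, the sketch below runs a scalar linear recurrence left-to-right and right-to-left and fuses the two by elementwise addition. For brevity it assumes both directions share the same coefficients; an actual bidirectional block may use separate parameters per direction. The function names are illustrative.

```python
def scan(a, b):
    """Causal recurrence h_t = a_t * h_{t-1} + b_t with h_0 = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def bidirectional_scan(a, b):
    """Fuse a forward and a backward scan by elementwise addition."""
    fwd = scan(a, b)
    # Reverse the sequence, scan it, then restore the original order.
    bwd = scan(a[::-1], b[::-1])[::-1]
    return [f + g for f, g in zip(fwd, bwd)]

# An impulse at the last position of a length-4 sequence.
a = [0.5, 0.5, 0.5, 0.5]
b = [0.0, 0.0, 0.0, 1.0]

causal = scan(a, b)               # earlier positions cannot see the impulse
fused = bidirectional_scan(a, b)  # earlier positions receive it via the backward pass
```

Here `causal` is `[0.0, 0.0, 0.0, 1.0]`, whereas `fused` is `[0.125, 0.25, 0.5, 2.0]`: the backward pass carries information from later positions to earlier ones, which suits non-streaming tasks where the full sequence is available.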

4. Application Domains and Empirical Performance

Mamba and its selective SSM variants have demonstrated strong empirical results and rapid adoption across a broad set of modeling domains:

Application Domain	Use of Selectivity/SSM	Principal Results
Language modeling	Token-selective SSM	State-of-the-art perplexity, 5× faster inference vs Transformer; robust long-context extrapolation (Gu et al., 2023)
Audio/Speech	Bidirectional, dual-path	SOTA in separation/SELD; up to 5× computational efficiency (Jiang et al., 2024, Mu et al., 2024)
Vision/Image	2D scans, structure fusion	Comparable or better ImageNet performance at reduced FLOPs (Xiao et al., 2024, Liu et al., 2024, Xiao et al., 2024)
Video	Spatio-temporal selective SSM	Linear scaling in frames; competitive to ViT baselines (Park et al., 2024)
Time Series	Semantic & spline enhancements	10–15% RMSE reduction vs Transformer, interpretable seasonal encoding (Ye, 3 Jun 2025)
Graph/Spatio-temporal	Selective SSM on graphs	Robust, efficient, and adversarially resistant dynamic link prediction (Yuan et al., 2024, Li et al., 2024)
Hyperspectral images	Separate spatial/spectral SSM, fusion	Up to 6.7% higher OA/AA vs prior Transformers (Wang et al., 2024)

Notably, in sound event localization and detection (SELD), replacing the Transformer-based Conformer decoder with bidirectional Mamba blocks yields 5× less computation and ~40% fewer parameters. The resulting model improves on state-of-the-art baselines in SELD_score (0.381 vs. 0.407, lower is better) while maintaining or improving joint SED, DoA, and SDE performance (Mu et al., 2024).

5. Model Compression, Pruning, and Efficiency Engineering

Given their scalable recurrence, Mamba models are conducive to various forms of structured pruning and compression. Key contributions include:

Sensitivity-based structured pruning: Selective removal of SSM state channels based on average activity or learned gating, with minimal accuracy loss and up to 1.14× inference speedup and almost 12% memory reduction at aggressive prune rates (Asif et al., 28 Nov 2025, Tuo et al., 11 Jun 2025).
Hardware-aware model surgery: Multi-granular block- and channel-level deletions with post-hoc recovery tuning; effective for both pure-Mamba and hybrid Mamba-Transformer models (Muñoz et al., 28 Jan 2025).
Semi-structured pruning (N:M patterns, e.g., 2:4) and one-shot optimal brain surgeon (OBS)-style schemes, achieving 50% unstructured sparsity with <1% drop in accuracy (Tuo et al., 11 Jun 2025).
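To make the N:M pattern in the list above concrete, the sketch below applies a 2:4 mask to a flat weight list, keeping the two largest-magnitude entries in every group of four. Magnitude is used here only as a simple stand-in criterion; the cited methods rely on sensitivity or OBS-style saliency scores instead. The function name `prune_n_of_m` is illustrative.

```python
def prune_n_of_m(weights, n=2, m=4):
    """Zero all but the n largest-magnitude entries in each group of m."""
    if len(weights) % m != 0:
        raise ValueError("length must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n entries with the largest absolute value.
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

# Made-up weights forming two groups of four.
weights = [0.1, -0.9, 0.4, 0.05, 2.0, -0.3, 0.0, 1.5]
sparse = prune_n_of_m(weights)
sparsity = sparse.count(0.0) / len(sparse)
```

Every group of four ends up with exactly two nonzero entries, so the result is 50% sparse in a regular pattern that sparse matrix hardware can exploit, in contrast to unstructured sparsity where zeros may fall anywhere.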

Optimization best practices (segment parallelization, kernel fusion, mixed-precision, activation recomputation, bridge layers post-pruning) ensure that Mamba maintains linear scaling and hardware friendliness at all stages.

6. Theoretical Properties and Expressiveness

Circuit complexity analysis establishes that, with poly(n)-precision and constant-depth per block, both Mamba and Transformer architectures reside in DLOGTIME-uniform $\mathsf{TC}^0$. Thus, Mamba does not exceed the theoretical expressive power of Transformers: neither can solve $\mathsf{NC}^1$-complete problems (e.g., formula evaluation) unless $\mathsf{TC}^0 = \mathsf{NC}^1$ (Chen et al., 2024). This places a boundary on what content-dependent SSMs can provably compute in the finite-precision, constant-depth regime.

Further analysis of token dynamics reveals that, in the continuous limit, SSM parameters must be carefully chosen to prevent global collapse or instability. Empirical refinements (positive-definite parameterizations, importance-based token sorting) yield measurable performance improvements (Vo et al., 2024).

7. Future Directions, Limitations, and Impact

Selective SSM/Mamba research is evolving rapidly, with ongoing investigations in scalability, hybridization with attention, compositional memory, and ontological embeddings. While the current paradigm achieves hardware-efficient, high-quality modeling of long-range dependencies, limitations persist in reproducing complex attention patterns, handling purely non-causal data, and establishing scaling laws for trillion-parameter models (Liu et al., 2024, Pióro et al., 2024).

Key open questions include optimal multi-dimensional scan orders, refinement of gating/selection functions, principled architectural scaling (including sparse mixture-of-experts), and domain-specific parameterizations for diverse data structures.

Overall, the Selective Structured State Space Model offers a linear-time, expressive, and domain-adaptable alternative to attention mechanisms, with strong empirical results and continued theoretical development.