
Mamba: Selective Structured State Space Models

Updated 18 December 2025
  • Mamba is a class of deep sequence models that uses input-dependent parameterization and dynamic gating to achieve efficient, linearly scalable processing.
  • It unifies elements of recurrent, convolutional, and attention architectures to handle long-context tasks in language, vision, and spatiotemporal graph analysis.
  • Empirical studies and ablations show that Mamba matches or outperforms Transformer baselines across diverse applications while maintaining computational efficiency.

Selective Structured State Space Models (Mamba) are a class of deep sequence models that generalize and surpass earlier Structured State Space Models (SSMs) by introducing input-dependent parameterization and dynamic gating, achieving linear complexity in sequence length and enabling efficient, content-aware processing across diverse domains. Mamba unifies the theory of state-space dynamical systems with modern deep learning hardware constraints, combining the strengths of recurrent, convolutional, and attention architectures while remaining highly expressive and scalable. Applications span language modeling, multimodal perception, vision, medical imaging, spatiotemporal graph learning, and reinforcement learning, with state-of-the-art results frequently reported in direct comparison to Transformer architectures.

1. Mathematical Foundations and Selective SSM Formulation

Mamba is fundamentally based on the discrete-time (or continuous-time) linear State Space Model $h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t$, $y_t = C_t h_t$, where $h_t$ is the hidden state, $x_t$ is the input (with possible token-level or graph-specific structure), and $\bar{A}_t$, $\bar{B}_t$ are discretized from the continuous-time matrices $A$, $B$ using, e.g., the zero-order hold method and made input-dependent via learnable, often neural, selection/gating networks $s_A(\cdot)$, $s_B(\cdot)$, $s_C(\cdot)$ (Gu et al., 2023, Li et al., 19 Mar 2024). This grants the model the ability to adapt its computation dynamically at each time step/token based on local context.
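
To make the recurrence concrete, the following minimal NumPy sketch unrolls the selective scan step by step for a single input channel. Names and shapes are illustrative assumptions, and the Python loop stands in for the fused, parallel-scan kernel used in practice.

```python
import numpy as np

def selective_ssm_scan(x, A_bar, B_bar, C):
    """Unrolled selective SSM recurrence for a single input channel.

    x:     (L,)   input values over L time steps
    A_bar: (L, N) per-step diagonal state transition (input-dependent)
    B_bar: (L, N) per-step input matrix (input-dependent)
    C:     (L, N) per-step output matrix (input-dependent)
    """
    L, N = A_bar.shape
    h = np.zeros(N)
    y = np.zeros(L)
    for t in range(L):
        h = A_bar[t] * h + B_bar[t] * x[t]  # diagonal A => elementwise state update
        y[t] = C[t] @ h                     # readout with content-dependent C_t
    return y

# Toy usage with random parameters (shape check only).
rng = np.random.default_rng(0)
L, N = 8, 4
y = selective_ssm_scan(rng.normal(size=L),
                       rng.uniform(0.8, 1.0, size=(L, N)),
                       rng.normal(size=(L, N)),
                       rng.normal(size=(L, N)))
print(y.shape)  # (8,)
```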

In standard SSMs, $A$, $B$, $C$ are fixed after training; in Mamba, they are parameterized or dynamically modulated by the current token/input: $\bar{A}_t = \exp(\Delta_t \odot A)$, $\bar{B}_t = s_B(x_t)$, $C_t = s_C(x_t)$, where $\Delta_t$ is a content-dependent step size (typically produced through a Softplus activation) and $\odot$ denotes elementwise multiplication. The gating/selection mechanism is critical: $\sigma(\Delta)$ (e.g., a sigmoid or softmax) zeroes out uninformative state channels or time steps. This approach enables Mamba to combine expressivity traditionally reserved for full self-attention mechanisms with the efficiency and hardware-friendliness of linear-recurrent systems (Gu et al., 2023, Li et al., 19 Mar 2024).
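
The selection step itself can be sketched in the same spirit. The snippet below assumes simple linear selection networks, a Softplus-parameterized step size, and a first-order (Euler-style) simplification of the discretization for $B$; the projection names are hypothetical, not those of any released implementation.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def select_parameters(x, A, W_delta, W_B, W_C):
    """Produce input-dependent SSM parameters from token embeddings.

    x:       (L, d)  token embeddings
    A:       (N,)    learned diagonal of the continuous-time state matrix (typically negative)
    W_delta: (d,)    projection producing the content-dependent step size Delta_t
    W_B:     (d, N)  linear selection network s_B
    W_C:     (d, N)  linear selection network s_C
    """
    delta = softplus(x @ W_delta)          # (L,)   Delta_t > 0 via Softplus
    A_bar = np.exp(delta[:, None] * A)     # (L, N) zero-order-hold discretization of diagonal A
    B_bar = delta[:, None] * (x @ W_B)     # (L, N) first-order discretization of s_B(x_t)
    C = x @ W_C                            # (L, N) s_C(x_t)
    return A_bar, B_bar, C
```

The resulting per-step parameters can be fed directly into the scan sketched above.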

2. Core Architectural Innovations and Theoretical Properties

The key innovation is selective gating and time-varying parameters; the resulting architecture comprises the following pipeline, with a simplified block sketch after the list (Gu et al., 2023, Li et al., 19 Mar 2024, Ye, 3 Jun 2025):

  • Input projection: Embedding via linear or convolutional layers.
  • Selective State Space Update: Dynamic per-time-step $A_t$, $B_t$, and $C_t$, often diagonal or block-diagonal for computational efficiency.
  • Gating and Selection: Elementwise gating modulates which hidden dimensions are updated, similar to input/forget gates in RNNs but parameterized at higher granularity.
  • Feedforward/MLP or MoE: Parallel or sequential combination of SSM and pointwise nonlinear branches, sometimes Mixture-of-Experts for scalability (Pióro et al., 8 Jan 2024).
  • Residual Connections and Layer Norm: Standard for stable deep stacking.
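
The sketch below assembles this pipeline into a single simplified block; layer names, shapes, and the gating arrangement are illustrative assumptions, and real implementations add causal convolutions, activations before the SSM, and fused kernels.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def silu(z):
    return z / (1.0 + np.exp(-z))

def mamba_block(x, params, ssm_scan):
    """One simplified Mamba block: project -> selective SSM -> gate -> project, plus residual.

    x:        (L, d) input sequence
    params:   dict with W_in (d, d_inner), W_gate (d, d_inner), W_out (d_inner, d)
    ssm_scan: callable (L, d_inner) -> (L, d_inner) implementing the selective scan
    """
    residual = x
    z = layer_norm(x)                 # pre-norm for stable deep stacking
    u = z @ params["W_in"]            # main branch (input projection)
    g = z @ params["W_gate"]          # gating branch
    u = ssm_scan(u)                   # selective state-space update along the sequence
    y = u * silu(g)                   # elementwise gating of the SSM output
    return residual + y @ params["W_out"]
```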

The computational complexity is strictly linear in sequence length LL (i.e., O(LN)O(L\,N) for state size NN), both for training and inference. In contrast, full self-attention models exhibit quadratic complexity O(L2d)O(L^2 d) in both compute and memory, where dd is the embedding dimension. This enables Mamba to handle sequences of L>104L>10^4 tokens and above, crucial for tasks such as long-context language modeling, video, genomics, and large-scale dynamic graph analysis (Asif et al., 28 Nov 2025, Li et al., 19 Mar 2024).
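
As a back-of-the-envelope illustration of this gap (the specific sizes below are assumptions chosen for the example, and constant factors are ignored):

```python
# Rough per-layer operation counts over a long sequence (order of magnitude only).
L, N, d = 16_384, 16, 2_048       # sequence length, SSM state size, embedding dimension

ssm_ops = L * N * d               # selective SSM: O(L*N) per channel across d channels
attn_ops = L * L * d              # full self-attention: O(L^2 * d)

print(f"selective SSM  ~ {ssm_ops:.1e} ops")   # ~5.4e8
print(f"self-attention ~ {attn_ops:.1e} ops")  # ~5.5e11
print(f"ratio ~ {attn_ops / ssm_ops:.0f}x")    # L / N = 1024x
```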

Mamba's selective mechanism recovers the expressive power of self-attention in content-based reasoning—allowing the network to propagate, suppress, or reset state dynamically—while retaining the advantages of a fully parallelizable and scan-friendly recurrence (Gu et al., 2023). Theoretical analysis confirms that, in practical settings, selective SSMs (and thus Mamba) have computational expressivity on par with Transformers when implemented as $\mathsf{TC}^0$ constant-depth threshold circuits at fixed or polynomial precision (Chen et al., 9 Dec 2024).

3. Algorithmic Extensions and Domain Specializations

Mamba has inspired numerous variants and extensions across modalities:

  • Spatial-Temporal Graph Mamba (STG-Mamba) introduces graph-aware, node-wise selective state-space updates for spatiotemporal graph forecasting, integrating a Kalman-style GNN fusion to handle different temporal granularities. The architecture stacks Graph Selective State Space Blocks, each incorporating a spatial-temporal selective SSM (ST-S3M) and a Kalman Filtering GNN (KFGN), yielding superior accuracy and efficiency on large STG forecasting datasets (Li et al., 19 Mar 2024).
  • ss-Mamba augments the selective SSM backbone with semantic embeddings (via BERT or similar LLMs) and trainable spline-based Kolmogorov-Arnold temporal encodings, enhancing generalization, interpretability, and robustness in time series forecasting (Ye, 3 Jun 2025).
  • SparseSSM, PerfMamba, and Mamba-Shedder address model compression and hardware efficiency by analyzing and pruning low-activity, redundant state channels via activity-based or second-order (OBS-style) saliency, achieving speedups up to 1.14–1.4× with minimal or zero accuracy degradation (Tuo et al., 11 Jun 2025, Asif et al., 28 Nov 2025, Muñoz et al., 28 Jan 2025).
  • Spatial-Mamba and VideoMamba adapt the selective SSM paradigm to images and video by fusing 1D state-space scans with spatial or spatiotemporal neighborhood context via learned convolutional fusion or bidirectional scanning (Xiao et al., 19 Oct 2024, Park et al., 11 Jul 2024); a minimal bidirectional-scan sketch follows this list.
  • S²Mamba, DG-Mamba, STG-Mamba extend the selective SSM framework to hyperspectral imaging, dynamic graph learning, and multi-scale spatiotemporal graphs, employing multi-branch SSMs and learnable mixture/fusion gates to disentangle and adapt across multiple axes (e.g., spatial, spectral, temporal) (Wang et al., 28 Apr 2024, Yuan et al., 11 Dec 2024, Li et al., 19 Mar 2024).
  • MambaMIM, MambaTS, MoE-Mamba elaborate on the basic framework for masked modeling, long-term time-series forecasting, and efficient scaling using MoE layers (Tang et al., 15 Aug 2024, Cai et al., 26 May 2024, Pióro et al., 8 Jan 2024).
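
As one concrete instance of the scan-adaptation ideas above, the sketch below runs a selective scan forward and backward over a flattened patch sequence and averages the two passes; the averaging is an illustrative stand-in for the learned convolutional or gated fusion used in the cited models.

```python
import numpy as np

def bidirectional_scan(tokens, scan_fn):
    """Fuse forward and backward selective scans over a flattened token sequence.

    tokens:  (L, d) flattened image or video patches
    scan_fn: callable (L, d) -> (L, d), e.g. a selective SSM scan
    """
    fwd = scan_fn(tokens)              # left-to-right pass
    bwd = scan_fn(tokens[::-1])[::-1]  # right-to-left pass, re-reversed to align positions
    return 0.5 * (fwd + bwd)           # naive average; real models learn the fusion

# Toy usage with an identity "scan" as a stand-in.
x = np.random.default_rng(0).normal(size=(16, 8))
assert bidirectional_scan(x, lambda t: t).shape == x.shape
```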

4. Empirical Performance and Ablation Studies

Mamba and its derivatives have consistently demonstrated state-of-the-art or superior results across a range of tasks, including but not limited to:

  • Language Modeling: Mamba-3B matches or exceeds Transformer models twice its size on benchmarks including The Pile, LAMBADA, ARC, and others, while being up to 5× faster at inference (Gu et al., 2023).
  • Time Series Forecasting: ss-Mamba achieves lower RMSE in both single- and multi-series and superior robustness to context length versus both Transformers and prior SSM-based approaches (Ye, 3 Jun 2025). MambaTS outperforms advanced Transformer-based baselines such as PatchTST and FEDformer on eight LTSF datasets (Cai et al., 26 May 2024).
  • Vision and Medical Imaging: Spatial-Mamba and MambaMIM yield top results on ImageNet, COCO, ADE20K, BTCV (CT), and other imaging/segmentation benchmarks. MambaMIM, when used as a pre-training strategy, gives large Dice/DSC gains, especially on small organs and in data-constrained medical image domains (Tang et al., 15 Aug 2024, Xiao et al., 19 Oct 2024).
  • Spatiotemporal Graphs: On STG forecasting datasets (PeMS04, HZMetro, KnowAir), STG-Mamba attains the lowest RMSE and highest $R^2$ among published methods, while reducing inference time and maintaining linear FLOPs scaling (Li et al., 19 Mar 2024).
  • Speech and Video: Speech-Mamba and Dual-path Mamba outperform competitive Transformer and SSM baselines on LibriSpeech and WSJ0-2mix tasks; VideoMamba outperforms VideoSwin and other video recognition models on Something-Something V2 and HMDB51 at significantly lower GFLOPs and parameter counts (Gao et al., 27 Sep 2024, Jiang et al., 27 Mar 2024, Park et al., 11 Jul 2024).
  • Ablation Evidence: Component ablations consistently confirm that the selective gating, input-dependence of $A$, $B$, $C$, and fusion modules (e.g., Kalman fusion, mixture gates) are critical—removal degrades performance to or below strong GNN or SSM-only baselines (Li et al., 19 Mar 2024, Cai et al., 26 May 2024).

5. Model Compression, Efficiency, and Interpretability

Selective SSMs (and Mamba) facilitate aggressive model pruning due to the inherent sparsity in state-channel activity and structured parameterization:

  • One-Shot Pruning: SparseSSM applies a closed-form, second-order saliency criterion derived from the diagonals of the SSM's state-transition matrix, achieving 50% sparsity in SSM parameters without degrading zero-shot accuracy, with results competitive with or superior to other attention-shedder or GPT-based sparsity techniques (Tuo et al., 11 Jun 2025).
  • Performance Profiling: PerfMamba analysis indicates that the SSM block is the dominant compute/memory bottleneck in both Mamba-1 and Mamba-2; channel activity scores from the selection gate ($\Delta_t$) can be used to prune 10–50% of channels for a 1.10–1.14× speedup with <1% accuracy loss, providing precise guidelines for best-practice deployment (Asif et al., 28 Nov 2025); a minimal pruning sketch follows this list.
  • Interpretability: MambaLRP introduces stable, axiomatically faithful Layer-wise Relevance Propagation adapted to selective SSMs, enabling analysis and explanation of long-range information flow and model decision pathways (Jafari et al., 11 Jun 2024). Further, spline-based encodings in ss-Mamba yield directly interpretable time functions (Ye, 3 Jun 2025).
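
A minimal sketch of the activity-based channel pruning mentioned in the profiling bullet above, assuming access to step sizes $\Delta_t$ collected on a calibration set; the ranking rule and names are illustrative, and the cited works use more refined saliency criteria.

```python
import numpy as np

def prune_low_activity_channels(delta, prune_ratio=0.3):
    """Rank SSM channels by mean gate activity and return a boolean keep-mask.

    delta:       (T, C) step sizes Delta_t collected over T calibration tokens
                 for C state channels (larger Delta => channel is updated more).
    prune_ratio: fraction of least-active channels to drop.
    """
    activity = np.abs(delta).mean(axis=0)     # (C,) mean activity per channel
    n_prune = int(prune_ratio * delta.shape[1])
    cutoff = np.sort(activity)[n_prune]       # activity threshold for pruning
    return activity >= cutoff                 # True for channels to keep

# Toy usage: 1000 calibration tokens, 64 channels.
rng = np.random.default_rng(0)
keep = prune_low_activity_channels(rng.exponential(size=(1000, 64)), prune_ratio=0.3)
print(keep.sum(), "channels kept of", keep.size)
```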

6. Complexity Analysis and Computational Expressivity

The core Mamba block—a composition of linear projections, convolution, selection, gating, and SSM recurrence—can be implemented in constant depth and polynomial size by threshold circuits operating on polynomial-precision floats, i.e., it lies in $\mathsf{TC}^0$. This placement is exactly analogous to known results for softmax- or average-attention Transformers and demonstrates that, at finite precision and fixed depth, neither Mamba nor Transformers are strictly more computationally expressive than the other.

This theoretical limitation holds for canonical settings (constant-depth, uniform circuits), though effective depth can be increased by stacking many blocks, as in practical $N$-layer architectures (where $N$ may be in the tens or hundreds), and the resulting expressivity is empirically sufficient for state-of-the-art prediction, sequence modeling, and reasoning.

7. Open Research Directions and Practical Considerations

Numerous open challenges and future work areas for Selective SSM/Mamba models are highlighted:

  • Generalization Beyond Language and Vision: While SSM-based Mamba architectures have demonstrated SOTA in vision, graph, and time-series tasks, applications to social graphs, multimodal fusion, high-dimensional control, and more complex scientific data remain active areas (Yuan et al., 11 Dec 2024, Asif et al., 28 Nov 2025).
  • Model Interpretability and Black-box State Evolution: Understanding the interpretable evolution of content-dependent latent states, visualizing attention-like selectivity, and developing further explainability tools are ongoing (Jafari et al., 11 Jun 2024).
  • Handling Missing Data and Robustness: Incorporating structured imputation, uncertainty quantification (e.g., via Kalman filtering), and adversarial robustness mechanisms are core concerns for practical deployment (Li et al., 19 Mar 2024, Yuan et al., 11 Dec 2024).
  • Hybrid Architectures and Foundation Models: Combining selective SSMs with attention, memory, CNN, and MoE mechanisms, scaling up to multi-billion/trillion parameters, and integrating foundation model pre-training techniques (semantic embeddings, meta-learning, multi-modal input) are active research themes (Ye, 3 Jun 2025, Pióro et al., 8 Jan 2024, Liu et al., 7 May 2024).

In summary, Selective Structured State Space Models (Mamba) enable content-dependent, linearly efficient sequence modeling with hardware-friendly compute patterns, matching or exceeding Transformer benchmarks across modalities and tasks, and showing rapid evolvability for domain-specific adaptation, model compression, and foundation model construction (Gu et al., 2023, Li et al., 19 Mar 2024, Ye, 3 Jun 2025, Asif et al., 28 Nov 2025).
