Mamba: Selective State Space Models in Deep Learning

Updated 3 August 2025
  • Mamba is a class of selective state space models (SSMs) that dynamically adjust parameters based on input for efficient long-sequence processing.
  • It unifies recurrence and convolution through a hardware-optimized parallel scan, achieving linear computational complexity.
  • Mamba has been applied in time series forecasting, computer vision, surrogate PDE simulation, and recommendation, offering improved efficiency and competitive performance.

Mamba is a class of selective, input-dependent state space models (SSMs) that provide an alternative paradigm to classical attention mechanisms in deep learning, unifying recurrence and convolution within a hardware-optimized framework. Mamba models operate with linear computational complexity in sequence length and have been rapidly adopted across time series forecasting, computer vision, tabular modeling, surrogate PDE simulation, medical imaging, robotics, and multimodal fusion. The selective SSM principle at Mamba's core enables dynamic context propagation, adjusting state evolution locally via learned, input-conditioned parameters, while its scan-based, hardware-aware implementation yields practical throughput and scaling in long-sequence and high-dimensional domains.

1. Mathematical Formulation and Core Mechanism

At its foundation, Mamba is built upon the classical state space model, discretized for sequence processing, but crucially modulates its parameters dynamically based on the input:

  • Continuous form:

$$\mathbf{h}'(t) = \mathbf{A}\,\mathbf{h}(t) + \mathbf{B}\,\mathbf{x}(t), \qquad \mathbf{y}(t) = \mathbf{C}\,\mathbf{h}(t)$$

  • Discrete (with step $\Delta$):

$$\mathbf{h}_t = \overline{\mathbf{A}}\,\mathbf{h}_{t-1} + \overline{\mathbf{B}}\,\mathbf{x}_t, \qquad \mathbf{y}_t = \mathbf{C}\,\mathbf{h}_t$$

where

$$\overline{\mathbf{A}} = \exp(\Delta \mathbf{A}), \qquad \overline{\mathbf{B}} = (\Delta \mathbf{A})^{-1}\left(\exp(\Delta \mathbf{A}) - \mathbf{I}\right)\Delta \mathbf{B}$$

The essence of Mamba’s selectivity is that $\mathbf{B}$, $\mathbf{C}$, and sometimes the discretization step $\Delta$ are functions of $\mathbf{x}_t$, generally parameterized by a linear projection. This dynamic parameterization converts the SSM from a time-invariant to a locally adaptive process, allowing the model to propagate or forget information in an input-dependent manner.
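
As a concrete illustration, the following NumPy sketch implements the selective recurrence above in its plain sequential form (no parallel scan, no fused kernel). The diagonal parameterization of $\mathbf{A}$ and the projection names `W_B`, `W_C`, `W_delta` are illustrative assumptions, not the reference implementation:

```python
import numpy as np

def selective_ssm(x, A, W_B, W_C, W_delta):
    """Sequential reference of a selective SSM (no parallel scan, no fused kernel).

    x:        (L, D) token sequence
    A:        (D, N) diagonal state transition, one row of N eigenvalues per channel
    W_B, W_C: (D, N) projections producing input-dependent B_t, C_t (shared across channels)
    W_delta:  (D, D) projection producing a per-channel step size Delta_t
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                          # one hidden state per channel
    y = np.zeros((L, D))
    for t in range(L):
        delta = np.log1p(np.exp(x[t] @ W_delta))  # softplus keeps the step positive, shape (D,)
        B_t = x[t] @ W_B                          # (N,) input-dependent input matrix
        C_t = x[t] @ W_C                          # (N,) input-dependent readout
        A_bar = np.exp(delta[:, None] * A)        # ZOH: exp(Delta * A), shape (D, N)
        B_bar = (A_bar - 1.0) / A * B_t           # (Delta A)^{-1}(exp(Delta A) - I) Delta B for diagonal A
        h = A_bar * h + B_bar * x[t][:, None]     # selective state update
        y[t] = h @ C_t                            # per-channel output
    return y
```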

Architecturally, Mamba blocks frequently expand channel dimensions via linear layers, apply 1D convolutions with nonlinearities (e.g., SiLU), and integrate the selective SSM operation within an MLP or residual module. Bidirectional operation—parallel forward and backward scans—is widely employed in vision and time series tasks to recover a full receptive field previously unavailable in causal SSMs (Wang et al., 17 Mar 2024, Xu et al., 29 Apr 2024).
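
A corresponding block-level sketch, again with hypothetical parameter names and with normalization omitted, shows how the channel expansion, depthwise causal convolution, SiLU nonlinearity, gating, and selective SSM fit together:

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def mamba_block(x, params, ssm):
    """Schematic Mamba block: expand -> causal conv1d + SiLU -> selective SSM -> gate -> project.

    x: (L, D) input; params is a dict of hypothetical weights; ssm is a selective-SSM
    callable such as selective_ssm above.
    """
    L, D = x.shape
    u = x @ params["W_in"]                        # (L, E) expanded main branch
    g = x @ params["W_gate"]                      # (L, E) gating branch
    K = params["conv"].shape[0]                   # depthwise causal conv weights, shape (K, E)
    u_pad = np.concatenate([np.zeros((K - 1, u.shape[1])), u], axis=0)
    u = np.stack([(u_pad[t:t + K] * params["conv"]).sum(axis=0) for t in range(L)])
    u = silu(u)
    u = ssm(u, params["A"], params["W_B"], params["W_C"], params["W_delta"])
    y = (u * silu(g)) @ params["W_out"]           # SiLU-gated combination, projected back to D
    return x + y                                  # residual connection
```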

2. Comparison to Self-Attention and Other Predecessors

Mamba systematically addresses bottlenecks in both Convolutional Neural Networks (CNNs) and Transformer-based attention:

| Model Class | Complexity | Long-range Dependency | Receptive Field |
|---|---|---|---|
| CNN | O(L) | Weak (local) | Increases with depth |
| Transformer | O(L²) | Strong (global, via attention) | Global |
| Mamba (SSM) | O(L) | Strong (via recurrence/scan) | Global (with bidirectional scan) |

Transformers compute explicit all-to-all similarities via self-attention, incurring $O(L^2)$ cost in both compute and memory, fundamentally limiting their scalability for long sequences and high-resolution vision. CNNs are efficient but cannot aggregate global context without stacking many layers. Mamba’s SSM recurrence enables efficient context propagation across arbitrarily long inputs because the state updates recursively accumulate information, and selective parameterization makes this process input-adaptive (Xu et al., 29 Apr 2024, Liu et al., 7 May 2024).

Hybrid models—where Mamba SSM layers replace or augment attention modules—leverage the strengths of both paradigms and have become common in foundation models (Zou et al., 24 Jun 2024). The SSM mechanism itself can be interpreted as a scanning kernel, and both SSMs and Transformers can be situated within a unified kernel framework, providing theoretical links between the two (Zou et al., 24 Jun 2024).

3. Efficient Sequence Modeling: Hardware-Aware Parallel Scan

Mamba’s practical impact is derived from its hardware-optimized scan algorithms. Unlike traditional SSMs that process sequences strictly sequentially, Mamba replaces recurrence with a parallel scan, exploiting GPU memory hierarchy (e.g., fusing per-thread register computations with segmented kernel fusion) to accelerate training and inference while maintaining low memory overhead (Liu et al., 7 May 2024). This scan-based parallelization achieves true $O(L)$ throughput in both training and inference, a requirement for modeling long sequences (e.g., million-token inputs) in visual or language domains.
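
The key property exploited by the scan is that the linear recurrence $\mathbf{h}_t = \overline{\mathbf{A}}_t \mathbf{h}_{t-1} + \overline{\mathbf{B}}_t \mathbf{x}_t$ is associative, so it can be evaluated in logarithmic depth. The sketch below conveys the idea in NumPy (a Hillis–Steele style scan); the actual Mamba kernel fuses this with the discretization and keeps states in on-chip memory, which this illustration does not attempt:

```python
import numpy as np

def parallel_linear_scan(a, b):
    """Evaluate h_t = a_t * h_{t-1} + b_t for all t in O(log L) sweeps over the sequence.

    a, b: arrays of shape (L, ...) holding the per-step A_bar and B_bar * x_t terms.
    The pairwise combine ((a1, b1), (a2, b2)) -> (a1 * a2, a2 * b1 + b2) is associative,
    which is what allows the recurrence to be computed as a parallel prefix scan.
    """
    a, b = a.copy(), b.copy()
    L = a.shape[0]
    shift = 1
    while shift < L:
        a_prev = a[:-shift].copy()                # segment ending at t - shift
        b_prev = b[:-shift].copy()
        b[shift:] = a[shift:] * b_prev + b[shift:]
        a[shift:] = a[shift:] * a_prev
        shift *= 2
    return b                                      # b[t] now holds h_t (with zero initial state)
```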

The locally bi-directional variant (LBMamba) further embeds lightweight backward scans inside the forward scan at the tile level, eliminating the need for an expensive global backward pass and maintaining full receptive-field coverage with modest extra compute (a 27% FLOP increase but only a 2% CUDA runtime slowdown) alongside substantial throughput and memory gains (Zhang et al., 19 Jun 2025).

4. Applications Across Domains

a. Time Series Forecasting

Mamba-based models such as Simple-Mamba (S-Mamba) match or surpass strong Transformer baselines (iTransformer, PatchTST, Crossformer, etc.) on 13 public time series datasets. S-Mamba’s bidirectional Mamba block efficiently encodes inter-variate correlations; the architecture utilizes:

  • Variate-autonomous tokenization via linear layers,
  • Bidirectional Mamba for inter-variate correlation (VC) encoding,
  • Feed-forward network for temporal dependency (TD) encoding,
  • Projection for forecasting (Wang et al., 17 Mar 2024).

In periodic, highly-correlated datasets (e.g., Traffic, Electricity), S-Mamba attains leading or competitive performance under mean squared error (MSE) and mean absolute error (MAE) while reducing GPU memory and training time—making it preferable for resource-constrained or real-time applications.
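
A schematic of the S-Mamba forecasting pass described above, with hypothetical callables standing in for the forward/backward Mamba blocks and the feed-forward network:

```python
def s_mamba_forecast(series, tokenize_W, mamba_fwd, mamba_bwd, ffn, proj_W):
    """Schematic S-Mamba pass (hypothetical names; normalization and dropout omitted).

    series: (V, L_in) multivariate history with one row per variate. Each variate is
    embedded as a single token, a bidirectional Mamba mixes information across
    variates, a feed-forward network encodes temporal dependencies, and a linear
    head maps to the forecast horizon.
    """
    tokens = series @ tokenize_W                                  # (V, d): variate-autonomous tokenization
    mixed = mamba_fwd(tokens) + mamba_bwd(tokens[::-1])[::-1]     # bidirectional inter-variate (VC) encoding
    temporal = ffn(mixed)                                         # temporal-dependency (TD) encoding
    return temporal @ proj_W                                      # (V, L_out) forecast per variate
```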

b. Computer Vision

Vision Mamba and variants have been quickly embraced for image classification, segmentation, restoration, generation, video understanding, 3D (point cloud), and multimodal fusion (Xu et al., 29 Apr 2024, Liu et al., 7 May 2024, Rahman et al., 4 Oct 2024).

Critical design aspects include:

  • Sophisticated scanning: raster, zigzag, spiral, bidirectional, multi-axis, and local/tile-based scan conversions of 2D/3D inputs for SSM processing (Xu et al., 29 Apr 2024, Rahman et al., 4 Oct 2024).
  • Task-specific modules such as deformable token generators (DTMB) for robust detection of occluded or small objects in UAVD-Mamba (Li et al., 1 Jul 2025), and conditional positional encoding in Serialized Point Mamba segmentation (Wang et al., 17 Jul 2024).
  • Hybrid architectures for U-Net segmentation, multi-head/local scan, or combining Mamba SSMs with convolutions and attention in medical imaging (Bansal et al., 3 Oct 2024).
  • Substantial reduction in memory usage and inference latency in high-resolution or multi-modal scenarios.

c. Tabular Recommendation

The FT-Mamba model replaces the Transformer blocks in an FT-Transformer architecture with Mamba SSM layers while retaining the feature tokenizer: numerical and categorical features are tokenized, a [CLS] token is appended at the end of the sequence, and the model matches or exceeds Transformer baselines on precision, recall, MRR, and HR with fewer parameters and linear scalability, e.g., HR@1 of 97.7% on Spotify versus 84.11% for the Transformer (Starnes et al., 11 Sep 2024).
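
A rough sketch of such a tabular pipeline, with hypothetical names and a single shared embedding table standing in for per-feature categorical embeddings:

```python
import numpy as np

def ft_mamba_forward(x_num, x_cat, W_num, b_num, cat_embeds, cls_token, mamba_layers, head_W):
    """Schematic FT-Mamba pass (hypothetical names, simplified tokenizer).

    x_num: (n_num,) numerical features; x_cat: (n_cat,) integer category indices.
    Each feature becomes one token, and a [CLS] token is appended at the end so the
    causal Mamba layers can aggregate all feature tokens into it.
    """
    num_tokens = x_num[:, None] * W_num + b_num                   # (n_num, d): one token per numeric feature
    cat_tokens = cat_embeds[x_cat]                                # (n_cat, d): embedding lookup
    tokens = np.concatenate([num_tokens, cat_tokens, cls_token[None, :]], axis=0)
    for layer in mamba_layers:                                    # Mamba blocks in place of attention
        tokens = layer(tokens)
    return tokens[-1] @ head_W                                    # prediction read from the [CLS] position
```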

d. Surrogate Modeling in PDEs

LE-PDE++ utilizes Mamba for latent dynamic evolution in PDE surrogates (fluid, pollutant transport, shallow water). Mamba’s linear latent space update:

$$\mathbf{h}_t = \exp(\Delta \mathbf{A})\,\mathbf{h}_{t-1} + (\Delta \mathbf{A})^{-1}\left(\exp(\Delta \mathbf{A}) - \mathbf{I}\right)\Delta \mathbf{B}\,\mathbf{x}_t$$

enables causal convolution over time, doubling inference speed versus prior dynamics modules, maintaining parameter efficiency, and preserving accuracy with progressive sampling strategies (Liang et al., 4 Nov 2024).
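
The update can be transcribed directly into NumPy for illustration (assuming an invertible, explicitly materialized $\mathbf{A}$; practical models use diagonal or structured parameterizations that avoid the explicit inverse, and this is not the authors' implementation):

```python
import numpy as np
from scipy.linalg import expm

def zoh_latent_step(h_prev, x_t, A, B, delta):
    """One zero-order-hold latent update, transcribing the equation above directly.

    h_prev: (N,) latent state; x_t: (M,) latent input; A: (N, N); B: (N, M); delta: scalar.
    Assumes A is invertible.
    """
    A_bar = expm(delta * A)                                       # exp(Delta A)
    B_bar = np.linalg.inv(delta * A) @ (A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar @ h_prev + B_bar @ x_t
```

Rolling this step over the encoded latent states yields the autoregressive surrogate evolution described above.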

5. Innovations in Model Architecture and Scanning

Significant advances have focused on improving spatial context and multimodal processing:

  • Deformable Token Mamba Block (DTMB) generates deformable and normal tokens through classic and deformable convolutions, feeding a Mamba block for scale-adaptive detection (UAVD-Mamba) (Li et al., 1 Jul 2025).
  • Serialized Point Mamba serializes 3D point clouds via space-filling curves, integrating staged learning, grid pooling, and CPE for scalable segmentation with linear complexity (Wang et al., 17 Jul 2024).
  • DM-Mamba for MRI reconstruction leverages dual-domain (k-space/image) hierarchical Mamba with circular scanning and multi-scale strategies to preserve frequency structure and reduce forgetting (Meng et al., 14 Jan 2025).

Locally bi-directional scanning (LBMamba) embeds a backward scan within local tiles processed in per-thread registers, removing the need for an expensive global backward scan and preserving efficiency (Zhang et al., 19 Jun 2025). Notably, alternating scan directions at layer boundaries allows full receptive-field recovery without communication overhead.
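
A rough NumPy illustration of the local backward-scan idea; how the two state streams are merged into the block output is left abstract here and differs in the actual LBMamba kernel:

```python
import numpy as np

def locally_bidirectional_scan(a, b, tile=16):
    """Sketch of a locally bi-directional scan for a diagonal SSM.

    a, b: (L, N) per-step A_bar and B_bar * x_t. A single global forward scan gives
    causal states; a backward scan restricted to each tile adds right-to-left context
    without a second global pass.
    """
    L, N = b.shape
    h_fwd = np.zeros_like(b)
    h = np.zeros(N)
    for t in range(L):                            # global (causal) forward scan
        h = a[t] * h + b[t]
        h_fwd[t] = h
    h_loc = np.zeros_like(b)
    for start in range(0, L, tile):               # lightweight backward scan inside each tile
        h = np.zeros(N)
        for t in range(min(start + tile, L) - 1, start - 1, -1):
            h = a[t] * h + b[t]
            h_loc[t] = h
    return h_fwd, h_loc
```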

6. Empirical Performance and Impact

Across domains, Mamba models consistently provide accuracy on par with or surpassing Transformer or CNN baselines while improving efficiency:

  • S-Mamba outperforms patch- and attention-based models on periodic, high-dimensional TSF tasks and reduces memory usage (Wang et al., 17 Mar 2024).
  • LBVim with LBMamba achieves 0.8–1.6% higher top-1 accuracy on ImageNet-1K, 0.6–2.7% higher mIoU on ADE20K, and up to 3.39% F1 improvement in pathology WSI MIL compared to prior MambaMIL (full global bi-directional scan) at no cost in throughput (Zhang et al., 19 Jun 2025).
  • In optical flow, MambaFlow achieves an EPE of 1.43 with 0.113 s inference on Sintel, surpassing GMFlow (18.9% lower EPE and 18.1% faster) and other state-of-the-art methods, making it attractive for real-time deployment (Du et al., 10 Mar 2025).
  • In personalized recommendations, reduced parameter counts coupled with better or matched accuracy metrics make Mamba appealing for production-scale recommender systems (Starnes et al., 11 Sep 2024).

The growing number of models and variants, including VideoMamba for video, U-Mamba/Mamba-UNet and MedMamba for medical imaging, VL-Mamba for multimodal fusion, and PhyxMamba for generative modeling of chaotic systems with attractor-geometry regularization (Liu et al., 29 May 2025), demonstrates rapid community uptake and versatile applicability.

7. Challenges, Limitations, and Future Prospects

Current limitations and research directions include:

  • Stability and generalization: Non-causal sequence scanning (required for 2D/3D) can induce instabilities (e.g., vanishing/exploding gradients), and scan redundancy may partially erode linear efficiency gains (Xu et al., 29 Apr 2024).
  • Preservation of spatial context: Transforming images to sequences can risk spatial structure loss; innovations in scan/fusion (e.g., zigzag, multi-axis, local windows) are active areas of refinement (Liu et al., 7 May 2024, Rahman et al., 4 Oct 2024).
  • Model interpretability: The input-conditioned dynamics and hierarchical state propagation complicate explainability compared to explicit attention maps, motivating new theoretical and empirical analyses (Rahman et al., 4 Oct 2024).
  • Coordination with attention: Hybrid and dual frameworks (Mamba-2, SSD) may alleviate cases where fine-grained context is sacrificed for efficiency, with kernel-based unification offering insight into trade-offs (Zou et al., 24 Jun 2024).
  • Community and model availability: Pretrained Mamba models and mature open-source repositories continue to be developed, enabling further scaling, transfer learning, and benchmarking (Xu et al., 29 Apr 2024, Liu et al., 7 May 2024, Rahman et al., 4 Oct 2024).

This ongoing evolution, hardware awareness, and demonstrated empirical strength in long-sequence, real-world, and high-dimensional settings position Mamba architectures as central tools in current and next-generation deep learning systems.