Vision Mamba: Efficient SSM Vision Backbone

Updated 23 December 2025
  • Vision Mamba is a family of vision backbones that uses state space models and selective scan strategies to enable efficient and scalable visual representation learning.
  • It replaces quadratic self-attention with linear-time SSM recurrences and tailored scanning routes to achieve global context aggregation while reducing memory and computation.
  • Empirical results demonstrate that VM architectures outperform Vision Transformers in throughput, memory efficiency, and accuracy across tasks like classification, segmentation, and medical imaging.

Vision Mamba (VM) is a family of vision backbones that adapt state space models (SSMs), specifically the "Mamba" selective scan paradigm, for efficient and scalable visual representation learning. By replacing quadratic-time self-attention with linear-time SSM recurrences and introducing tailored scanning strategies, VM architectures achieve global context aggregation while dramatically improving throughput, memory efficiency, and resolution scalability compared to Vision Transformers (ViTs). Vision Mamba serves as the foundation for a broad ecosystem of general- and domain-specific visual models, with empirical superiority demonstrated in a range of tasks including classification, dense prediction, 3D/medical imaging, and visual generation.

1. State Space Model Foundation and Selective Scan Mechanism

VM builds on the structured state-space formulation inherited from recent advances in SSMs for sequence modeling. A continuous-time SSM is characterized by:

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$$

where $h(t) \in \mathbb{R}^N$ is a hidden state and $x(t), y(t)$ are the input and output signals. Discretization (typically via zero-order hold) yields:

$$h_t = \overline{A}\,h_{t-1} + \overline{B}\,x_t, \qquad y_t = C\,h_t$$

$$\overline{A} = \exp(\Delta A), \qquad \overline{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$$

Mamba's defining insight is to promote the SSM kernels from time-invariant to selective, making $\overline{A}, \overline{B}, C$ token-dependent via pointwise neural networks. This instantiates a "selective scan": at each sequence position, the recurrence parameters adapt dynamically, enabling input-dependent contextualization at linear $O(L)$ cost, where $L$ is sequence length (Zhu et al., 17 Jan 2024, Zhang et al., 24 Apr 2024, Xu et al., 29 Apr 2024).
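As a concrete illustration of the recurrence above, the following NumPy sketch runs the discretized selective scan for a diagonal per-channel $A$, using the simplified $\overline{B} \approx \Delta B$ discretization common in Mamba-style implementations; the shapes and function name are illustrative, not the reference implementation.

```python
import numpy as np

def selective_scan(x, A, B, C, delta):
    """Minimal selective-scan recurrence (diagonal A, per-channel states).

    x:     (L, D)  input token features
    A:     (D, N)  continuous-time state matrix (diagonal per channel)
    B:     (L, N)  token-dependent input projection (selective)
    C:     (L, N)  token-dependent output projection (selective)
    delta: (L, D)  token-dependent step sizes
    Returns y: (L, D)
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                                  # hidden state per channel
    y = np.zeros((L, D))
    for t in range(L):                                    # linear in sequence length L
        A_bar = np.exp(delta[t][:, None] * A)             # zero-order-hold for A
        B_bar = delta[t][:, None] * B[t][None, :]         # simplified Delta*B discretization
        h = A_bar * h + B_bar * x[t][:, None]             # recurrence
        y[t] = (h * C[t][None, :]).sum(axis=-1)           # readout
    return y
```

Production kernels fuse this loop into a parallel scan on the GPU, but the per-token logic is the same.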

For vision, the scan order is crucial: VM "scans" tokens derived from image patches along one or more spatial traversals (e.g., rows, columns, diagonals). Bidirectional and multidirectional scans are often employed to compensate for the inherent causality of SSMs in a 1D sequence (Liu et al., 18 Jan 2024, Zhu et al., 17 Jan 2024).
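The traversal routes themselves can be expressed as simple index permutations over the $H \times W$ patch grid; the helper below is a minimal sketch (the name and the particular route set are illustrative).

```python
import numpy as np

def scan_routes(H, W):
    """Index permutations for four common scan routes over an H x W patch grid."""
    idx = np.arange(H * W).reshape(H, W)
    return {
        "row_forward":  idx.reshape(-1),          # left-to-right, top-to-bottom raster
        "row_backward": idx.reshape(-1)[::-1],    # reversed raster order
        "col_forward":  idx.T.reshape(-1),        # column-major traversal
        "col_backward": idx.T.reshape(-1)[::-1],
    }

# Each route reorders the same patch tokens before an SSM pass; the outputs
# are scattered back to their original positions and merged.
```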

2. Vision Mamba Architectures and Block Design

VM manifests in multiple architectural variants, most prominently as ViM and VMamba backbones:

  • ViM block: Pairs two selective SSMs that scan the rasterized patch sequence in forward and backward order, respectively. The two outputs are fused (e.g., weighted sum, MLP), giving every token a full global receptive field (Zhu et al., 17 Jan 2024). A learnable positional embedding is typically added to each patch token.
  • VMamba block and SS2D module: Generalizes ViM by fusing multiple scanning routes (e.g., horizontal, vertical, and their reverses) via the Selective Scan 2D (SS2D) module. Each scan applies a separate SSM, and the block merges their outputs for robust, isotropic global context (Liu et al., 18 Jan 2024, Xu et al., 29 Apr 2024).
  • Hybrid blocks: Incorporate depthwise convolution or channel-mixing MLPs alongside SSM blocks, injecting local inductive biases to balance the global context of the SSM (Liu et al., 18 Jan 2024, Shi et al., 23 May 2024). Multi-scale and hierarchical pooling are also employed in various descendants (e.g., MSVMamba, FastVim).

Architectural hyperparameters (e.g., state size, expansion ratios, number of directions) are tuned for per-task tradeoff among FLOPs, throughput, and accuracy.
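A hedged PyTorch-style sketch of the bidirectional fusion in a ViM-like block is shown below; the `ssm_fwd`/`ssm_bwd` callables stand in for selective-scan layers, and the gating and depthwise-convolution branches of the actual ViM block are omitted for brevity.

```python
import torch.nn as nn

class BidirectionalSSMBlock(nn.Module):
    """Illustrative ViM-style block: forward and backward selective scans, fused residually."""
    def __init__(self, dim, ssm_fwd, ssm_bwd):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.in_proj = nn.Linear(dim, dim)
        self.ssm_fwd = ssm_fwd            # selective SSM over the forward token order
        self.ssm_bwd = ssm_bwd            # selective SSM over the reversed token order
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, tokens):            # tokens: (B, L, D), patch embeddings + positions
        x = self.in_proj(self.norm(tokens))
        fwd = self.ssm_fwd(x)                                   # scan forward
        bwd = self.ssm_bwd(x.flip(dims=[1])).flip(dims=[1])     # scan backward, re-align
        return tokens + self.out_proj(fwd + bwd)                # fuse and add residual
```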

3. Computational and Memory Complexity

A defining property of VM is its strict linear scaling with sequence length and input resolution:

| Model Type | Time Complexity | Memory Complexity |
|---|---|---|
| CNN layer | $O(kL)$ | $O(L)$ |
| Transformer MSA | $O(L^2)$ | $O(L^2)$ |
| Vision Mamba | $O(Ld^2)$ | $O(Ld)$ |

(where $L$ = number of tokens, $k$ = kernel size, $d$ = feature dimension) (Zhu et al., 17 Jan 2024, Liu et al., 18 Jan 2024, Zhang et al., 24 Apr 2024, Kashefi et al., 16 Oct 2025).
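The table's leading-order terms can be compared with a quick back-of-the-envelope calculation; the snippet below ignores constant factors and projection costs and only illustrates the quadratic-versus-linear growth in $L$.

```python
# Back-of-the-envelope growth comparison implied by the table above
# (constants and projection costs ignored; N is a small fixed state size).
def attention_cost(L, d):
    return L * L * d            # pairwise similarity dominates: grows as L^2

def ssm_cost(L, d, N=16):
    return L * d * N            # one linear scan over the sequence: grows as L

for side in (14, 28, 56):       # patch-grid sides, e.g. 224/16, 448/16, 896/16 inputs
    L = side * side
    ratio = attention_cost(L, 384) / ssm_cost(L, 384)
    print(f"L={L:5d}  attention/ssm cost ratio ~ {ratio:.0f}")
```

The ratio itself grows linearly with $L$, which is why the gap widens rapidly at high resolution.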

Empirically, VM preserves accuracy as input resolution increases, unlike Transformers, which either collapse in throughput/memory (the quadratic bottleneck) or require aggressive token downsampling.

Such scaling enables models to process fine-grained or volumetric data (medical, remote sensing, porous media) at resolutions infeasible for attention-based architectures.

4. Diverse Scanning and Pooling Strategies

Proper adaptation of 1D SSMs to 2D/3D vision necessitates sophisticated scan and fusion schemes:

  • Single/bidirectional scans: Mitigate the causal bias of a single 1D traversal; bidirectional scanning extends every token's receptive field to the full sequence (Zhu et al., 17 Jan 2024, Liu et al., 18 Jan 2024).
  • Multi-directional (cross-scan): E.g., four-way (row/col reversible) or eight-way (diagonal, zigzag) scans for full spatial mixing (Liu et al., 18 Jan 2024, Xu et al., 29 Apr 2024).
  • Pooling and multiscale: FastVim alternately pools tokens along the spatial axes to shorten each scan, while MSVMamba performs coarse scans on downsampled features and then refines via upsampling, optimizing the speed–accuracy frontier (a minimal sketch of the coarse-scan idea appears after this section's closing paragraph) (Kapse et al., 1 Feb 2025, Shi et al., 23 May 2024).
  • Dimension-independent and manifold scanning: Extensions such as MobileViM and SiM introduce dimension-agnostic projections or patchings for 3D volumes and spherical data, using specialized scan/fusion to respect data topology (Dai et al., 19 Feb 2025, He et al., 24 Jan 2025).
  • Local-global token mixing: LoG-VMamba concatenates local and global neighborhoods per token, maintaining both spatial adjacency and rapid global summary (Dang et al., 26 Aug 2024).

These strategies maintain linear complexity, avoid expensive pairwise attention, and ensure global receptive field even on non-Euclidean or highly anisotropic data.
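As a rough illustration of the coarse-scan-then-refine idea (in the spirit of MSVMamba, not its exact design), the sketch below pools a feature map, applies a selective-scan block on the smaller grid, and fuses the upsampled result back; `ssm_block` is an assumed callable that flattens, scans, and reshapes its input.

```python
import torch.nn.functional as F

def multiscale_scan(x, ssm_block, scale=2):
    """Illustrative coarse-scan-then-upsample fusion.

    x: (B, C, H, W) feature map; ssm_block preserves the spatial shape of its input.
    """
    B, C, H, W = x.shape
    coarse = F.avg_pool2d(x, kernel_size=scale)        # cheap global context at 1/scale resolution
    coarse = ssm_block(coarse)                          # linear-cost selective scan on fewer tokens
    coarse = F.interpolate(coarse, size=(H, W), mode="bilinear", align_corners=False)
    return x + coarse                                   # fuse coarse global context with fine features
```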

5. Empirical Results Across Tasks and Modalities

Vision Mamba's effectiveness has been validated on canonical and specialized benchmarks:

ImageNet-1K Classification (224²):

| Model | Params | FLOPs | Top-1 Acc (%) |
|---|---|---|---|
| Vim-Ti | 5.4M | 1.1G | 72.8 |
| DeiT-Ti | 5.7M | 1.3G | 71.2 |
| Vim-S | 21M | 4.6G | 80.9 |
| DeiT-S | 22M | 4.6G | 79.8 |
| Vim-B | 88M | 17.2G | 83.0 |
| DeiT-B | 86M | 17.5G | 81.8 |

(Zhu et al., 17 Jan 2024)

Downstream Dense Prediction:

Medical and 3D Applications:

  • Alzheimer's 3D MRI: VM accuracy 65% vs. CNN 49% vs. Transformer 46% (A et al., 9 Jun 2024).
  • Breast Ultrasound: VMamba-ti achieves 89.36% accuracy, outperforming all ResNet/VGG/ViT baselines (p<0.05) (Nasiri-Sarvi et al., 4 Jul 2024).
  • Medical Image Synthesis: VM-DDPM FID 11.783 (ChestXRay), outperforming CNN- and Transformer-based DDPM at 21.695–23.679 (Ju et al., 9 May 2024).
  • 3D Segmentation: MobileViM_s processes 64³ volumes at 91 FPS (RTX 4090) with 92.72% Dice on PENGWIN, beating nnU-Net, SegMamba₃D, SwinUNETR-V2 (Dai et al., 19 Feb 2025).

Visual Generation and Restoration:

  • Super-Resolution: DVMSR (Distillated Vision Mamba) matches or exceeds state-of-the-art SSIM with only 0.424M parameters (Lei et al., 5 May 2024).

6. Extensions: Hybridization, Multiscale, and Specialized Modules

The VM backbone serves as a foundation for diverse vision architectures:

Autoregressive pretraining (ARM) leverages the unidirectional SSM structure for fast and scalable self-supervised pretraining, with base-size ARM models exceeding 83% ImageNet accuracy and scaling stably to 0.7B parameters (Ren et al., 11 Jun 2024).
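A minimal, hypothetical sketch of the generic next-patch prediction objective behind such pretraining is shown below; the actual ARM recipe differs in its prediction targets and patch grouping, so treat the names and loss choice as assumptions.

```python
import torch.nn.functional as F

def next_patch_loss(causal_backbone, patch_embeds):
    """Illustrative next-patch regression loss for a unidirectional (causal) backbone.

    patch_embeds: (B, L, D) patch embeddings in a fixed raster order.
    causal_backbone: model whose output at position t depends only on positions <= t.
    """
    preds = causal_backbone(patch_embeds[:, :-1])   # predict each subsequent patch from its prefix
    targets = patch_embeds[:, 1:]                   # shift-by-one regression targets
    return F.mse_loss(preds, targets)
```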

7. Open Challenges and Future Directions

Despite VM's rapid adoption, several challenges and research opportunities remain.

Vision Mamba continues to generalize across modalities, rapidly absorbing advances from structured modeling, efficient architecture search, and large-scale pretraining, positioning SSM-based backbones as foundational models for the next era of efficient, scalable visual understanding.
