
Vision Mamba Models

Updated 3 February 2026
  • Vision Mamba Models are efficient visual backbones that employ dynamic state-space models to capture both local and global features in image and video tasks.
  • They utilize bidirectional and multidirectional scanning with selective gating to achieve linear computational complexity and high throughput.
  • Their design supports hardware-friendly, parallel implementations and scales to high resolutions while matching or surpassing transformer accuracy.

Vision Mamba Models refer to a class of vision backbones based on state-space models (SSMs), specifically Mamba, that aim to provide efficient, scalable, and context-aware alternatives to transformer- and convolution-based visual representation learning. These architectures introduce bidirectional and selective scan mechanisms, spatiotemporal pipelines, and a set of architectural innovations that together offer linear computational and memory complexity while preserving or surpassing the accuracy of conventional models for image and video understanding tasks (Ibrahim et al., 11 Feb 2025, Zhu et al., 2024, Wang et al., 2024).

1. Mathematical Foundations: State-Space Models and Selectivity

At the heart of Vision Mamba is a linear SSM recurrence, derived from a continuous-time system and parameterized by dynamic matrices that are input-dependent and vary along the sequence and batch dimensions. Formally, for a sequence of embedded tokens $x_{1:M} \in \mathbb{R}^{M \times E}$, the SSM evolves a hidden state $s_t \in \mathbb{R}^N$:

$$s_t = A s_{t-1} + B x_t, \qquad y_t = C s_t,$$

with $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times E}$, and $C \in \mathbb{R}^{E \times N}$. In Vision Mamba, these matrices are not fixed but are generated dynamically for each token and batch element through learned projections on the input.

Specifically, for each Mamba block at layer $l$, the input tensor $T_{l-1} \in \mathbb{R}^{B \times M \times D}$ is normalized and projected to produce $x$ and a gating vector $z$:

$$T_{l-1}' = \mathrm{Norm}(T_{l-1}), \quad x = W^x T_{l-1}', \quad z = W^z T_{l-1}',$$

where both $x, z \in \mathbb{R}^{B \times M \times E}$. Direction-specific convolutions and non-linearities extract per-time-step SSM parameters, enabling the block to modulate its recurrent dynamics selectively:

$$x_o' = \mathrm{SiLU}(\mathrm{Conv1d}_o(x)); \quad B_o = W^{B_o} x_o', \quad C_o = W^{C_o} x_o', \quad \Delta_o = \log\left(1 + \exp\left(W^{\Delta}_o x_o' + p_o^{\Delta}\right)\right).$$

Time-varying matrices are constructed as:

$$\bar{A}_o = \Delta_o \otimes P_o^A, \qquad \bar{B}_o = \Delta_o \otimes B_o,$$

where $P_o^A \in \mathbb{R}^{E \times N}$ serves as a base direction-specific transition.

The selective scan is achieved through:

$$y_o = \mathrm{SSM}(\bar{A}_o, \bar{B}_o, C_o)(x_o'),$$

followed by gating and fusion:

$$\tilde{y}_{\mathrm{fwd}} = y_{\mathrm{fwd}} \odot \mathrm{SiLU}(z), \quad \tilde{y}_{\mathrm{bwd}} = y_{\mathrm{bwd}} \odot \mathrm{SiLU}(z),$$

$$T_l = W^T(\tilde{y}_{\mathrm{fwd}} + \tilde{y}_{\mathrm{bwd}}) + T_{l-1}.$$

The selectivity emerges from the gating $\mathrm{SiLU}(z)$ and per-position scaling $\Delta_o$, allowing adaptive focus and computational allocation across the sequence (Ibrahim et al., 11 Feb 2025).
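To make this dataflow concrete, the following is a minimal PyTorch sketch of one bidirectional block, written directly from the equations above. All class and parameter names, default dimensions (d_model, d_inner, d_state), and initializations are illustrative assumptions rather than code from the cited papers, and the explicit Python loop stands in for the fused, hardware-aware scan kernel used in practice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiMambaBlockSketch(nn.Module):
    """Bidirectional selective-scan block sketch (illustrative shapes and names)."""

    def __init__(self, d_model=192, d_inner=384, d_state=16, d_conv=4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_inner)   # produces x and the gate z
        self.out_proj = nn.Linear(d_inner, d_model)      # W^T
        self.dirs = nn.ModuleDict()
        self.P_A = nn.ParameterDict()                    # base transitions P_o^A
        self.p_delta = nn.ParameterDict()                # biases p_o^Delta
        for o in ("fwd", "bwd"):
            self.dirs[o] = nn.ModuleDict({
                "conv": nn.Conv1d(d_inner, d_inner, d_conv,
                                  padding=d_conv - 1, groups=d_inner),
                "B": nn.Linear(d_inner, d_state),
                "C": nn.Linear(d_inner, d_state),
                "delta": nn.Linear(d_inner, d_inner),
            })
            # Arbitrary illustrative init; practical Mamba kernels parameterize
            # and discretize the transition differently.
            self.P_A[o] = nn.Parameter(torch.rand(d_inner, d_state))
            self.p_delta[o] = nn.Parameter(torch.zeros(d_inner))

    def scan(self, o, x):                                # x: (B, M, E)
        Bsz, M, E = x.shape
        mods = self.dirs[o]
        xc = mods["conv"](x.transpose(1, 2))[..., :M].transpose(1, 2)  # causal 1D conv
        x_o = F.silu(xc)                                               # x'_o
        B_o, C_o = mods["B"](x_o), mods["C"](x_o)                      # (B, M, N)
        delta = F.softplus(mods["delta"](x_o) + self.p_delta[o])       # log(1+exp(.)): (B, M, E)
        A_bar = delta.unsqueeze(-1) * self.P_A[o]                      # (B, M, E, N)
        B_bar = delta.unsqueeze(-1) * B_o.unsqueeze(2)                 # (B, M, E, N)
        s = x.new_zeros(Bsz, E, self.P_A[o].shape[1])
        ys = []
        for t in range(M):   # explicit recurrence; real kernels parallelize this scan
            s = A_bar[:, t] * s + B_bar[:, t] * x_o[:, t].unsqueeze(-1)
            ys.append(torch.einsum("ben,bn->be", s, C_o[:, t]))        # y_t = C_t s_t
        return torch.stack(ys, dim=1)                                  # (B, M, E)

    def forward(self, T):                                # T_{l-1}: (B, M, D)
        x, z = self.in_proj(self.norm(T)).chunk(2, dim=-1)
        y_fwd = self.scan("fwd", x)
        y_bwd = self.scan("bwd", x.flip(1)).flip(1)      # backward scan on reversed tokens
        y = (y_fwd + y_bwd) * F.silu(z)                  # equals gating each stream, then summing
        return self.out_proj(y) + T                      # residual: T_l
```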

2. Bidirectional, Multidimensional, and Selective Scanning

To adapt sequential SSMs to vision tasks, Vision Mamba employs various scanning strategies:

  • 1D bidirectional scanning: Each block performs forward and backward scans over the token sequence, ensuring tokens have both past and future spatial or temporal context (Zhu et al., 2024).
  • 2D and multi-directional scanning: In advanced variants (e.g., VMamba, V2M), multiple flattening and scan routes are used—such as four-way (corner-to-corner) scans—to recover 2D spatial inductive bias and aggregate information along multiple axes (Wang et al., 2024, Rahman et al., 2024).
  • Parallel scan algorithms: Efficient implementations employ parallel prefix-sum algorithms (e.g., Blelloch scan), reducing the sequential depth of the recurrence from $O(L)$ to $O(\log L)$ for sequence length $L$, as in FastVim, which further reduces depth by alternate spatial pooling (Kapse et al., 1 Feb 2025); a minimal scan sketch follows this list.
  • Selective gating and dynamic adaptation: Vision Mamba blocks use learned gates and per-position modulation, offering position-aware and input-adaptive recurrence that "filters" uninformative positions and enhances contextual integration (Ibrahim et al., 11 Feb 2025).
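The parallel-scan bullet can be made concrete with a short sketch. The snippet below computes the diagonal recurrence $s_t = a_t s_{t-1} + b_t$ in $O(\log L)$ sequential steps using a Hillis–Steele doubling scheme; it illustrates the same associative-operator idea as the work-efficient Blelloch scan used in the cited kernels, but it is not their implementation, and all shapes and names are assumptions.

```python
import torch
import torch.nn.functional as F


def parallel_linear_scan(a, b):
    """Prefix scan for s_t = a_t * s_{t-1} + b_t with s_{-1} = 0.

    a, b: tensors of shape (..., L). Returns s of the same shape,
    using about log2(L) doubling steps instead of an L-step loop.
    """
    L = a.shape[-1]
    A, B = a.clone(), b.clone()
    offset = 1
    while offset < L:
        # Compose each position with the prefix ending `offset` steps earlier:
        # (A_t, B_t) <- (A_t * A_{t-offset}, A_t * B_{t-offset} + B_t)
        A_prev = F.pad(A[..., :-offset], (offset, 0), value=1.0)  # identity: a = 1
        B_prev = F.pad(B[..., :-offset], (offset, 0), value=0.0)  # identity: b = 0
        A, B = A * A_prev, A * B_prev + B
        offset *= 2
    return B  # with s_{-1} = 0 the remaining A-weighted term vanishes


# Sanity check against the plain sequential recurrence
a, b = torch.rand(4, 8), torch.randn(4, 8)
s, out = torch.zeros(4), []
for t in range(8):
    s = a[:, t] * s + b[:, t]
    out.append(s)
assert torch.allclose(parallel_linear_scan(a, b), torch.stack(out, dim=-1), atol=1e-5)
```

In the block equations above, $a_t$ plays the role of the (diagonal) entries of $\bar{A}_o$ and $b_t$ the role of $\bar{B}_o x_o'$, applied elementwise over the state and channel dimensions.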

3. Spatiotemporal Extensions and VideoMamba

Vision Mamba has been generalized for spatiotemporal data (e.g., video) through VideoMamba and related variants:

  • Temporal scan: Recurrence across frame sequences at a fixed spatial position captures temporal dependencies.
  • Spatial scan: Per-frame scanning along spatial dimensions enables spatial context aggregation.
  • Cross-scan and fusion modules: Mechanisms like Structure-Aware State Fusion (SASF) and Spatial & Channel Attention (SCAttn) transfer information between parallel scan streams, enabling integration of global and local features (Ibrahim et al., 11 Feb 2025).
  • Hierarchical and multiscale design: Direction alternation and multi-scale strategies, as in Hi-Mamba and Multi-Scale VMamba, allow local patterns to be processed early and global context to be integrated later, balancing model capacity and efficiency (Shi et al., 2024).

These architectural principles collectively ensure that Vision Mamba and its descendants attain both linear complexity and robust modeling of long-range dependencies in spatiotemporal data (Rahman et al., 2024, Ibrahim et al., 11 Feb 2025).
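As a concrete illustration of the temporal and spatial scans described above, the short sketch below flattens a video token grid in two different orders before handing it to a 1D Mamba block; the tensor shapes and variable names are illustrative assumptions, not code from VideoMamba or its variants.

```python
import torch

B, T, H, W, D = 2, 8, 14, 14, 192            # batch, frames, grid height/width, embed dim
tokens = torch.randn(B, T, H, W, D)

# Spatial scan: traverse each frame row-major, one frame after another
spatial_seq = tokens.reshape(B, T * H * W, D)

# Temporal scan: for each spatial position, traverse across frames first
temporal_seq = tokens.permute(0, 2, 3, 1, 4).reshape(B, H * W * T, D)

# Each ordering yields a (B, M, D) sequence that a bidirectional Mamba block can
# consume; fusion modules such as SASF or SCAttn then mix the resulting streams.
```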

4. Architectural Innovations and Position Embedding

Vision Mamba introduces several architectural enhancements beyond the core SSM block:

  • Position embeddings: Small bias terms in convolutional kernels or step parameters (e.g., $\bar{A}_o$, $\bar{B}_o$) inject positional information, critical due to the loss of explicit spatial relationships during sequence flattening (Ibrahim et al., 11 Feb 2025, Zhu et al., 2024); a minimal sketch appears at the end of this section.
  • Cross-scan and fusion modules: Modules for mixing information from different scan directions (e.g., SASF, SCAttn) are critical for fusing spatial and temporal streams and recovering 2D/3D locality (Ibrahim et al., 11 Feb 2025).
  • Hierarchical and multiscale grouping: Direction Alternation Hi-Mamba Groups (DA-HMGs) and multi-scale scanning preserve sensitivity to both local and global patterns by alternating processing at different scales and directions (Ibrahim et al., 11 Feb 2025, Shi et al., 2024).
  • Parallel, hardware-aware implementation: All major operations in the Mamba block (matrix-vector products, convolutions) are cast as 1D convolutions and recurrences, enabling efficient CUDA kernel fusion and maximal GPU utilization (Zhu et al., 2024).

These innovations are motivated by the need to balance inductive bias for spatial/temporal structure, maintain global receptive fields, and ensure hardware efficiency.
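The positional-information point is illustrated below with a minimal patch-embedding layer that adds a learnable 2D positional embedding before the token grid is flattened into a scan sequence (cf. the ablation in Section 6). Class and parameter names, default sizes, and the zero initialization are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class PatchEmbedWith2DPos(nn.Module):
    """Patch embedding plus a learnable 2D positional embedding (illustrative)."""

    def __init__(self, img_size=224, patch=16, d_model=192, in_ch=3):
        super().__init__()
        self.grid = img_size // patch                              # tokens per side
        self.proj = nn.Conv2d(in_ch, d_model, kernel_size=patch, stride=patch)
        # One learnable embedding per (row, col) position in the token grid
        self.pos = nn.Parameter(torch.zeros(1, d_model, self.grid, self.grid))

    def forward(self, img):                                        # (B, 3, H, W)
        x = self.proj(img) + self.pos                              # (B, D, h, w)
        return x.flatten(2).transpose(1, 2)                        # (B, M, D) scan sequence
```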

5. Computational and Memory Complexity

Vision Mamba achieves linear complexity per layer with respect to sequence length ($O(M)$ for $M$ tokens), in sharp contrast to the $O(M^2)$ complexity of transformer self-attention. This is established analytically and empirically by:

  • Bidirectional Vision Mamba blocks (per-layer cost): Two $O(M)$ SSM recursions plus a sequence of low-cost linear or convolutional operations (Ibrahim et al., 11 Feb 2025).
  • Memory footprint: Mamba blocks only require current/past state vectors instead of full $M \times M$ attention matrices, leading to substantial savings in both FLOPs and memory (Zhu et al., 2024).
  • Empirical efficiency: Vim achieves 2.8× higher throughput and up to 86.8% less GPU memory usage compared to DeiT for high-resolution images (e.g., $1248 \times 1248$) during batch inference (Zhu et al., 2024).
  • Parallel scan optimizations: FastVim reduces scan depth by half using interleaved pooling and achieves up to 72.5% inference speedup at $2048 \times 2048$ resolution with almost no loss in accuracy (Kapse et al., 1 Feb 2025).
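A back-of-envelope comparison makes the scaling gap tangible. The constants below (hidden sizes, the factor of two, the omission of projections and convolutions) are rough assumptions chosen only to show how the two cost terms grow with the token count $M$:

```python
def attention_cost(M, D):
    # QK^T and (attn @ V) each cost roughly M^2 * D multiply-adds
    return 2 * M ** 2 * D


def ssm_scan_cost(M, E, N):
    # Diagonal selective scan: roughly M * E * N multiply-adds per direction, two directions
    return 2 * M * E * N


for side in (14, 28, 56):   # token-grid side; M grows quadratically with image resolution
    M = side * side
    print(f"M={M:5d}  attention~{attention_cost(M, D=768):.2e}  ssm~{ssm_scan_cost(M, E=384, N=16):.2e}")
```

The quadratic term dominates quickly as resolution (and hence $M$) grows, which is the regime where the throughput and memory gaps reported above appear.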

6. Empirical Performance and Ablative Findings

Across core computer vision benchmarks, Vision Mamba and its variants consistently match or surpass the accuracy of attention-based backbones at lower compute:

Model       Params (M)   ImageNet Top-1 (%)   Throughput (img/s)   COCO box AP   ADE20K mIoU
Vim-Tiny    7            72.8                 1,050                –             –
Vim-Small   26           79.3                 2,100                46.7          –
Vim-Base    98           82.1                 1,500                50.5          –
ViT-Base    86           81.8                 600                  48.7          –
  • Bidirectionality and gating: Bidirectional scanning offers a 0.5–0.8% gain over unidirectional scanning, and learned gating provides an additional 0.3–0.6% in top-1 accuracy, with significant improvements in dynamic focus for moving objects in video (Ibrahim et al., 11 Feb 2025).
  • Ablation of key architectural components: Removing cross-scan fusion or hierarchical alternation degrades action recognition and reduces robustness to view changes; diminishing returns are observed beyond a certain number of hierarchical layers (Ibrahim et al., 11 Feb 2025).
  • Positional embeddings: Learnable 2D embeddings outperform fixed or 1D embeddings, with a typical 0.8% top-1 improvement on ImageNet (Zhu et al., 2024).
  • Scaling: Vision Mamba models scale to high resolutions and longer token sequences without the throughput collapse seen in traditional ViTs, and their accuracy degrades gracefully rather than abruptly.

7. Extensions, Limitations, and Future Directions

Advancements and ongoing directions related to Vision Mamba include:

  • Spatiotemporal and multimodal pipelines: Extensions to video (VideoMamba), video-language, and low-level tasks (segmentation, dense prediction) through tailored scan/fusion modules and multiscale design (Ibrahim et al., 11 Feb 2025, Rahman et al., 2024).
  • Model compression and acceleration: Dynamic Vision Mamba (DyVM) introduces token and block redundancy reduction via training-aligned pruning and dynamic block selection, achieving up to 35% FLOPs reduction with minimal accuracy loss (Wu et al., 7 Apr 2025). FastVim's pooling strategies further accelerate inference (Kapse et al., 1 Feb 2025). A generic illustration of token pruning appears after this list.
  • Stability and inductive bias: Adaptive scan strategies and position-aware embeddings aim to resolve the loss of spatial priors and causality mismatches in classic SSM-to-vision adaptations (Xu et al., 2024).
  • Interpretability, generalization, robustness: Open challenges persist in understanding the interpretability of SSM state evolution, domain generalization to out-of-distribution data, adversarial robustness involving the dynamically learned $B$, $C$, $\Delta$ parameters, and theoretical scaling laws (Xu et al., 2024, Rahman et al., 2024).
  • Hybrid architectures: Integration of convolutional, attention-based, and SSM modules (e.g., MambaVision) seeks to harness complementary inductive biases for edge and global structure extraction (Hatamizadeh et al., 2024).
  • Next-generation models: Research is aiming towards SSM–attention fusion, high-dimensional scan patterns, unified multi-modal architectures (including video and language), and further hardware and distributed system optimization (Ibrahim et al., 11 Feb 2025, Rahman et al., 2024).
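For the compression bullet referenced above, the following is a deliberately generic, simplified sketch of score-based token pruning; it is not the DyVM algorithm, and the scoring head, keep ratio, and scatter-based bypass are assumptions intended only to illustrate the general idea of skipping computation for low-importance tokens.

```python
import torch
import torch.nn as nn


class TokenPruneSketch(nn.Module):
    """Generic score-based token pruning (illustrative; not the DyVM procedure)."""

    def __init__(self, d_model=192, keep_ratio=0.7):
        super().__init__()
        self.score = nn.Linear(d_model, 1)   # hypothetical per-token importance head
        self.keep_ratio = keep_ratio

    def forward(self, tokens, block):        # tokens: (B, M, D); block: e.g. a Mamba block
        B, M, D = tokens.shape
        k = max(1, int(M * self.keep_ratio))
        scores = self.score(tokens).squeeze(-1)                   # (B, M)
        keep = scores.topk(k, dim=1).indices.sort(dim=1).values   # keep scan order intact
        idx = keep.unsqueeze(-1).expand(B, k, D)
        processed = block(torch.gather(tokens, 1, idx))           # run the block on fewer tokens
        out = tokens.clone()                                      # pruned tokens bypass the block
        return out.scatter_(1, idx, processed)
```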

Vision Mamba thus represents a rapidly maturing paradigm for scalable, context-aware visual representation learning, with demonstrable benefits in linear complexity, global context integration, and experimental performance across the full spectrum of visual tasks (Ibrahim et al., 11 Feb 2025, Zhu et al., 2024, Wang et al., 2024, Rahman et al., 2024).
