Vision Mamba: Efficient Global Context Modeling
- Vision Mamba is a family of vision models that replace quadratic self-attention in Vision Transformers with linear, selective state-space modeling for efficient global context.
- It employs bidirectional scanning, input-dependent gating, and hardware-aware SSM blocks to support scalable and low-complexity processing of large images and sequences.
- Empirical results demonstrate strong performance across classification, detection, segmentation, and multimodal tasks, positioning it as a unified paradigm in computer vision.
Vision Mamba is a family of vision foundation models that replace quadratic-cost self-attention in Vision Transformers with linear-complexity selective state-space modeling. By leveraging bidirectional scanning, input-dependent gating, and hardware-aware State Space Model (SSM) blocks, Vision Mamba achieves efficient global context modeling, scalability to large images and sequences, and strong empirical performance across a wide array of computer vision applications. The architecture is positioned as a unifying paradigm that combines the expressive power and context-awareness of transformers with the efficiency and inductive biases of state-space modeling, and it has seen rapid adoption and innovation in classification, detection, segmentation, medical, multimodal, and scientific imaging tasks (Ibrahim et al., 11 Feb 2025, Xu et al., 2024, Zhu et al., 2024, Liu et al., 2024).
1. Mathematical and Algorithmic Foundations
The core of Vision Mamba is the selective, input-dependent State Space Model (SSM), which extends classical linear time-invariant sequence models by allowing dynamic parameterization per token. The canonical continuous-time SSM is
$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t).$$
Via zero-order hold (ZOH) discretization with time step $\Delta$, this yields the discrete recurrence
$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t,$$
where $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$ (Xu et al., 2024, Ibrahim et al., 11 Feb 2025, Zhu et al., 2024). In Mamba, $B$, $C$, and $\Delta$ are further made input-dependent via small neural selectors and gating mechanisms, yielding
$$B_t = f_B(x_t), \qquad C_t = f_C(x_t), \qquad \Delta_t = \mathrm{softplus}\!\left(f_\Delta(x_t)\right).$$
This input-adaptive recurrence enables selective filtering of visual tokens. Hardware-aware parallel prefix-scan implementations ensure that the scan of $L$ visual tokens is $O(L)$ in both compute and memory, as opposed to $O(L^2)$ for transformer-style attention (Liu et al., 2024).
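As a concrete reference, the following is a minimal NumPy sketch of the discretized selective-scan recurrence above. The function name, argument shapes, and random initializations are illustrative assumptions, and the sequential Python loop stands in for the hardware-aware parallel prefix-scan kernel.

```python
import numpy as np

def selective_ssm(x, A, W_B, W_C, W_dt):
    """Sequential reference of the selective (input-dependent) SSM recurrence.

    x    : (L, D)  flattened visual tokens, D channels
    A    : (D, N)  per-channel diagonal state dynamics
    W_B  : (D, N)  projection producing B_t from the current token
    W_C  : (D, N)  projection producing C_t from the current token
    W_dt : (D,)    projection producing the timescale Delta_t
    Returns y: (L, D). The optimized Mamba kernel computes the same recurrence
    with a parallel prefix scan instead of this Python loop.
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                            # hidden state per channel
    y = np.zeros((L, D))
    for t in range(L):
        xt = x[t]                                   # (D,)
        dt = np.log1p(np.exp(xt * W_dt))            # softplus -> Delta_t, (D,)
        B_t = xt @ W_B                              # (N,)  input-dependent B
        C_t = xt @ W_C                              # (N,)  input-dependent C
        A_bar = np.exp(dt[:, None] * A)             # ZOH discretization, (D, N)
        Bx = dt[:, None] * B_t[None, :] * xt[:, None]   # simplified \bar{B} x_t (Euler form)
        h = A_bar * h + Bx                          # selective state update
        y[t] = h @ C_t                              # per-channel readout
    return y

# toy usage with hypothetical sizes
rng = np.random.default_rng(0)
L, D, N = 16, 8, 4
y = selective_ssm(rng.standard_normal((L, D)),
                  -np.abs(rng.standard_normal((D, N))),   # negative dynamics for stability
                  rng.standard_normal((D, N)) * 0.1,
                  rng.standard_normal((D, N)) * 0.1,
                  rng.standard_normal(D) * 0.1)
```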
2. Bidirectional and Selective Scanning in Vision Mamba
Unlike NLP-oriented SSMs with strictly causal (unidirectional) scanning, Vision Mamba introduces bidirectional and multi-axis scanning to cover the non-causal, spatial/global dependencies intrinsic to images.
- Bidirectional scanning: Each Vision Mamba (ViM) block runs two SSM scans over the token sequence, one forward and one backward, with each directional SSM parameterized separately. Gating via a learned vector modulates the contribution from each direction per feature channel:
$$y = \mathrm{SiLU}(z) \odot y_{\mathrm{forward}} + \mathrm{SiLU}(z) \odot y_{\mathrm{backward}},$$
where $z$ is a learned projection of the block input and $\odot$ denotes element-wise multiplication.
The two outputs are merged by a final projection, enabling every token to aggregate context from both spatial directions with linear complexity (Ibrahim et al., 11 Feb 2025, Zhu et al., 2024, Xu et al., 2024); a minimal sketch of this merge follows the list below.
- Cross-scan and hierarchical designs: VMamba (hierarchical variant) applies four SSM scans per block (e.g., horizontal and vertical, forward and backward) fused by lightweight operators (e.g., 1×1 conv), supporting richer global/local modeling (Xu et al., 2024). Hierarchical pyramidal stacking is employed, analogous to CNN backbones, and multi-scale scanning further improves cost-accuracy trade-offs (Shi et al., 2024).
- Position encoding: Absolute or implicit position information is supplied via convolutions or explicitly learned embeddings, typically injected at the patch embedding or before SSM processing (Zhu et al., 2024).
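Below is a minimal sketch of the bidirectional merge used in a ViM-style block, under the simplifying assumptions that the two directional scans are supplied as opaque callables and that the final output projection is omitted; the names (bidirectional_block, W_z, causal_mean) are hypothetical.

```python
import numpy as np

def bidirectional_block(x, ssm_fwd, ssm_bwd, W_z):
    """Merge a forward and a backward selective scan, ViM-style.

    x       : (L, D) token sequence (patch embeddings, class token included)
    ssm_fwd : callable (L, D) -> (L, D), forward-direction SSM
    ssm_bwd : callable (L, D) -> (L, D), separately parameterized backward SSM
    W_z     : (D, D) projection producing the per-channel gate z
    The gate SiLU(x @ W_z) modulates both directions before they are summed;
    a final projection (omitted here) would map the result back to the model width.
    """
    z = x @ W_z
    gate = z / (1.0 + np.exp(-z))                  # SiLU gating
    y_fwd = ssm_fwd(x)                             # scan in raster order
    y_bwd = ssm_bwd(x[::-1])[::-1]                 # scan the reversed sequence, then flip back
    return gate * y_fwd + gate * y_bwd             # every token sees both directions

def causal_mean(s):
    # cheap causal placeholder standing in for a directional SSM
    return np.cumsum(s, axis=0) / np.arange(1, len(s) + 1)[:, None]

# toy usage
rng = np.random.default_rng(0)
L, D = 16, 8
out = bidirectional_block(rng.standard_normal((L, D)), causal_mean, causal_mean,
                          rng.standard_normal((D, D)) * 0.1)
```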
3. Scanning Techniques and Architectural Variants
Vision Mamba admits a wide range of scanning mechanisms and architectural variants adapted to different input structures (a scan-order sketch follows the list below):
| Scan Axis / Mode | Examples | Use-case/Impact |
|---|---|---|
| 1D (raster, zigzag) | H, V, D | Standard image flattening, video, point clouds (Xu et al., 2024, Liu et al., 2024) |
| Bidirectional | Forward, backward | Captures both past and future context |
| Multi-axis | H, V, diag | Enriches contextual field, at increased FLOPs |
| Multi-scale | Full-resolution + downsampled | Long-range dependencies, computational savings (Shi et al., 2024) |
- PlainMamba: Non-hierarchical, applies SSMs with e.g. zigzag scanning and windowing.
- VMamba: Hierarchical, multi-stage, with up/downsampling between stages, and multi-axis scanning per block.
- Adaptations for 3D, Sequence, Multimodal: VideoMamba uses spatiotemporal selective scans; PointMamba serializes 3D point clouds before applying SSMs; Fusion-Mamba, VL-Mamba, etc., replace self-attention in multimodal architectures (Xu et al., 2024, Liu et al., 2024).
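To make the scan-mode table concrete, the sketch below enumerates a few token orderings over an H×W patch grid and shows how a scan is undone for fusion. The helper scan_orders is a hypothetical illustration; a multi-axis block would run one SSM per ordering and fuse the outputs (e.g., with a 1×1 convolution).

```python
import numpy as np

def scan_orders(H, W):
    """Return flattened index orders over an H x W patch grid for common scan modes."""
    idx = np.arange(H * W).reshape(H, W)
    orders = {
        "raster_h": idx.reshape(-1),                 # row-major (horizontal) flattening
        "raster_v": idx.T.reshape(-1),               # column-major (vertical) flattening
        "zigzag_h": np.concatenate(                  # boustrophedon rows (PlainMamba-style)
            [row if i % 2 == 0 else row[::-1] for i, row in enumerate(idx)]),
    }
    # bidirectional variants are simply the reversed orders
    orders.update({name + "_rev": o[::-1] for name, o in list(orders.items())})
    return orders

# usage: reorder a (H*W, D) token matrix, scan, then undo the permutation for fusion
H, W, D = 4, 4, 8
tokens = np.random.default_rng(0).standard_normal((H * W, D))
for name, order in scan_orders(H, W).items():
    scanned = tokens[order]              # tokens in this scan order (fed to one SSM branch)
    inverse = np.argsort(order)          # maps results back to spatial positions
    restored = scanned[inverse]
    assert np.allclose(restored, tokens)
```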
4. Computational Complexity and Efficiency
The fundamental advantage of Vision Mamba is its ability to model global interactions with linear time and space complexity. Empirical and theoretical results indicate:
- Complexity per block: $O(LN)$ for SSMs ($L$ = sequence length, $N$ = hidden size/state dimension), compared to $O(L^2)$ for transformer self-attention (Liu et al., 2024).
- Empirical benchmarks: Vim is up to 2.8× faster and uses 86.8% less GPU memory than DeiT-S at high resolution (1248×1248) (Zhu et al., 2024). FastVim further shortens the per-block SSM scan via spatial token pooling, achieving up to a 72.5% speedup at 2048×2048 (Kapse et al., 1 Feb 2025).
- Token reduction and pruning: Standard ViT token-pruning algorithms degrade Mamba performance because SSMs are sensitive to sequence order. Structure-aware methods such as MTR instead use the per-token timescale $\Delta$ as an importance score while preserving scan order, enabling up to 40% FLOP reduction with ≤1.6% accuracy drop (Ma et al., 18 Jul 2025). Dynamic Vision Mamba (DyVM) combines token pruning via rearrangement with per-image dynamic block skipping, achieving ~35% FLOP reduction with ≤2% performance loss (Wu et al., 7 Apr 2025). A simplified sketch of timescale-based pruning follows this list.
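The structure-aware pruning idea can be sketched as follows. This is a simplified illustration of using the per-token timescale Δ as an importance score while preserving scan order, not the exact MTR or DyVM procedure; the function name and keep ratio are hypothetical.

```python
import numpy as np

def prune_by_timescale(tokens, delta, keep_ratio=0.6):
    """Drop low-importance tokens while preserving the original scan order.

    tokens : (L, D) token sequence for one SSM scan
    delta  : (L,)   per-token timescale Delta_t (e.g., averaged over channels); a large
             Delta means the token writes strongly into the state, so it is kept
    Returns the kept tokens, still sorted by original position, plus the kept indices
    (needed to scatter outputs back or to route skipped tokens through residual paths).
    """
    L = len(tokens)
    k = max(1, int(round(keep_ratio * L)))
    keep = np.argsort(delta)[-k:]        # top-k most "selective" tokens
    keep = np.sort(keep)                 # crucial: restore scan order before the SSM
    return tokens[keep], keep

# toy usage: roughly 40% of tokens pruned at keep_ratio = 0.6
rng = np.random.default_rng(0)
L, D = 196, 64                           # e.g. a 14 x 14 patch grid
tokens = rng.standard_normal((L, D))
delta = np.log1p(np.exp(rng.standard_normal(L)))      # softplus timescales
kept, keep_idx = prune_by_timescale(tokens, delta, keep_ratio=0.6)
print(kept.shape, f"{1 - len(kept) / L:.0%} of tokens pruned")
```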
5. Empirical Results and Applications
Vision Mamba models systematically match or surpass transformer and CNN baselines across an array of vision benchmarks:
| Task | Model/Setting | Top-1 / Metric | Params/FLOPs | Reference |
|---|---|---|---|---|
| ImageNet-1K classification | VMamba-S (hierarchical) | 83.6% | 44M/11.2G | (Xu et al., 2024) |
| | Vim-S (bidirectional SSM) | 81.2% | 22M/4.3G | (Zhu et al., 2024) |
| | FastVim-S | 81.1% | 26M/4.4G | (Kapse et al., 1 Feb 2025) |
| Object detection (COCO) | VMamba-B + Mask-RCNN | 49.2 APb / 43.9 APm | 108M/485G | (Xu et al., 2024) |
| Semantic segmentation | VMamba-B + UperNet (ADE20K) | 51.0 mIoU | – | (Xu et al., 2024) |
| Video understanding | VideoMamba (Kinetics) | matches ViViT/TimeSformer | >5× speed | (Liu et al., 2024) |
| Medical image classification | MedMamba-S (CPN X-ray) | 97.3% OA / 0.997 AUC | 23.5M/3.5G | (Yue et al., 2024) |
| Multimodal UAV detection | UAVD-Mamba (DroneVehicle) | 83.0% mAP (+3.6 over OAFA) | 39.7M/38.9G | (Li et al., 1 Jul 2025) |
Autoregressive pretraining (ARM) unlocks scaling to huge model sizes: ARM-H (662M params) achieves 85.0% on ImageNet with stable convergence, outperforming supervised and MAE-pretrained Mamba variants (Ren et al., 2024).
In 3D and scientific domains, Vision Mamba outperforms transformer and CNN baselines in permeability prediction of 3D porous media while using 13× fewer parameters and 65% lower GPU memory (Kashefi et al., 16 Oct 2025). For echocardiographic segmentation, MSV-Mamba improves Dice scores on EchoNet-Dynamic and CAMUS (Yang et al., 13 Jan 2025).
6. Limitations, Challenges, and Future Directions
Vision Mamba architectures pose several open questions and current limitations:
- Stability at scale: Very deep Mamba stacks can encounter vanishing/exploding gradients, limiting scaling compared to some transformer configurations (Xu et al., 2024).
- Scan order and spatial generality: Optimal 2D/3D scan order remains empirical; learned or adaptive scanning is a proposed direction (Liu et al., 2024, Rahman et al., 2024).
- Interpretability and robustness: The “hidden attention” matrix of SSMs is less interpretable than attention maps; work is ongoing to adapt explainability tools (Liu et al., 2024, Rahman et al., 2024).
- Computational redundancy: Multi-directional scanning inflates FLOPs; new multi-scale or windowed strategies trade off local/global context and efficiency (Shi et al., 2024).
- Transfer and domain generalization: Domain shifts can affect SSM parameter robustness and generalization (Xu et al., 2024).
- Resource availability: Fewer large-scale public Vision Mamba checkpoints exist compared to ViTs; community adoption remains in progress (Rahman et al., 2024).
Key research frontiers include hardware-aware kernels, hybridizing state-space with attention and/or convolutions, learned scanning, domain adaptation, and ultra-large foundational pretraining (Ibrahim et al., 11 Feb 2025, Xu et al., 2024, Liu et al., 2024, Rahman et al., 2024).
7. Significance and Outlook in Computer Vision
Vision Mamba has rapidly evolved into a versatile backbone for general vision, multimodal fusion, scientific imaging, and edge deployment scenarios. Its efficient global context modeling, dynamic input-aware computation, and amenability to hybridization with domain-specific operations position it as a foundational alternative to transformers and CNNs, especially as model and data scale continue to grow. As the taxonomy of variants and practical deployments expands, and as further evidence accumulates on its scaling, robustness, and efficiency characteristics, Vision Mamba is poised to become a central component of next-generation visual AI systems (Ibrahim et al., 11 Feb 2025, Xu et al., 2024, Liu et al., 2024, Shi et al., 2024, Kapse et al., 1 Feb 2025, Ren et al., 2024).
References:
(Ibrahim et al., 11 Feb 2025, Xu et al., 2024, Zhu et al., 2024, Liu et al., 2024, Kapse et al., 1 Feb 2025, Yue et al., 2024, Wang et al., 2024, Ma et al., 18 Jul 2025, Wu et al., 7 Apr 2025, Kashefi et al., 16 Oct 2025, Ren et al., 2024, Li et al., 1 Jul 2025, Rahman et al., 2024, Shi et al., 2024, Yang et al., 13 Jan 2025, Nasiri-Sarvi et al., 2024, Chen et al., 2024)