VMamba: Efficient Visual SSM Backbone

Updated 22 February 2026
  • VMamba is a family of vision backbones that use selective state space models to achieve global contextual modeling with linear computational complexity.
  • It integrates multi-directional selective scanning and adaptive state updates to efficiently process high-resolution and long-sequence visual data.
  • VMamba offers interpretability through controllability analysis, delivering fine-grained diagnostic insights and robust performance across diverse vision tasks.

Visual Mamba (VMamba) is a family of vision backbones that applies selective State Space Models (SSMs), specifically the Mamba architecture, to visual sequence modeling. VMamba achieves global contextual modeling with linear computational and memory complexity, providing an efficient alternative to Vision Transformers (ViTs), especially for high-resolution and long-sequence vision data. Its design incorporates structured recurrent dynamics, multi-directional scanning, and input-adaptive state updates, and it enables physically motivated interpretability through controllability analysis.

1. Selective State Space Modeling for Visual Sequences

VMamba extends the Mamba SSM paradigm from 1D sequential data to 2D/3D visual signals by viewing patchified images as sequences and employing parameter-efficient selective SSMs at the core of every block. For an image partitioned into $L$ tokens (typically patches), each layer applies the following discrete SSM recurrence over token positions $k$:

$$x_k = \bar{A}_k x_{k-1} + \bar{B}_k u_k, \qquad y_k = \bar{C}_k x_k$$

where the transition, input, and output matrices $\{\bar{A}_k, \bar{B}_k, \bar{C}_k\}$ are input-dependent, dynamically generated from $u_k$ (the token or patch at position $k$). This input-conditioned parameterization distinguishes VMamba from classical SSMs (e.g., S4), enabling content-aware adaptation of memory and focus within the global recurrence (Mabrok et al., 16 Nov 2025).

The SSM matrices are computed via zero-order-hold discretization:

$$\bar{A}_k = \exp(\Delta_k A_k), \qquad \bar{B}_k = (\Delta_k A_k)^{-1}(\bar{A}_k - I)\,(\Delta_k B_k)$$

with $\Delta_k$ a learnable, possibly input-dependent stepsize.

VMamba replaces the $O(L^2)$ self-attention mechanism of ViTs with this $O(LN)$ SSM recurrence ($N$: state dimension), yielding strictly linear scaling in both compute and memory footprint (Mabrok et al., 16 Nov 2025, Liu et al., 2024). The final prediction is typically generated via global average pooling over the output embeddings, followed by a classifier head.
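As a concrete illustration, the recurrence and its zero-order-hold discretization can be sketched in NumPy for a single scalar channel with a diagonal $A$. This is a deliberately simplified sketch, not the optimized scan kernel; the function name and argument shapes are assumptions for exposition:

```python
import numpy as np

def selective_scan(u, A, B, C, delta):
    """Toy 1D selective scan (illustrative sketch, not an optimized kernel).

    u:     (L,) scalar input sequence (one channel for simplicity)
    A:     (N,) diagonal continuous-time transition (negative entries for stability)
    B, C:  (L, N) input-dependent input/output matrices
    delta: (L,) positive, possibly input-dependent stepsizes
    """
    L, N = B.shape
    x = np.zeros(N)
    y = np.empty(L)
    for k in range(L):
        A_bar = np.exp(delta[k] * A)                 # zero-order-hold discretization
        # B_bar = (Delta_k A)^{-1} (A_bar - I) (Delta_k B_k); elementwise since A is diagonal
        B_bar = (A_bar - 1.0) / (delta[k] * A) * (delta[k] * B[k])
        x = A_bar * x + B_bar * u[k]                 # x_k = A_bar_k x_{k-1} + B_bar_k u_k
        y[k] = C[k] @ x                              # y_k = C_bar_k x_k
    return y
```

Each token costs $O(N)$ work, so the full pass is $O(LN)$, in contrast to the $O(L^2)$ pairwise interactions of self-attention.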

2. Multi-directional Selective Scanning and Architectural Instantiations

To address the intrinsic nonsequential structure of images, VMamba incorporates multi-directional selective scans. The canonical SS2D (“2D-Selective-Scan”) module unfolds a 2D feature map into multiple 1D scan sequences along different spatial directions (e.g., left-right, right-left, top-bottom, bottom-top). Each sequence is processed by its own SSM block, and their outputs are merged back to the 2D grid:

  • For each scan direction $r$, the patch sequence $u^{(r)}$ is constructed according to a specific spatial traversal order (Liu et al., 2024).
  • Four-direction cross-scanning captures global context and isotropic receptive field coverage.
  • Downsampling, patch-merging, and hierarchical design further promote multi-scale feature extraction (similar to CNN/ViT backbones).
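The unfold/merge step can be sketched in NumPy for one plausible set of four traversal orders (an illustration of the SS2D idea, not the library implementation; in practice each of the four sequences is processed by its own SSM before merging):

```python
import numpy as np

def cross_scan(feat):
    """Unfold an (H, W) feature map into four 1D scan orders (SS2D-style sketch).

    Returns a (4, H*W) array: row-major, row-major reversed,
    column-major, column-major reversed (assumed orders for illustration).
    """
    hw = feat.reshape(-1)            # left-to-right, top-to-bottom (row-major)
    wh = feat.T.reshape(-1)          # top-to-bottom, left-to-right (column-major)
    return np.stack([hw, hw[::-1], wh, wh[::-1]])

def cross_merge(scans, H, W):
    """Invert each traversal order and sum the four branch outputs on the 2D grid."""
    hw, hw_r, wh, wh_r = scans
    out = hw.reshape(H, W).copy()
    out += hw_r[::-1].reshape(H, W)
    out += wh.reshape(W, H).T
    out += wh_r[::-1].reshape(W, H).T
    return out
```

With identity branches (no SSM applied), merging recovers four times the input map, which is a quick sanity check that the scan orders invert correctly.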

VMamba variants span classification, segmentation, detection, and regression heads. Encoder-decoder designs (e.g., VM-UNet, Mamba-UNet) employ patch embedding, multi-stage downsampling, and mirrored upsampling, with skip connections fusing hierarchical representations (Chen et al., 2024, Wang et al., 2024).

Block-level architectural specifics include:

  • Depthwise convolution and MLP mixing for local channel interaction.
  • No explicit positional encoding (spatial order is embedded by the scanning pattern).
  • Residual normalization and gating for stability and efficient training.

The block parameterization and FLOPs are tightly controlled. For a block with embedding dimension $d$ and state dimension $H$, the typical parameter count per SS2D block is $O(d^2) + O(Hd)$, significantly smaller than ViT's per-block quadratic scaling (Chen et al., 2024).
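A back-of-envelope comparison makes the scaling concrete. The constants below are illustrative assumptions (two $d \times d$ projections, plus $B$, $C$, $\Delta$ generators per scan direction; a standard ViT block with QKV/output projections and a 4x-expansion MLP), not figures from the cited papers:

```python
def ss2d_block_params(d, H, directions=4):
    """Rough SS2D block parameter count: O(d^2) projections plus O(Hd)
    SSM parameter generators per scan direction (illustrative accounting)."""
    proj = 2 * d * d                  # input/output linear projections
    ssm = directions * 3 * H * d      # B, C, Delta generators per direction
    return proj + ssm

def vit_block_params(d):
    """Rough ViT block count: QKV + output projection (4d^2) and 4x MLP (8d^2)."""
    return 12 * d * d
```

For, say, $d = 96$ and $H = 16$, the SS2D estimate comes out well under half of the ViT block estimate, and the gap widens as $d$ grows.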

3. Interpretability via Controllability-based Analysis

Unlike Transformers, VMamba does not provide explicit attention maps; consequently, classical attribution tools are not directly applicable. To address this, X-VMamba (Mabrok et al., 16 Nov 2025) introduces a controllability-based interpretability framework, quantifying the influence of input tokens on output predictions:

  • Jacobian-based Influence: For general SSM architectures, computes

$$\frac{\partial y_j}{\partial u_k} = \bar{C}_j \left(\prod_{i=k+1}^{j-1} \bar{A}_i\right)\bar{B}_k$$

across all $j \geq k$, summing the Frobenius norms to produce an InfluenceScore$(k)$ for every input token $u_k$. Direct and propagated influence terms provide insight into both immediate and decayed contributions.
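A naive $O(L^2 N)$ NumPy sketch of this score for a diagonal SSM with scalar input/output channels (the function name and shapes are illustrative; for scalar outputs the Frobenius norm reduces to an absolute value):

```python
import numpy as np

def influence_scores(A_bar, B_bar, C_bar):
    """Jacobian-based token influence for a diagonal SSM (illustrative sketch).

    A_bar, B_bar, C_bar: (L, N) per-token discretized SSM parameters.
    Returns (L,) scores: sum over j >= k of |dy_j/du_k|, with
    dy_j/du_k = C_bar_j (prod_{i=k+1}^{j-1} A_bar_i) B_bar_k.
    """
    L, N = A_bar.shape
    scores = np.zeros(L)
    for k in range(L):
        prop = np.ones(N)                       # running product of A_bar_i
        for j in range(k, L):
            scores[k] += abs(np.sum(C_bar[j] * prop * B_bar[k]))
            if j >= k + 1:                      # extend the product for the next j
                prop = prop * A_bar[j]
    return scores
```

The double loop is for clarity only; the referenced framework computes these quantities at linear cost in sequence length.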

  • Gramian-based Influence: For diagonal SSMs, uses the closed-form controllability Gramian

$$W_{c,i}(l) = \frac{\|B_{i,*}(l)\|^2}{1 - a_i^2(l) + \epsilon}$$

weighted by the observability term $C_i(l)^2$, efficiently yielding saliency maps with a single forward pass.
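Because the Gramian is closed-form for a stable diagonal SSM, the score is essentially a one-liner. A NumPy sketch (illustrative shapes, with the layer index $l$ dropped for brevity):

```python
import numpy as np

def gramian_saliency(a, B, C, eps=1e-6):
    """Closed-form controllability-Gramian saliency for a diagonal SSM (sketch).

    a: (N,) diagonal transition entries with |a_i| < 1 (stable)
    B: (N, d) input matrix rows B_{i,*}
    C: (N,) output weights
    Returns (N,) per-state scores: C_i^2 * ||B_{i,*}||^2 / (1 - a_i^2 + eps).
    """
    W_c = np.sum(B * B, axis=1) / (1.0 - a * a + eps)
    return C * C * W_c
```

No backward pass is needed: all quantities are available from the parameters produced during a single forward pass.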

This framework is universally applicable to SSM-based models (not requiring architectural changes or manual tuning) and operates with linear cost in sequence length. Empirical studies reveal that VMamba hierarchically transitions from distributed low-level (e.g., texture) attention to sparse, localized focus in deep layers. Influence distributions contract from uniform in shallow blocks to sharp peaks concentrated on semantically important patches (e.g., lesion boundaries in medical images) (Mabrok et al., 16 Nov 2025).

The controllability perspective reveals that selective scanning/parameterization encodes inductive biases akin to both convolutional locality and attention-like global selection, offering fine-grained diagnostic explanations in contexts such as medical imaging.

4. Applications and Empirical Validation

VMamba has demonstrated competitive or superior performance across visual domains, achieving state-of-the-art results on standard vision tasks:

  • Image classification: On ImageNet-1K, VMamba-S and VMamba-B reach 83.6% and 83.9% top-1 accuracy, respectively, matching or exceeding Swin Transformers and ConvNeXts at comparable parameter/FLOPs budgets (Liu et al., 2024).
  • Segmentation: On ADE20K, VMamba-T achieves 48.3% mIoU, VMamba-S 50.6%, both at significantly lower complexity than ViT/Swin-based models (Liu et al., 2024, Chen et al., 2024).
  • Medical imaging: In breast ultrasound, VMamba outperforms all CNN and ViT baselines in accuracy/AUC, with statistical significance (Nasiri-Sarvi et al., 2024). VMamba-based encoder-decoders (VM-UNet, Mamba-UNet, Weak-Mamba-UNet) consistently surpass U-Net and Transformer hybrids in multi-organ and cardiac segmentation with better Dice, surface distance, and sample efficiency (Chen et al., 2024, Wang et al., 2024, Wang et al., 2024).
  • Efficient high-resolution inference: VMamba exhibits graceful throughput/accuracy trade-offs as input size grows; FLOPs scale linearly, unlike quadratic degradation in attention-based models (Liu et al., 2024).
  • Regression and 3D data: In volumetric permeability prediction, VMamba realizes a 13× reduction in parameters and strictly linear memory scaling relative to both ViT and CNN backbones, without loss in accuracy (Kashefi et al., 16 Oct 2025).
  • Crowd counting: VMambaCC integrates high-level and low-level features efficiently, yielding state-of-the-art MAE/MSE and F1 on dense counting datasets (Ma et al., 2024).

Applications extend into remote sensing, video, multi-modal fusion, and more, as surveyed in (Zhang et al., 2024, Liu et al., 2024, Ibrahim et al., 11 Feb 2025, Xu et al., 2024).

5. Extensions, Innovations, and Derivative Architectures

VMamba provides a versatile basis for both foundational and hybrid designs:

  • Multi-scale and multi-head scanning: MSVMamba and MHS-VM (Multi-Head Scan) introduce hierarchical scanning across full and downsampled resolutions, as well as parallel scanning in distinct subspaces. These approaches reduce redundancy in multi-directional scanning, preserving global context while controlling computational cost (Shi et al., 2024, Ji, 2024).
  • Local-global hybridization: LoG-VMamba complements global SSM context with explicit local token grouping and lightweight global token extraction, achieving SOTA results in both 2D/3D medical segmentation and substantially reducing FLOPs relative to classical VMamba (Dang et al., 2024).
  • Post-training quantization: PTQ4VM addresses quantization bottlenecks for hardware deployment, introducing per-token static quantization and joint step-size optimization, enabling <0.5% top-1 loss at INT8 and up to 1.83× GPU speedup (Cho et al., 2024).
  • Multi-modal and task-specific adaptation: VMamba derivatives (VideoMamba, Panel-Mamba, Weak-Mamba-UNet, etc.) successfully extend the SSM paradigm to spatiotemporal, multimodal, and semi-supervised regimes.
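To make the per-token static quantization idea concrete, here is a minimal NumPy sketch. Absmax calibration stands in for PTQ4VM's jointly optimized step sizes, and the function names and shapes are assumptions, not the paper's API:

```python
import numpy as np

def calibrate_per_token_scales(calib_acts, n_bits=8):
    """Static per-token-position absmax scales from calibration activations.
    A simplified stand-in for jointly optimized step sizes.

    calib_acts: (batches, L, d) activations; returns (L,) scales.
    """
    qmax = 2 ** (n_bits - 1) - 1
    absmax = np.abs(calib_acts).max(axis=(0, 2))   # one scale per token position
    return np.maximum(absmax / qmax, 1e-12)        # guard against all-zero tokens

def fake_quantize(x, scales):
    """Simulate INT8 per-token quantization of (L, d) activations."""
    q = np.clip(np.round(x / scales[:, None]), -128, 127)
    return q * scales[:, None]
```

"Static" means the scales are fixed after calibration rather than recomputed at inference time, which is what makes the scheme hardware-friendly.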

A concise taxonomy is presented below:

| VMamba Variant | Key Innovation | Target Tasks |
| --- | --- | --- |
| SS2D/VMamba | 4-way 2D selective scans, hierarchical backbone | Classification, detection, segmentation |
| MSVMamba | Multi-scale scanning, ConvFFN | High-res, efficient dense tasks |
| LoG-VMamba | Local & global token extraction, single scan | 2D/3D medical segmentation |
| MHS-VM | Multi-head, parallel scans, scan route attention | Small-data, efficiency-critical settings |
| PTQ4VM | Per-token static post-training quantization | Hardware adaptation |
| Weak-Mamba-UNet | Multi-view semi-supervised pseudo-labeling | Weakly-supervised medical segmentation |

6. Limitations, Open Challenges, and Future Directions

While VMamba consistently demonstrates strong accuracy–efficiency trade-offs, several conceptual and practical challenges remain:

  • Scan design: The optimal traversal order, composition of scan axes, positional encoding, and redundancy control in multi-directional scanning remain open (Xu et al., 2024).
  • Spatial inductive bias: Flattening 2D/3D grids for single-directional SSMs may disrupt local spatial correlations; hybrid models and learned scan patterns address but do not fully solve this problem (Zhang et al., 2024, Shi et al., 2024).
  • Lack of explicit attention: Interpretability tools are less mature for SSMs (addressed in part by controllability frameworks (Mabrok et al., 16 Nov 2025)), and the relation to Transformer-style attention is not fully formalized.
  • Scalability and generalization: Large-scale self-supervised pretraining, domain adaptation, adversarial robustness, and principled state-space scaling laws require deeper study (Xu et al., 2024, Liu et al., 2024).
  • Broader adoption and benchmarking: The relative paucity of pretrained weights and limited integration in mainstream vision libraries have so far constrained broad adoption compared to CNN and ViT families.

Ongoing work explores N-dimensional SSMs, adaptive multi-scale scan patterns, efficient fusion with CNNs/Transformers, and scalable multimodal architectures (Xu et al., 2024, Wang et al., 2024, Korcyl et al., 2024).


VMamba represents a fundamentally new regime in vision architectures, synthesizing the memory-efficient, global receptive field of SSMs with content-aware selectivity and hierarchical vision-specific engineering. Its combination of theoretical tractability, engineering efficiency, and interpretability makes it a foundational tool for both discriminative and generative visual modeling (Mabrok et al., 16 Nov 2025).
