
Vision Mamba (ViM) Backbone

Updated 9 December 2025
  • Vision Mamba (ViM) is a vision backbone that replaces self-attention with selective, bidirectional state-space mechanisms for efficient global context modeling.
  • ViM processes images by partitioning them into patches and applying token-dependent SSM scans, achieving hardware-efficient, linear-complexity inference with notable speed-ups.
  • Adaptations of ViM span 2D, volumetric, and spherical data, demonstrating state-of-the-art accuracy on benchmarks like ImageNet and COCO while offering significant memory savings.

Vision Mamba (ViM) is a state-space–model (SSM)–based vision backbone that dispenses entirely with self-attention and instead models visual data using selective, bidirectional state-space mechanisms. By leveraging SSMs, ViM achieves hardware-efficient, linear-complexity global context modeling across high-resolution images, large-scale datasets, and diverse geometries. Variants and adaptations now span conventional 2D images, volumetric data, spherical manifolds, and real-time embedded tasks, establishing Vision Mamba as a principal alternative to Transformer-based backbones.

1. Bidirectional State-Space Model Foundations

The core of Vision Mamba is the selective, bidirectional SSM block, which generalizes classic RNNs by conditioning its transition, update, and output parameters on the input tokens while keeping the state matrix $A$ as a shared, learnable parameter. The continuous-time SSM is given by

$$\frac{d h(t)}{dt} = A h(t) + B x(t), \qquad y(t) = C h(t),$$

where $h(t)$ is the hidden state, $x(t)$ the input, and $y(t)$ the output. Zero-order-hold discretization yields the updates

$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t,$$

with $\bar{A} = \exp(\Delta A)$ and $\bar{B}$ obtained by discrete integration over the step. In Mamba (and thus ViM), the matrices $B_t$, $C_t$ and the step size $\Delta_t$ are token-dependent, produced by learned projections of each input $x_t$, whereas $A$ remains a learnable global parameter.
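
The discretized recurrence can be sketched in a few lines of NumPy. This is an illustrative toy, not the reference implementation: the diagonal per-channel treatment of $A$, the softplus step size, and the simplified Euler-style $\bar{B} = \Delta B$ used below are assumptions made in the spirit of Mamba.

```python
import numpy as np

def selective_ssm_scan(x, A, W_B, W_C, W_delta):
    """Toy selective SSM scan over a token sequence.

    x:       (N, d) input tokens
    A:       (d, s) state matrix, treated as diagonal per channel (assumption)
    W_B:     (d, s) projection producing the token-dependent B_t
    W_C:     (d, s) projection producing the token-dependent C_t
    W_delta: (d, d) projection producing the token-dependent step size Delta_t
    """
    N, d = x.shape
    s = A.shape[1]
    h = np.zeros((d, s))                                # hidden state per channel
    ys = np.empty((N, d))
    for t in range(N):
        B_t = x[t] @ W_B                                # (s,) input matrix for this token
        C_t = x[t] @ W_C                                # (s,) output matrix for this token
        delta_t = np.log1p(np.exp(x[t] @ W_delta))      # (d,) softplus step size
        A_bar = np.exp(delta_t[:, None] * A)            # zero-order hold: exp(Delta * A)
        B_bar = delta_t[:, None] * B_t[None, :]         # simplified Euler-style Delta * B
        h = A_bar * h + B_bar * x[t][:, None]           # h_t = A_bar h_{t-1} + B_bar x_t
        ys[t] = (h * C_t[None, :]).sum(axis=1)          # y_t = C_t h_t, per channel
    return ys
```

In practice this recurrence is evaluated with a hardware-aware parallel scan kernel rather than a Python loop.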

Bidirectionality in ViM is achieved by running two parallel scans across the input sequence (tokens derived from image patches): one forward ($t = 1 \rightarrow N$) and one backward ($t = N \rightarrow 1$). Outputs from both scans are fused, either additively or via learned projections, allowing each token to aggregate global information from the entire image context in $O(Nd)$ time, where $N$ is the token count and $d$ the hidden/state dimension (Zhu et al., 2024, Nasiri-Sarvi et al., 2024, Nasiri-Sarvi et al., 2024, He et al., 24 Jan 2025).
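
Under the same toy setup, additive fusion of the two directions can be written as follows (the backward scan uses its own parameters; gated or projected fusion schemes are also used in practice):

```python
def bidirectional_scan(x, params_fwd, params_bwd):
    """Additively fuse a forward and a backward selective scan (toy sketch).

    x: (N, d) patch-token sequence; params_* are (A, W_B, W_C, W_delta) tuples.
    """
    y_fwd = selective_ssm_scan(x, *params_fwd)                # scan t = 1 -> N
    y_bwd = selective_ssm_scan(x[::-1], *params_bwd)[::-1]    # scan t = N -> 1, then re-flip
    return y_fwd + y_bwd                                      # each token sees both directions
```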

2. Patch Embedding, Position Encoding, and Block Structure

ViM processes 2D images by partitioning them into non-overlapping $P \times P$ patches, each flattened and projected to a $d$-dimensional embedding via a linear layer. Position information is encoded with learnable vectors $E_{\mathrm{pos}} \in \mathbb{R}^{N \times d}$:

$$z_i = \mathrm{Linear}(x_i) + E_{\mathrm{pos}}[i],$$

with $N = (H/P) \times (W/P)$ for an image of size $H \times W$. The resulting sequence is prepended with a class token and processed through $L$ cascaded bidirectional Mamba blocks, each consisting of:

  • Input normalization (e.g., LayerNorm),
  • Parallel forward and backward SSM scans with token-wise, input-adaptive parameters,
  • Output gating and fusion,
  • Residual addition,
  • An MLP (positionwise feed-forward network) with activation.

Each Mamba block captures both short-range dependencies (local, via the input-dependent $B_t, C_t$) and long-range dependencies (global, via the bidirectional scan). Patch merging for spatial compression is supported via strided convolutions (e.g., $2 \times 2$ with stride 2), enabling hierarchical architectures (Zhu et al., 2024, Nasiri-Sarvi et al., 2024, Ibrahim et al., 11 Feb 2025, Nasiri-Sarvi et al., 2024, Lai et al., 2024). A minimal sketch of the patch-embedding step and one block follows.
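
The sketch below strings these pieces together in plain NumPy, reusing the toy `bidirectional_scan` above. The shapes, the ReLU MLP, and the omission of output gating are simplifications, not the actual block design.

```python
import numpy as np

def patch_embed(img, P, W_proj, E_pos):
    """Split an (H, W, C) image into non-overlapping P x P patches and embed them.

    W_proj: (P*P*C, d) linear projection; E_pos: (N, d) position embeddings.
    """
    H, W, C = img.shape
    gh, gw = H // P, W // P
    patches = (img[:gh * P, :gw * P]
               .reshape(gh, P, gw, P, C)
               .transpose(0, 2, 1, 3, 4)             # (gh, gw, P, P, C)
               .reshape(gh * gw, P * P * C))         # one row per flattened patch
    return patches @ W_proj + E_pos                  # z_i = Linear(x_i) + E_pos[i]

def vim_block(z, params_fwd, params_bwd, W_mlp1, W_mlp2):
    """One schematic block: norm -> bidirectional scan -> residual -> norm -> MLP -> residual."""
    def layernorm(a):
        return (a - a.mean(-1, keepdims=True)) / (a.std(-1, keepdims=True) + 1e-6)
    h = z + bidirectional_scan(layernorm(z), params_fwd, params_bwd)   # SSM sub-block
    return h + np.maximum(layernorm(h) @ W_mlp1, 0.0) @ W_mlp2         # MLP sub-block
```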

3. Scaling Laws, Complexity, and Empirical Performance

ViM fully eliminates the $O(N^2)$ time and space bottleneck of Transformer attention, scaling linearly with sequence length as $O(Nd)$. Practical hardware speed-ups are realized via diagonalization of the SSM state matrix $A$ and FFT-based implementation of global convolutions. Batch inference in ViM is 2.8× faster and 86.8% more memory-efficient than DeiT on 1248×1248 input images (Zhu et al., 2024).
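
As a back-of-envelope illustration of why this matters at high resolution (the token and channel sizes below, and the leading-term FLOP counts, are rough assumptions rather than measured costs):

```python
# Rough leading-term cost per layer at a 1248 x 1248 input with 16 x 16 patches.
N = (1248 // 16) ** 2            # 6084 tokens
d, d_state = 384, 16             # assumed embedding width and SSM state size

attn_flops = 2 * N * N * d               # QK^T plus attention-weighted V
ssm_flops = 2 * (2 * N * d * d_state)    # two scan directions, state update plus readout

print(f"self-attention ~ {attn_flops / 1e9:.1f} GFLOPs per layer")
print(f"bidirectional SSM scan ~ {ssm_flops / 1e9:.3f} GFLOPs per layer")
```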

ViM outperforms or matches strong ViT baselines across:

  • ImageNet-1K (ViM-Small: 81.5% top-1 vs. DeiT-Small 79.8%);
  • COCO object detection and ADE20K segmentation (e.g., ViM-S: 45.6 mIoU vs. DeiT-Small 41.8) (Zhu et al., 2024);
  • Medical and specialized domains (e.g., Camelyon16 histopathology: Vim-ti AUC 95.81 vs. ViT-ti 87.60) (Nasiri-Sarvi et al., 2024). Extended studies report statistically significant improvements in limited-data transfer-learning settings (Nasiri-Sarvi et al., 2024, Lai et al., 2024).

4. Adaptations: Surfaces, 3D, Frequency, Efficiency

ViM generalizes to non-Euclidean and higher-dimensional data:

  • Surface Vision Mamba (SiM): For spherical cortical surfaces, data is partitioned into triangular patches via icosphere subdivision. The token sequence is processed bidirectionally, yielding up to 4.8× faster inference and 91.7% memory savings relative to attention-based baselines (He et al., 24 Jan 2025).
  • MobileViM: In 3D medical imaging, MobileViM implements "dimension-independent" SSM traversal, i.e., axis-wise passes along D/H/W with dual directionality. Cross-scale bridging preserves spatial detail at deep levels, achieving state-of-the-art Dice scores at over 90 FPS with a sub-7M parameter count (Dai et al., 19 Feb 2025).
  • Frequency-domain enhancements: Vim-F fuses amplitude spectra from FFTs of the spatial domain with patch embeddings, eliminating the need for position embeddings and recovering locality lost in 1D flattening; top-1 ImageNet gains of up to +1.3% are reported (Zhang et al., 2024). A minimal sketch follows this list.
  • Hybrid designs (TinyViM): Frequency-decoupled Laplace mixers direct SSM modeling toward low-frequency features, conserving compute while mobile-friendly convolutions capture high-frequency content. Dynamic frequency allocation across network depth further boosts classification and dense prediction accuracy (Ma et al., 2024).
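
For concreteness, a Vim-F-style frequency fusion can be sketched as below, reusing the toy `patch_embed` above. The additive fusion, the log-amplitude compression, and the zeroed position embeddings are assumptions about the general idea, not the paper's exact design.

```python
import numpy as np

def frequency_fused_embed(img, P, W_proj_spatial, W_proj_freq, d):
    """Vim-F-style embedding sketch: fuse spatial patches with FFT-amplitude patches."""
    H, W, C = img.shape
    N = (H // P) * (W // P)
    no_pos = np.zeros((N, d))                               # Vim-F reports position embeddings can be dropped
    amp = np.log1p(np.abs(np.fft.fft2(img, axes=(0, 1))))   # amplitude spectrum of the image
    z_spatial = patch_embed(img, P, W_proj_spatial, no_pos)
    z_freq = patch_embed(amp, P, W_proj_freq, no_pos)
    return z_spatial + z_freq                               # additive fusion (an assumption)
```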

5. Efficient Training, Pruning, and Deployment

ViM has been extended with domain-specific efficiency techniques:

  • Token fusion / pruning: Cross-layer cosine-similarity-based fusion (Famba-V) reduces computation and memory by up to 28% with under 0.5% accuracy loss when applied to upper layers only (Shen et al., 2024); a fusion sketch follows this list. Post-hoc merging plus retraining (R-MeeTo) restores up to 44 points of accuracy lost to aggressive token reduction within 20 minutes of fine-tuning for large ViMs (Shi et al., 2024).
  • Coarse-to-fine inference: Adaptive patch granularity (CF-ViM) processes simple images at coarse resolution and selectively re-processes low-confidence regions at finer scales, achieving roughly 47% FLOPs savings while preserving baseline accuracy (Liu et al., 29 Nov 2025).
  • Vector quantization: ViM-VQ quantizes Mamba weights to 1-3 bits via convex-combination codebooks and incremental hardening, shrinking model size roughly 15× for edge deployment with minimal accuracy loss (under 1.5 pp on ImageNet, COCO, etc.) (Deng et al., 12 Mar 2025).
  • Regularization: Stochastic Layer-Wise Shuffle (SLWS) permutes tokens with a layer-depth-dependent probability during training, curbing overfitting and enabling stable scaling to hundreds of millions of parameters (Huang et al., 2024).
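
A Famba-V-style token-fusion step can be sketched as follows. The greedy pairing rule and plain averaging are illustrative assumptions; the paper's cross-layer strategies decide at which layers fusion is applied.

```python
import numpy as np

def fuse_similar_tokens(z, r):
    """Merge the r most similar token pairs (by cosine similarity) by averaging them.

    Illustrative sketch only. z: (N, d) tokens; returns roughly (N - r, d).
    """
    zn = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)
    sim = zn @ zn.T                                   # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)                    # ignore self-similarity
    merged, dropped = {}, set()
    # Greedily pick the r most similar, non-overlapping pairs.
    for i, j in zip(*np.unravel_index(np.argsort(sim, axis=None)[::-1], sim.shape)):
        if len(merged) == r:
            break
        if i in dropped or j in dropped or i in merged or j in merged:
            continue
        merged[i] = (z[i] + z[j]) / 2.0               # average the pair into token i
        dropped.add(j)                                # drop token j
    rows = [merged.get(i, z[i]) for i in range(len(z)) if i not in dropped]
    return np.stack(rows)
```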

6. Applications: Medical Imaging, Low-Data, Manifold Geometry

ViM and its derivatives have demonstrated:

  • Robustness to data scarcity and improved generalization in medical imaging datasets (histopathology, ultrasound, brain MRI), often significantly narrowing the accuracy-parameter gap versus ViT and CNNs (Nasiri-Sarvi et al., 2024, Nasiri-Sarvi et al., 2024, Lai et al., 2024);
  • Real-time segmentation on 3D medical data (MobileViM), efficient spherical data modeling (SiM), and high-accuracy transfer learning to medical classification tasks;
  • End-to-end explainability (e.g., Grad-CAM on ViM CLS tokens reflects pathologist spatial navigation in histopathology) (Nasiri-Sarvi et al., 2024);
  • State-of-the-art low-resolution fine-grained classification (ViMD), which pairs a super-resolution ViM-tiny student with multi-level distillation from a high-resolution ViM teacher, yielding compact models for embedded deployment (Chen et al., 2024); a generic distillation-loss sketch follows this list;
  • Efficient task adaptation via selective visual prompting (SVP), which outperforms classical visual prompts from the ViT literature (Yao et al., 2024).
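
As an illustration of the kind of objective involved, a generic multi-level distillation loss (logit matching plus feature matching) might look like the sketch below; ViMD's actual losses, feature levels, and weighting may differ.

```python
import numpy as np

def multilevel_distillation_loss(student_logits, teacher_logits,
                                 student_feats, teacher_feats, T=4.0, alpha=0.5):
    """Generic sketch: soft-label cross-entropy on logits plus MSE on intermediate features."""
    def softmax(a):
        e = np.exp(a - a.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    p_t = softmax(teacher_logits / T)
    log_p_s = np.log(softmax(student_logits / T) + 1e-12)
    kd = -(p_t * log_p_s).sum(axis=-1).mean() * T * T          # soft-label term (KL up to a constant)
    feat = sum(np.mean((s - t) ** 2)                           # feature matching, averaged over levels
               for s, t in zip(student_feats, teacher_feats)) / len(student_feats)
    return alpha * kd + (1 - alpha) * feat
```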

7. Open Directions and Limitations

While ViM has established itself as a principal SSM backbone, open limitations and active research frontiers remain.

ViM research thus intersects major axes in modern vision modeling: linear-complexity global sequence modeling, generalization to arbitrary manifolds/dimensions, training efficiency, and real-world deployment. Its empirical success across classification, detection, segmentation, and medical domains signals a fundamental shift away from attention-centered paradigms, with diverse ongoing extensions (Zhu et al., 2024, Nasiri-Sarvi et al., 2024, He et al., 24 Jan 2025, Dai et al., 19 Feb 2025, Deng et al., 12 Mar 2025, Shi et al., 2024, Yao et al., 2024, Lai et al., 2024).
