
Vision Mamba (ViM) Backbone

Updated 9 December 2025
  • Vision Mamba (ViM) is a vision backbone that replaces self-attention with selective, bidirectional state-space mechanisms for efficient global context modeling.
  • ViM processes images by partitioning them into patches and applying token-dependent SSM scans, achieving hardware-efficient, linear-complexity inference with notable speed-ups.
  • Adaptations of ViM span 2D, volumetric, and spherical data, demonstrating state-of-the-art accuracy on benchmarks like ImageNet and COCO while offering significant memory savings.

Vision Mamba (ViM) is a state-space–model (SSM)–based vision backbone that dispenses entirely with self-attention and instead models visual data using selective, bidirectional state-space mechanisms. By leveraging SSMs, ViM achieves hardware-efficient, linear-complexity global context modeling across high-resolution images, large-scale datasets, and diverse geometries. Variants and adaptations now span conventional 2D images, volumetric data, spherical manifolds, and real-time embedded tasks, establishing Vision Mamba as a principal alternative to Transformer-based backbones.

1. Bidirectional State-Space Model Foundations

The core of Vision Mamba is the selective, bidirectional SSM block, which generalizes classic RNNs by conditioning its input, output, and step-size parameters on the input tokens while retaining a globally shared, time-invariant state matrix. In continuous time, the SSM is given by

$$\frac{dh(t)}{dt} = A h(t) + B x(t), \qquad y(t) = C h(t)$$

where $h(t)$ is the hidden state, $x(t)$ the input, and $y(t)$ the output. Discretization (using zero-order hold) yields the updates

$$h_{t+1} = \bar{A} h_t + \bar{B} x_t, \qquad y_t = C h_t$$

with $\bar{A} = \exp(\Delta A)$ and $\bar{B}$ determined by discrete integration. In Mamba (and thus ViM), the matrices $B_t$, $C_t$ and the step size $\Delta_t$ are token-dependent, produced by learned projections of each input $x_t$, whereas $A$ remains a learnable global parameter.
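
As an illustration of this recurrence, the following minimal NumPy sketch (a simplification, not the official implementation; the shapes and the projections producing $B_t$, $C_t$, and $\Delta_t$ are assumptions, and $A$ is taken to be diagonal as in Mamba) shows the zero-order-hold discretization and the resulting token-by-token state update for a single input channel.

```python
import numpy as np

def selective_ssm_scan(x, A, B_t, C_t, delta_t):
    """Sketch of a selective SSM recurrence over one token sequence.

    x:       (N,)   input tokens (single channel for simplicity)
    A:       (d,)   diagonal state matrix (global, learnable)
    B_t:     (N, d) token-dependent input matrices
    C_t:     (N, d) token-dependent output matrices
    delta_t: (N,)   token-dependent step sizes
    Returns y: (N,) outputs.
    """
    N, d = B_t.shape
    h = np.zeros(d)
    y = np.zeros(N)
    for t in range(N):
        # Zero-order hold with diagonal A:
        # A_bar = exp(delta * A), B_bar = (A_bar - 1) / A * B (per element)
        A_bar = np.exp(delta_t[t] * A)
        B_bar = (A_bar - 1.0) / A * B_t[t]
        h = A_bar * h + B_bar * x[t]   # state update h_{t+1} = A_bar h_t + B_bar x_t
        y[t] = C_t[t] @ h              # readout y_t = C_t h_t
    return y
```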

Bidirectionality in ViM is achieved by running two parallel scans across the input sequence (tokens derived from image patches): one forward ($t = 1 \rightarrow N$) and one backward ($t = N \rightarrow 1$). Outputs from both scans are fused, either additively or via learned projections, allowing each token to aggregate global information from the entire image context in $O(Nd)$ time, where $N$ is the token count and $d$ is the hidden/state dimension (Zhu et al., 17 Jan 2024, Nasiri-Sarvi et al., 4 Jul 2024, Nasiri-Sarvi et al., 20 Apr 2024, He et al., 24 Jan 2025).
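
A minimal sketch of this bidirectional fusion, reusing selective_ssm_scan from the sketch above and assuming additive fusion (the learned-projection variant would replace the final sum):

```python
def bidirectional_ssm(x, A, B_t, C_t, delta_t):
    """Fuse a forward and a backward selective scan over the token sequence.

    The backward pass runs the same recurrence on the reversed sequence and
    flips its output back before fusion, so every token aggregates context
    from both directions in O(N*d) time.
    """
    y_fwd = selective_ssm_scan(x, A, B_t, C_t, delta_t)
    y_bwd = selective_ssm_scan(x[::-1], A, B_t[::-1], C_t[::-1], delta_t[::-1])[::-1]
    return y_fwd + y_bwd  # additive fusion of the two directions
```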

2. Patch Embedding, Position Encoding, and Block Structure

ViM processes 2D images by partitioning them into non-overlapping $P \times P$ patches, each flattened and projected to a $d$-dimensional embedding via a linear layer. Position information is encoded using learnable vectors $E_{\mathrm{pos}} \in \mathbb{R}^{N \times d}$:

$$z_i = \mathrm{Linear}(x_i) + E_{\mathrm{pos}}[i]$$

with $N = (H/P) \times (W/P)$ for image size $H \times W$. The resulting sequence is prepended with a class token and processed through $L$ cascaded bidirectional Mamba blocks, each consisting of:

  • Input normalization (e.g., LayerNorm),
  • Parallel forward and backward SSM scans with token-wise, input-adaptive parameters,
  • Output gating and fusion,
  • Residual addition,
  • An MLP (positionwise feed-forward network) with activation.

Each Mamba block enables both short-range (local, via the input-dependent $B$, $C$) and long-range (global, via the bidirectional scan) dependencies. Patch merging for spatial compression is supported via strided convolutions (e.g., $2 \times 2$ with stride 2), enabling hierarchical architectures (Zhu et al., 17 Jan 2024, Nasiri-Sarvi et al., 4 Jul 2024, Ibrahim et al., 11 Feb 2025, Nasiri-Sarvi et al., 20 Apr 2024, Lai et al., 29 Oct 2024).
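
The patch embedding and position encoding described above can be sketched as follows (NumPy arrays; the projection weight W_proj and embedding E_pos are illustrative parameters, and the class token and the Mamba blocks themselves are omitted):

```python
def patch_embed(image, P, W_proj, E_pos):
    """Split an (H, W, C) image into non-overlapping P x P patches,
    flatten each patch, project it to d dimensions, and add a learnable
    position embedding, giving z_i = Linear(x_i) + E_pos[i].

    W_proj: (P*P*C, d) linear projection
    E_pos:  (N, d) position embeddings, N = (H/P) * (W/P)
    """
    H, W, C = image.shape
    patches = (image.reshape(H // P, P, W // P, P, C)
                    .transpose(0, 2, 1, 3, 4)   # group patch rows/cols together
                    .reshape(-1, P * P * C))    # (N, P*P*C)
    return patches @ W_proj + E_pos             # (N, d) token sequence
```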

3. Scaling Laws, Complexity, and Empirical Performance

ViM fully eliminates the $O(N^2)$ time and space bottleneck of Transformer attention, scaling linearly with sequence length ($O(Nd)$). Practical hardware speed-ups are realized via diagonalization of the SSM state matrix $A$ and FFT-based implementation of global convolutions. Batch inference in ViM is $2.8\times$ faster and $86.8\%$ more memory efficient than DeiT on $1248 \times 1248$ input images (Zhu et al., 17 Jan 2024).
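
For intuition, the arithmetic below uses the 1248x1248 resolution quoted above together with an assumed 16x16 patch size and a hidden/state dimension of 192 (both illustrative values, not taken from the papers) to compare the quadratic attention cost against the linear scan cost:

```python
H = W = 1248                      # input resolution from the benchmark above
P = 16                            # assumed patch size (illustrative)
d = 192                           # assumed hidden/state dimension (illustrative)

N = (H // P) * (W // P)           # 78 * 78 = 6084 tokens
attention_pairs = N * N           # ~3.7e7 pairwise interactions per layer
scan_cost = N * d                 # ~1.2e6 state updates per scan direction

print(N, attention_pairs, scan_cost, attention_pairs / scan_cost)  # ratio ~32x
```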

ViM outperforms or matches strong ViT baselines across standard benchmarks, including ImageNet classification and COCO detection and segmentation.

4. Adaptations: Surfaces, 3D, Frequency, Efficiency

ViM generalizes to non-Euclidean and higher-dimensional data:

  • Surface Vision Mamba (SiM): For spherical cortical surfaces, data is partitioned into triangular patches via icosphere subdivision. The token sequence is processed bidirectionally, yielding up to $4.8\times$ faster inference and $91.7\%$ memory savings relative to attention-based baselines (He et al., 24 Jan 2025).
  • MobileViM: In 3D medical imaging, MobileViM implements "dimension-independent" SSM traversal—axis-wise passes (D/H/W) with dual directionality. Cross-scale bridging ensures high spatial detail at deep levels, achieving state-of-the-art Dice scores at over 90 FPS with sub-7M parameter count (Dai et al., 19 Feb 2025).
  • Frequency-domain enhancements: Vim-F fuses amplitude spectra from FFTs of the spatial domain with patch embeddings, eliminating the need for position embeddings and recovering locality lost in 1D flattening (a minimal sketch of this fusion follows this list). Top-1 ImageNet gains of up to $+1.3\%$ are reported (Zhang et al., 29 May 2024).
  • Hybrid designs (TinyViM): Frequency-decoupled Laplace mixers direct SSM modeling toward low-frequency features, conserving compute while mobile-friendly convolutions capture high-frequency content. Dynamic frequency allocation across network depth further boosts classification and dense prediction accuracy (Ma et al., 26 Nov 2024).
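
As a concrete reading of the frequency-domain idea, the hypothetical sketch below (an assumption-laden illustration, not the published Vim-F code) patchifies the FFT amplitude spectrum alongside the spatial patches and sums the two projected embeddings:

```python
import numpy as np

def frequency_fused_embed(image, P, W_spatial, W_freq):
    """Hypothetical fusion of spatial patches with FFT-amplitude patches.

    image:     (H, W) single-channel image for simplicity
    W_spatial: (P*P, d) projection for spatial patches
    W_freq:    (P*P, d) projection for amplitude-spectrum patches
    Returns:   (N, d) fused token embeddings, N = (H/P) * (W/P)
    """
    amplitude = np.abs(np.fft.fftshift(np.fft.fft2(image)))  # global frequency content

    def patchify(x):
        H, W = x.shape
        return (x.reshape(H // P, P, W // P, P)
                 .transpose(0, 2, 1, 3)
                 .reshape(-1, P * P))

    # Summing the two projections is an assumption; the paper's exact fusion may differ.
    return patchify(image) @ W_spatial + patchify(amplitude) @ W_freq
```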

5. Efficient Training, Pruning, and Deployment

ViM has been extended with domain-specific efficiency techniques:

  • Token fusion / pruning: Cross-layer cosine-similarity–based fusion (Famba-V) reduces computation and memory by up to $28\%$ with $<0.5\%$ accuracy loss if applied to upper layers only (Shen et al., 15 Sep 2024); a minimal fusion sketch follows this list. Post-hoc merging plus retraining (R-MeeTo) restores up to 44 points of accuracy lost to aggressive token reduction in under 20 minutes of fine-tuning for large ViMs (Shi et al., 17 Dec 2024).
  • Coarse-to-fine inference: Adaptive patch granularity (CF-ViM) processes simple images at coarse resolutions, selectively re-processing regions at finer scales for inputs with low confidence. This strategy achieves $\sim 47\%$ FLOPs savings while preserving baseline accuracy (Liu et al., 29 Nov 2025).
  • Vector quantization: ViM-VQ quantizes Mamba weights to 1–3 bits via convex-combination codebooks and incremental hardening, shrinking model size $\sim 15\times$ for edge deployment with minimal accuracy loss (<1.5 pp on ImageNet, COCO, etc.) (Deng et al., 12 Mar 2025).
  • Regularization: Stochastic Layer-Wise Shuffle (SLWS) permutes tokens with layer-depth–dependent probability during training, curbing overfitting and enabling stable scaling to hundreds of millions of parameters (Huang et al., 30 Aug 2024).
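
To make the token-fusion idea concrete, here is a minimal sketch of cosine-similarity-based merging (an illustrative reading, not the Famba-V implementation; the adjacent-pair strategy and simple averaging are assumptions):

```python
import numpy as np

def merge_most_similar_tokens(tokens, num_merges):
    """Greedily merge the most cosine-similar adjacent token pairs.

    tokens: (N, d) token embeddings from one layer
    Returns a shorter (N - num_merges, d) sequence.
    """
    tokens = list(tokens)
    for _ in range(num_merges):
        normed = np.stack([t / (np.linalg.norm(t) + 1e-8) for t in tokens])
        sims = np.sum(normed[:-1] * normed[1:], axis=1)   # adjacent cosine similarities
        i = int(np.argmax(sims))                          # most redundant neighbouring pair
        tokens[i] = (tokens[i] + tokens[i + 1]) / 2       # replace the pair by its mean
        del tokens[i + 1]
    return np.stack(tokens)
```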

6. Applications: Medical Imaging, Low-Data, Manifold Geometry

ViM and its derivatives have demonstrated strong results in medical imaging (volumetric segmentation with MobileViM and cortical-surface analysis with SiM, Section 4), competitive accuracy in low-data and resource-constrained regimes, and applicability to manifold-valued geometry such as spherical surfaces.

7. Open Directions and Limitations

While ViM has established itself as a principal SSM backbone, key limitations and research frontiers remain, including recovering the 2D locality lost in 1D token flattening, training stability and overfitting at scale, and accuracy-preserving compression and token reduction for deployment.

ViM research thus intersects major axes in modern vision modeling: linear-complexity global sequence modeling, generalization to arbitrary manifolds/dimensions, training efficiency, and real-world deployment. Its empirical success across classification, detection, segmentation, and medical domains signals a fundamental shift away from attention-centered paradigms, with diverse ongoing extensions (Zhu et al., 17 Jan 2024, Nasiri-Sarvi et al., 20 Apr 2024, He et al., 24 Jan 2025, Dai et al., 19 Feb 2025, Deng et al., 12 Mar 2025, Shi et al., 17 Dec 2024, Yao et al., 12 Dec 2024, Lai et al., 29 Oct 2024).
