EfficientVMamba: Multi-Scale Visual SSMs

Updated 22 February 2026
  • EfficientVMamba is a visual backbone architecture that integrates state-space models with multi-scale selective scanning to reduce computational overhead while maintaining accuracy.
  • Its design employs one full-resolution scan with three half-resolution scans, reducing FLOPs by 2.3× compared to traditional VMamba architectures.
  • Incorporating a lightweight ConvFFN for enhanced cross-channel mixing, it achieves competitive performance on ImageNet, COCO, and ADE20K benchmarks.

EfficientVMamba refers to a class of visual backbone architectures that combine state space models (SSMs) with algorithmic and architectural innovations to maximize computational efficiency, especially for vision tasks where large spatial resolution and global context are required. The term is most precisely associated with the “Multi-Scale Vision Mamba” (MSVMamba) approach, which implements a multi-scale selective scan mechanism to reduce compute and memory footprint relative to earlier SSM-based visual backbones, while achieving state-of-the-art accuracy on standard vision benchmarks (Shi et al., 2024). Recent literature, including parallel lines such as “Fast Vision Mamba” (Kapse et al., 1 Feb 2025), further extends this efficiency frontier through hardware-friendly and algorithmic modifications. This entry covers the mathematical foundation, architectural design, complexity analysis, implementation details, and benchmark results that define the EfficientVMamba family within the SSM vision modeling paradigm.

1. Mathematical Foundation: SSMs and the Visual Mamba Block

EfficientVMamba builds upon the Mamba SSM paradigm. The continuous-time SSM evolves a hidden state $h(t)\in\mathbb{R}^N$ under an input drive $x(t)$,

$$\frac{dh}{dt} = A h(t) + B x(t), \qquad y(t) = C h(t) + D x(t)$$

with matrices $A$, $B$, $C$, $D$. Discrete-time evolution with step $\Delta$ yields

$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t + D x_t,$$

with $\bar{A} = \exp(\Delta A)$ and $\bar{B} \approx \Delta B$. Eliminating the state yields a convolution with kernel $K$:

$$y_{1:L} = x_{1:L} * K, \qquad K = [C\bar{B},\ C\bar{A}\bar{B},\ \dots,\ C\bar{A}^{L-1}\bar{B}] \in \mathbb{R}^{L}$$

The "S6" or "Mamba" block replaces these parameters with data-dependent versions $\bar{A}_i$, $\bar{B}_i$, $C_i$, preserving $O(L)$ computation.
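The equivalence between the recurrence and the convolutional form can be checked numerically. A minimal sketch, assuming a diagonal stable $A$, a single input/output channel, and arbitrary toy dimensions:

```python
import numpy as np

# Discretized SSM: h_t = Abar h_{t-1} + Bbar x_t,  y_t = C h_t + D x_t,
# versus the equivalent convolution with K[k] = C Abar^k Bbar.
rng = np.random.default_rng(0)
N, L = 4, 8                                # state dim, sequence length
A = -np.diag(rng.uniform(0.5, 1.5, N))     # stable diagonal dynamics
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
D = rng.standard_normal((1, 1))
dt = 0.1

Abar = np.diag(np.exp(np.diag(A) * dt))    # exp(dt*A) for diagonal A
Bbar = dt * B                              # first-order approximation, as in the text

x = rng.standard_normal(L)

# Recurrent evaluation
h = np.zeros((N, 1))
y_rec = np.zeros(L)
for t in range(L):
    h = Abar @ h + Bbar * x[t]
    y_rec[t] = (C @ h + D * x[t]).item()

# Convolutional evaluation with kernel K
K = np.array([(C @ np.linalg.matrix_power(Abar, k) @ Bbar).item() for k in range(L)])
y_conv = np.array([sum(K[k] * x[t - k] for k in range(t + 1)) + (D * x[t]).item()
                   for t in range(L)])

assert np.allclose(y_rec, y_conv)          # both paths produce identical outputs
```

The S6 block keeps the recurrent form (so $\bar{A}$, $\bar{B}$, $C$ can vary per token) while retaining the same linear-time cost.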

For images, VMamba (Liu et al., 2024) generalizes this to 2D by scanning a flattened feature map along four complementary spatial orders, computing one SSM per route, and aggregating outputs.
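The four complementary scan orders can be sketched with plain array operations (a single-channel toy illustration; the actual cross-scan operates on multi-channel feature maps):

```python
import numpy as np

# Four 1D token orders of an H x W map: row-major, column-major,
# and their reversals, as used by VMamba-style cross-scan.
H, W = 2, 3
Z = np.arange(H * W).reshape(H, W)   # toy feature map

routes = [
    Z.flatten(),              # row-major, forward
    Z.T.flatten(),            # column-major, forward
    Z.flatten()[::-1],        # row-major, backward
    Z.T.flatten()[::-1],      # column-major, backward
]
# Each route feeds its own SSM scan; outputs are mapped back to 2D and merged.
```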

2. Multi-Scale 2D Selective Scanning and Architectural Design

MSVMamba/“EfficientVMamba” (Shi et al., 2024) innovates by replacing redundant multi-scan operations in VMamba with an explicit multi-scale design. The core steps are:

  • Feature Branches: Start from the input $Z_1 \in \mathbb{R}^{H\times W\times D}$; construct a downsampled map $Z_2 = \mathrm{DWConv}_{3\times3,\,s}(Z_1)$ with stride $s=2$.
  • Selective Scanning: Define scan/order transforms $\sigma_k$ ($k=1,\dots,4$). Apply the S6 block to $\sigma_1(Z_1)$ (full resolution) and, with shared S6 weights, to three scan orders of $Z_2$ (half resolution).
  • Output Fusion: The outputs $Y_i$ are reshaped and interpolated, then summed:

$$Z' = \gamma_1(Y_1) + \mathrm{Interpolate}\bigl(\gamma_2(Y_2) + \gamma_3(Y_3) + \gamma_4(Y_4)\bigr)$$

This hierarchy, with just one full-res and three half-res scans, preserves long-range dependency learning while lowering complexity by not scanning all directions at full resolution.
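The steps above can be sketched at the shape level. Here `dwconv_stride2` and `s6_scan` are hypothetical stand-ins (the real branches use a learned depthwise convolution and selective scans over different orders); only the branch/fusion structure is illustrated:

```python
import numpy as np

def dwconv_stride2(Z):
    # placeholder for DWConv_{3x3, s=2}: stride-2 subsampling (assumption for brevity)
    return Z[::2, ::2]

def s6_scan(Z):
    # stand-in for a selective (S6) scan; identity here, only shapes matter
    return Z

H, W = 8, 8
Z1 = np.random.default_rng(0).standard_normal((H, W))
Z2 = dwconv_stride2(Z1)                      # half-resolution branch

Y1 = s6_scan(Z1)                             # one full-resolution scan
Y_low = sum(s6_scan(Z2) for _ in range(3))   # three half-resolution scans
up = np.repeat(np.repeat(Y_low, 2, axis=0), 2, axis=1)  # nearest-neighbor upsample
Z_out = Y1 + up                              # fused output, back at full resolution
assert Z_out.shape == (H, W)
```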

A lightweight Convolutional Feed-Forward Network (ConvFFN) addresses the deficiency of token-only mixing in SSMs, adding cross-channel mixing:

$$\mathrm{ConvFFN}(X) = W_2\,\phi(\mathrm{DWConv}_{3\times3}(\phi(W_1 X)))$$

where $W_1$, $W_2$ are $1\times1$ expansion and projection convolutions, and $\phi$ is GELU.
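A minimal sketch of this formula in plain numpy, assuming a 4x channel expansion and an all-ones depthwise kernel as a stand-in for the learned $3\times3$ weights (both assumptions, not from the paper):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def conv_ffn(X, W1, W2):
    """ConvFFN(X) = W2 phi(DWConv3x3(phi(W1 X))).
    X: (H, W, D); W1: (D, 4D) expansion; W2: (4D, D) projection."""
    H, Wd, _ = X.shape
    Y = gelu(X @ W1)                         # 1x1 expansion (matmul over channels)
    # depthwise 3x3 with zero padding; per-channel box filter as placeholder weights
    P = np.pad(Y, ((1, 1), (1, 1), (0, 0)))
    Y = sum(P[i:i + H, j:j + Wd] for i in range(3) for j in range(3)) / 9.0
    return gelu(Y) @ W2                      # 1x1 projection back to D channels

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4, 8))
W1 = rng.standard_normal((8, 32))
W2 = rng.standard_normal((32, 8))
out = conv_ffn(X, W1, W2)
assert out.shape == X.shape                  # spatial and channel shape preserved
```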

3. Computational Complexity and Efficiency Gains

The complexity analysis crucially differentiates EfficientVMamba from its predecessors:

  • Single-scale VMamba (Liu et al., 2024): four full-resolution scans incur $O(4LDN)$ cost per block, with $L = HW$ (spatial tokens), $D$ (channels), and $N$ (state dimension).
  • Multi-scale MSVMamba (Shi et al., 2024): one full-resolution scan plus three scans on an $s=2$ downsampled map; the cost per block becomes:

$$O\bigl(DN[L + 3L/4]\bigr) = O(1.75\,LDN)$$

which represents a $2.3\times$ reduction in scan FLOPs over the original. Kernel memory requirements drop likewise.

  • ConvFFN overhead is negligible compared to total block cost.
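The quoted factor follows directly from the two cost expressions; a quick sanity check (the specific $L$, $D$, $N$ values are arbitrary examples):

```python
# Per-block scan cost: four full-res scans (VMamba) vs.
# one full-res plus three half-res scans (MSVMamba).
L, D, N = 56 * 56, 96, 16                 # example token count, channels, state dim
vmamba = 4 * L * D * N                    # O(4 L D N)
msvmamba = (L + 3 * (L // 4)) * D * N     # O(1.75 L D N)
ratio = vmamba / msvmamba
print(round(ratio, 2))                    # 4 / 1.75 ≈ 2.29, the quoted ~2.3x
```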

4. Implementation Details and Training Regimen

In the reference implementation (Shi et al., 2024), model scaling and training are as follows:

  • Model Sizes: e.g., MSVMamba-Tiny: 33M params, 4.6G FLOPs, $224\times224$ input images.
  • Regimen: ImageNet-1K, 300 epochs, AdamW (lr $=1\times10^{-3}$, cosine decay, weight decay 0.05), batch size 1024, advanced augmentations, label smoothing 0.1.
  • Inference: Full-res scan in one order; three half-res scans along remaining orders, center crop for classification.
  • Codebase: All models and detailed specs at https://github.com/YuHengsss/MSVMamba.
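The cosine decay named in the regimen can be sketched as a schedule function (the warmup length and floor learning rate here are assumptions, not values from the paper):

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0, warmup=0):
    """Cosine-decay learning-rate schedule with optional linear warmup.
    warmup and min_lr are hypothetical knobs for illustration."""
    if step < warmup:
        return base_lr * (step + 1) / warmup          # linear warmup ramp
    t = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```

With `base_lr=1e-3` the schedule starts at the peak rate quoted above and decays smoothly to the floor over training.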

5. Empirical Results: Performance Benchmarks

EfficientVMamba achieves accuracy competitive with or superior to prior Vision Mamba variants under comparable or reduced computational cost:

| Model | Params | FLOPs | ImageNet Top-1 |
|---|---|---|---|
| EffVMamba-T | 6M | 0.8G | 76.5% |
| EffVMamba-S | 11M | 1.3G | 78.7% |
| EffVMamba-B | 33M | 4.0G | 81.8% |
| MSVMamba-Tiny | 33M | 4.6G | 82.8% |

On COCO with Mask R-CNN (1× schedule, $1280\times800$ input):

  • EffVMamba-S: 39.3/36.7 (box/mask mAP, 31M params, 197G FLOPs)
  • MSVMamba-Tiny: 46.9/42.2 (53M params, 252G FLOPs)

On ADE20K UPerNet (semantic segmentation):

  • EffVMamba-S: 41.5/42.1 mIoU (29M params, 505G FLOPs)
  • MSVMamba-Tiny: 47.6/48.5 mIoU (65M params, 942G FLOPs)

MSVMamba-Tiny improves ImageNet top-1 by 0.6% versus VMamba-T for 17% fewer FLOPs, and similarly outperforms on COCO and ADE20K.

6. Extensions, Limitations, and Impact

EfficientVMamba demonstrates that multi-scale SSMs can supplant expensive multi-directional full-resolution scans without degrading receptive field or accuracy. The ConvFFN addition mitigates the channel mixing bottleneck present in pure SSM backbones. Remaining constraints involve further reducing constant factors in memory and FLOPs, optimizing hardware utilization, and adapting to dense prediction regimes requiring fine spatial detail.

The MSVMamba code and structural choices directly inform a series of follow-on work, including Fast Vision Mamba (Kapse et al., 1 Feb 2025), which applies spatial pooling and parallel scan acceleration; MambaScope (Liu et al., 29 Nov 2025), which introduces coarse-to-fine adaptive tokenization; and various hybrid CNN/SSM architectures for device-critical deployment.

EfficientVMamba sets a new Pareto frontier for throughput-accuracy tradeoffs in visual SSMs, provides the substrate for real-time and large-scale applications, and serves as a crucial reference for the design of next-generation Mamba-like vision models. Its linear-time global modeling paradigm and compositional multi-scale design are now foundational in visual SSM efficiency research (Shi et al., 2024).
