EfficientVMamba: Multi-Scale Visual SSMs

Updated 22 February 2026
  • EfficientVMamba is a visual backbone architecture that integrates state-space models with multi-scale selective scanning to reduce computational overhead while maintaining accuracy.
  • Its design employs one full-resolution scan with three half-resolution scans, reducing FLOPs by 2.3× compared to traditional VMamba architectures.
  • Incorporating a lightweight ConvFFN for enhanced cross-channel mixing, it achieves competitive performance on ImageNet, COCO, and ADE20K benchmarks.

EfficientVMamba refers to a class of visual backbone architectures that combine state space models (SSMs) with algorithmic and architectural innovations to maximize computational efficiency, especially for vision tasks where large spatial resolution and global context are required. The term is most precisely associated with the “Multi-Scale Vision Mamba” (MSVMamba) approach, which implements a multi-scale selective scan mechanism to reduce compute and memory footprint relative to earlier SSM-based visual backbones, while achieving state-of-the-art accuracy on standard vision benchmarks (Shi et al., 2024). Recent literature, including parallel lines such as “Fast Vision Mamba” (Kapse et al., 1 Feb 2025), further extends this efficiency frontier through hardware-friendly and algorithmic modifications. This entry covers the mathematical foundation, architectural design, complexity analysis, implementation details, and benchmark results that define the EfficientVMamba family within the SSM vision modeling paradigm.

1. Mathematical Foundation: SSMs and the Visual Mamba Block

EfficientVMamba builds upon the Mamba SSM paradigm. The continuous-time SSM evolves a hidden state $h(t)\in\mathbb{R}^N$ under an input drive $x(t)$,

$$\frac{dh}{dt} = A h(t) + B x(t), \qquad y(t) = C h(t) + D x(t)$$

with matrices $A$, $B$, $C$, $D$. Discrete-time evolution with step $\Delta$ yields

$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t + D x_t,$$

with $\bar{A} = \exp(\Delta A)$ and $\bar{B} \approx \Delta B$. Eliminating the state yields a convolution with kernel $K$:

$$y_{1:L} = x_{1:L} * K, \qquad K = [C\bar{B},\ C\bar{A}\bar{B},\ \dots,\ C\bar{A}^{L-1}\bar{B}] \in \mathbb{R}^{L}$$

The "S6" or "Mamba" block replaces these parameters with data-dependent versions $\bar{A}_i$, $\bar{B}_i$, $C_i$, preserving $O(L)$ computation.
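The equivalence between the recurrence and the convolutional form can be checked numerically. A minimal sketch, assuming a diagonal stable $A$, a single input/output channel, and arbitrary toy dimensions:

```python
import numpy as np

# Discretized SSM: h_t = Abar h_{t-1} + Bbar x_t,  y_t = C h_t + D x_t,
# versus the equivalent convolution with K[k] = C Abar^k Bbar.
rng = np.random.default_rng(0)
N, L = 4, 8                                # state dim, sequence length
A = -np.diag(rng.uniform(0.5, 1.5, N))     # stable diagonal dynamics
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
D = rng.standard_normal((1, 1))
dt = 0.1

Abar = np.diag(np.exp(np.diag(A) * dt))    # exp(dt*A) for diagonal A
Bbar = dt * B                              # first-order approximation, as in the text

x = rng.standard_normal(L)

# Recurrent evaluation
h = np.zeros((N, 1))
y_rec = np.zeros(L)
for t in range(L):
    h = Abar @ h + Bbar * x[t]
    y_rec[t] = (C @ h + D * x[t]).item()

# Convolutional evaluation with kernel K
K = np.array([(C @ np.linalg.matrix_power(Abar, k) @ Bbar).item() for k in range(L)])
y_conv = np.array([sum(K[k] * x[t - k] for k in range(t + 1)) + (D * x[t]).item()
                   for t in range(L)])

assert np.allclose(y_rec, y_conv)          # both paths produce identical outputs
```

The S6 block keeps the recurrent form (so $\bar{A}$, $\bar{B}$, $C$ can vary per token) while retaining the same linear-time cost.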

For images, VMamba (Liu et al., 2024) generalizes this to 2D by scanning a flattened feature map along four complementary spatial orders, computing one SSM per route, and aggregating outputs.
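The four complementary scan orders can be sketched with plain array operations (a single-channel toy illustration; the actual cross-scan operates on multi-channel feature maps):

```python
import numpy as np

# Four 1D token orders of an H x W map: row-major, column-major,
# and their reversals, as used by VMamba-style cross-scan.
H, W = 2, 3
Z = np.arange(H * W).reshape(H, W)   # toy feature map

routes = [
    Z.flatten(),              # row-major, forward
    Z.T.flatten(),            # column-major, forward
    Z.flatten()[::-1],        # row-major, backward
    Z.T.flatten()[::-1],      # column-major, backward
]
# Each route feeds its own SSM scan; outputs are mapped back to 2D and merged.
```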

2. Multi-Scale 2D Selective Scanning and Architectural Design

MSVMamba/“EfficientVMamba” (Shi et al., 2024) innovates by replacing redundant multi-scan operations in VMamba with an explicit multi-scale design. The core steps are:

  • Feature Branches: Start from the input $Z_1 \in \mathbb{R}^{H\times W\times D}$; construct a downsampled map $Z_2 = \mathrm{DWConv}_{3\times3,\,s}(Z_1)$ with stride $s=2$.
  • Selective Scanning: Define scan/order transforms $\sigma_k$ ($k=1,\dots,4$). Apply the S6 block to $\sigma_1(Z_1)$ (full resolution) and, with shared S6 weights, to three scan orders of $Z_2$ (half resolution).
  • Output Fusion: The outputs $Y_i$ are reshaped and interpolated, then summed:

$$Z' = \gamma_1(Y_1) + \mathrm{Interpolate}\bigl(\gamma_2(Y_2) + \gamma_3(Y_3) + \gamma_4(Y_4)\bigr)$$

This hierarchy, with just one full-res and three half-res scans, preserves long-range dependency learning while lowering complexity by not scanning all directions at full resolution.
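The steps above can be sketched at the shape level. Here `dwconv_stride2` and `s6_scan` are hypothetical stand-ins (the real branches use a learned depthwise convolution and selective scans over different orders); only the branch/fusion structure is illustrated:

```python
import numpy as np

def dwconv_stride2(Z):
    # placeholder for DWConv_{3x3, s=2}: stride-2 subsampling (assumption for brevity)
    return Z[::2, ::2]

def s6_scan(Z):
    # stand-in for a selective (S6) scan; identity here, only shapes matter
    return Z

H, W = 8, 8
Z1 = np.random.default_rng(0).standard_normal((H, W))
Z2 = dwconv_stride2(Z1)                      # half-resolution branch

Y1 = s6_scan(Z1)                             # one full-resolution scan
Y_low = sum(s6_scan(Z2) for _ in range(3))   # three half-resolution scans
up = np.repeat(np.repeat(Y_low, 2, axis=0), 2, axis=1)  # nearest-neighbor upsample
Z_out = Y1 + up                              # fused output, back at full resolution
assert Z_out.shape == (H, W)
```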

A lightweight Convolutional Feed-Forward Network (ConvFFN) addresses the deficiency of token-only mixing in SSMs, adding cross-channel mixing:

$$\mathrm{ConvFFN}(X) = W_2\,\phi(\mathrm{DWConv}_{3\times3}(\phi(W_1 X)))$$

where $W_1$, $W_2$ are $1\times1$ expansion and projection convolutions, and $\phi$ is GELU.
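A minimal sketch of this formula in plain numpy, assuming a 4x channel expansion and an all-ones depthwise kernel as a stand-in for the learned $3\times3$ weights (both assumptions, not from the paper):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def conv_ffn(X, W1, W2):
    """ConvFFN(X) = W2 phi(DWConv3x3(phi(W1 X))).
    X: (H, W, D); W1: (D, 4D) expansion; W2: (4D, D) projection."""
    H, Wd, _ = X.shape
    Y = gelu(X @ W1)                         # 1x1 expansion (matmul over channels)
    # depthwise 3x3 with zero padding; per-channel box filter as placeholder weights
    P = np.pad(Y, ((1, 1), (1, 1), (0, 0)))
    Y = sum(P[i:i + H, j:j + Wd] for i in range(3) for j in range(3)) / 9.0
    return gelu(Y) @ W2                      # 1x1 projection back to D channels

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4, 8))
W1 = rng.standard_normal((8, 32))
W2 = rng.standard_normal((32, 8))
out = conv_ffn(X, W1, W2)
assert out.shape == X.shape                  # spatial and channel shape preserved
```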

3. Computational Complexity and Efficiency Gains

The complexity analysis crucially differentiates EfficientVMamba from its predecessors:

  • Single-scale VMamba (Liu et al., 2024): four full-resolution scans incur $O(4LDN)$ cost per block, with $L = HW$ (spatial tokens), $D$ (channels), and $N$ (state dimension).
  • Multi-scale MSVMamba (Shi et al., 2024): one full-resolution scan plus three scans on an $s=2$ downsampled map; the cost per block becomes:

$$O\bigl(DN[L + 3L/4]\bigr) = O(1.75\,LDN)$$

which represents a $2.3\times$ reduction in scan FLOPs over the original. Kernel memory requirements drop likewise.

  • ConvFFN overhead is negligible compared to total block cost.
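The quoted factor follows directly from the two cost expressions; a quick sanity check (the specific $L$, $D$, $N$ values are arbitrary examples):

```python
# Per-block scan cost: four full-res scans (VMamba) vs.
# one full-res plus three half-res scans (MSVMamba).
L, D, N = 56 * 56, 96, 16                 # example token count, channels, state dim
vmamba = 4 * L * D * N                    # O(4 L D N)
msvmamba = (L + 3 * (L // 4)) * D * N     # O(1.75 L D N)
ratio = vmamba / msvmamba
print(round(ratio, 2))                    # 4 / 1.75 ≈ 2.29, the quoted ~2.3x
```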

4. Implementation Details and Training Regimen

In the reference implementation (Shi et al., 2024), model scaling and training are as follows:

  • Model Sizes: e.g., MSVMamba-Tiny: 33M params, 4.6G FLOPs, $224\times224$ input images.
  • Regimen: ImageNet-1K, 300 epochs, AdamW (lr $=1\times10^{-3}$, cosine decay, weight decay 0.05), batch size 1024, advanced augmentations, label smoothing 0.1.
  • Inference: Full-res scan in one order; three half-res scans along remaining orders, center crop for classification.
  • Codebase: All models and detailed specs at https://github.com/YuHengsss/MSVMamba.
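The cosine decay named in the regimen can be sketched as a schedule function (the warmup length and floor learning rate here are assumptions, not values from the paper):

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0, warmup=0):
    """Cosine-decay learning-rate schedule with optional linear warmup.
    warmup and min_lr are hypothetical knobs for illustration."""
    if step < warmup:
        return base_lr * (step + 1) / warmup          # linear warmup ramp
    t = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```

With `base_lr=1e-3` the schedule starts at the peak rate quoted above and decays smoothly to the floor over training.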

5. Empirical Results: Performance Benchmarks

EfficientVMamba achieves accuracy competitive with or superior to prior Vision Mamba variants under comparable or reduced computational cost:

| Model | Params | FLOPs | ImageNet Top-1 |
|---|---|---|---|
| EffVMamba-T | 6M | 0.8G | 76.5% |
| EffVMamba-S | 11M | 1.3G | 78.7% |
| EffVMamba-B | 33M | 4.0G | 81.8% |
| MSVMamba-Tiny | 33M | 4.6G | 82.8% |

On COCO with Mask R-CNN (1× schedule, $1280\times800$ input):

  • EffVMamba-S: 39.3/36.7 (box/mask mAP, 31M params, 197G FLOPs)
  • MSVMamba-Tiny: 46.9/42.2 (53M params, 252G FLOPs)

On ADE20K UPerNet (semantic segmentation):

  • EffVMamba-S: 41.5/42.1 mIoU (29M params, 505G FLOPs)
  • MSVMamba-Tiny: 47.6/48.5 mIoU (65M params, 942G FLOPs)

MSVMamba-Tiny improves ImageNet top-1 by 0.6% versus VMamba-T for 17% fewer FLOPs, and similarly outperforms on COCO and ADE20K.

6. Extensions, Limitations, and Impact

EfficientVMamba demonstrates that multi-scale SSMs can supplant expensive multi-directional full-resolution scans without degrading receptive field or accuracy. The ConvFFN addition mitigates the channel mixing bottleneck present in pure SSM backbones. Remaining constraints involve further reducing constant factors in memory and FLOPs, optimizing hardware utilization, and adapting to dense prediction regimes requiring fine spatial detail.

The MSVMamba code and structural choices directly inform a series of follow-on work, including Fast Vision Mamba (Kapse et al., 1 Feb 2025), which applies spatial pooling and parallel scan acceleration; MambaScope (Liu et al., 29 Nov 2025), which introduces coarse-to-fine adaptive tokenization; and various hybrid CNN/SSM architectures for device-critical deployment.

EfficientVMamba sets a new Pareto frontier for throughput-accuracy tradeoffs in visual SSMs, provides the substrate for real-time and large-scale applications, and serves as a crucial reference for the design of next-generation Mamba-like vision models. Its linear-time global modeling paradigm and compositional multi-scale design are now foundational in visual SSM efficiency research (Shi et al., 2024).
