Vision Mamba: Efficient Visual Backbone
- Vision Mamba is a visual backbone architecture based on structured state space models (SSMs) that leverages input-dependent selective scanning to capture long-range dependencies efficiently.
- It combines vision-friendly mixer blocks with self-attention in a hybrid token-mixing scheme, handling spatial and multimodal data in tasks such as classification, detection, and segmentation.
- Empirical results indicate that Vision Mamba achieves a new Pareto frontier in accuracy versus throughput, outperforming conventional transformers and convolutional networks.
The Vision Mamba family is a class of visual backbone architectures rooted in Structured State Space Models (SSMs), specifically optimized for image, video, and multidimensional data by leveraging linear-time “selective scanning” to capture long-range dependencies. Unlike Transformers with quadratic-complexity attention or convolutional networks with limited receptive fields, Vision Mamba architectures use token-dependent SSMs and hardware-aware scan algorithms to deliver a new efficiency frontier in computer vision tasks including classification, detection, segmentation, and multimodal fusion.
1. Mathematical Foundations: Structured State Space and Selective Scan
At the core of Vision Mamba is the continuous-time linear SSM
$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),$$
where $A$, $B$, and $C$ are learnable parameters, $h(t)$ is the hidden state, and $x(t)$ the input.
Discretization by zero-order hold with step size $\Delta$ yields
$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\big(\exp(\Delta A) - I\big)\,\Delta B, \qquad h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \quad y_t = C\,h_t.$$
This discrete recurrence is equivalent to a 1D convolution,
$$y = x * \bar{K}, \qquad \bar{K} = \big(C\bar{B},\; C\bar{A}\bar{B},\; \dots,\; C\bar{A}^{L-1}\bar{B}\big).$$
Mamba’s innovation is the “selective scan”: the static parameters $(B, C, \Delta)$ are replaced with input-dependent selectors (small neural networks), yielding dynamic filtering while retaining strictly linear runtime in the sequence length $L$.
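As a concrete illustration of the recurrence above, the following minimal NumPy sketch applies zero-order-hold discretization and runs a naive (non-hardware-aware) selective scan over a 1D token sequence. The shapes, projections, and the softplus step-size parameterization are illustrative assumptions, not the reference Mamba kernel.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan(x, A, W_B, W_C, w_delta):
    """Naive selective scan over a 1D token sequence (illustrative only).

    x       : [L, D] input sequence (L tokens, D channels)
    A       : [D, N] diagonal state matrix per channel (negative entries)
    W_B,W_C : [D, N] projections producing input-dependent B_t, C_t
    w_delta : [D]    projection producing the per-channel step size
    Returns y : [L, D]. Cost is O(L * D * N) -- linear in sequence length L.
    """
    L, D = x.shape
    h = np.zeros_like(A)                                 # hidden state, [D, N]
    y = np.zeros((L, D))
    for t in range(L):                                   # the sequential "scan"
        delta = softplus(x[t] * w_delta)[:, None]        # [D, 1] input-dependent step
        B_t = x[t] @ W_B                                 # [N]   input-dependent B
        C_t = x[t] @ W_C                                 # [N]   input-dependent C
        A_bar = np.exp(delta * A)                        # zero-order hold, [D, N]
        B_bar = (A_bar - 1.0) / (delta * A) * delta * B_t
        h = A_bar * h + B_bar * x[t][:, None]            # state update
        y[t] = h @ C_t                                   # per-channel readout
    return y

# Tiny usage example with random weights (purely illustrative).
rng = np.random.default_rng(0)
L, D, N = 16, 8, 4
x = rng.standard_normal((L, D))
A = -np.exp(rng.standard_normal((D, N)))                 # negative for stability
y = selective_scan(x, A,
                   0.1 * rng.standard_normal((D, N)),
                   0.1 * rng.standard_normal((D, N)),
                   0.1 * rng.standard_normal(D))
print(y.shape)                                           # (16, 8)
```

The explicit loop makes the linear cost in $L$ visible; production implementations fuse this recurrence into a parallel, hardware-aware scan kernel.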
2. Vision-Friendly Redesigns: Mixer Blocks and Hybrid Token Mixing
Direct application of Mamba to vision reveals inadequacies in 1D causal convolution for 2D spatial structure. The MambaVision architecture (Hatamizadeh et al., 10 Jul 2024) introduces the “Vision-friendly Mixer”:
- Replaces the causal 1D convolution with a regular (non-causal) depth-wise convolution for spatial symmetry.
- Adds a parallel non-SSM symmetric conv branch to recover global spatial content.
- Splits the representation into $C/2$ channels per branch, then concatenates the branch outputs and projects back to $C$ channels.
The mixer formalism:
$$X_1 = \mathrm{Scan}\big(\sigma(\mathrm{Conv}(\mathrm{Linear}(C, C/2)(X_{\mathrm{in}})))\big), \qquad X_2 = \sigma\big(\mathrm{Conv}(\mathrm{Linear}(C, C/2)(X_{\mathrm{in}}))\big),$$
$$X_{\mathrm{out}} = \mathrm{Linear}(C, C)\big(\mathrm{Concat}(X_1, X_2)\big),$$
where $\sigma$ is the activation and $\mathrm{Scan}$ denotes the selective-scan SSM.
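A schematic PyTorch rendering of this mixer is sketched below, under stated assumptions: the class name, kernel size, and the placeholder `scan_fn` (a simple cumulative average standing in for the real selective-scan kernel) are ours, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaVisionMixerSketch(nn.Module):
    """Schematic vision-friendly mixer: two C/2 branches, one passing through an
    SSM scan and one purely convolutional, concatenated and projected back to C."""
    def __init__(self, dim, d_conv=3, scan_fn=None):
        super().__init__()
        half = dim // 2
        self.in_proj_ssm = nn.Linear(dim, half)
        self.in_proj_conv = nn.Linear(dim, half)
        # Regular (non-causal) depth-wise 1D convolutions over the token axis.
        self.conv_ssm = nn.Conv1d(half, half, d_conv, padding=d_conv // 2, groups=half)
        self.conv_sym = nn.Conv1d(half, half, d_conv, padding=d_conv // 2, groups=half)
        self.out_proj = nn.Linear(dim, dim)
        # Placeholder "scan": a cumulative average along the token axis.
        self.scan_fn = scan_fn or (lambda z: torch.cumsum(z, dim=1)
                                   / torch.arange(1, z.shape[1] + 1,
                                                  device=z.device).view(1, -1, 1))

    def forward(self, x):                      # x: [B, L, C] flattened tokens
        z1 = F.silu(self.conv_ssm(self.in_proj_ssm(x).transpose(1, 2)).transpose(1, 2))
        z2 = F.silu(self.conv_sym(self.in_proj_conv(x).transpose(1, 2)).transpose(1, 2))
        z1 = self.scan_fn(z1)                  # SSM branch (placeholder scan)
        return self.out_proj(torch.cat([z1, z2], dim=-1))

# Usage: tokens from a 14x14 feature map with 192 channels.
x = torch.randn(2, 196, 192)
print(MambaVisionMixerSketch(192)(x).shape)   # torch.Size([2, 196, 192])
```

In practice, `scan_fn` is replaced by the selective-scan SSM kernel (a naive version is sketched in Section 1).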
Standard vision ablations confirm that the regular (non-causal) convolution, the added symmetric branch, and the concatenate-and-project output each yield substantial accuracy and detection/segmentation AP improvements over the causal-only baseline.
3. Architectural Design: Hierarchical, Hybrid, and Efficient
MambaVision presents a canonical 4-stage hierarchy:
- Stem: two stride-2 convs with BN+GELU; output: $H/4 \times W/4 \times C$
- Stages 1–2: Pure residual CNN blocks
- Stages 3–4: each has $N$ layers; the first $N/2$ use MambaVision Mixer + MLP, the last $N/2$ use Transformer-style self-attention + MLP
Layer update (with the token mixer being either the MambaVision Mixer or self-attention, depending on layer position):
$$\hat{x} = x + \mathrm{Mixer}(\mathrm{Norm}(x)), \qquad x_{\mathrm{out}} = \hat{x} + \mathrm{MLP}(\mathrm{Norm}(\hat{x})).$$
Channel / resolution schedule for the “B” variant: channels double at each stage ($C \to 2C \to 4C \to 8C$) while the spatial resolution halves, from $H/4 \times W/4$ at stage 1 to $H/32 \times W/32$ at stage 4.
Self-attention is introduced only in the final $N/2$ layers per stage, as shown by ablation (“best to put self-attention blocks in the final half”). Multi-head self-attention formula:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_h}}\right)V,$$
computed per head, then concatenated and projected.
At high resolution, windowed or shifted self-attention mitigates quadratic cost.
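The sketch below assembles such a hybrid stage, reusing `MambaVisionMixerSketch` from the earlier sketch. The block structure follows the layer update above, while the depth, width, and head count are illustrative assumptions (windowed attention at high resolution is omitted for brevity).

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Generic residual block: x + Mixer(Norm(x)), then x + MLP(Norm(x))."""
    def __init__(self, dim, token_mixer, mlp_ratio=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mixer = token_mixer
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))
        return x + self.mlp(self.norm2(x))

class SelfAttentionMixer(nn.Module):
    """Thin wrapper so multi-head self-attention fits the Block interface."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        return self.attn(x, x, x, need_weights=False)[0]

def hybrid_stage(dim, depth):
    """First half of the blocks use the SSM mixer, the final half use attention."""
    blocks = []
    for i in range(depth):
        mixer = (MambaVisionMixerSketch(dim) if i < depth // 2
                 else SelfAttentionMixer(dim))
        blocks.append(Block(dim, mixer))
    return nn.Sequential(*blocks)

stage3 = hybrid_stage(dim=384, depth=8)        # illustrative depth and width
tokens = torch.randn(2, 196, 384)              # 14x14 feature map, flattened
print(stage3(tokens).shape)                    # torch.Size([2, 196, 384])
```

Keeping attention in the final half of each stage mirrors the ablation finding reported in Section 6.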
4. Training Methodology and Optimization Schedules
MambaVision employs a large-scale training recipe:
- ImageNet-1K: 300 epochs, cosine LR decay (20 warmup epochs, 20 cooldown epochs), LAMB optimizer (batch 4096, LR 0.005, weight decay 0.05), a standard augmentation suite; hardware: 32×A100.
- COCO: Mask R-CNN/Cascade Mask R-CNN, 3× LR schedule, LR 1e-4, batch 16, weight decay 0.05 (8×A100).
- ADE20K: UPerNet head, AdamW, LR 6e-5, batch 16 (8×A100).
This recipe is crucial for achieving both high throughput and state-of-the-art accuracy.
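For reference, the ImageNet-1K recipe above can be summarized as a plain config dictionary; the key names are ours, only the values restate the schedule, optimizer, and batch settings given above.

```python
# Illustrative summary of the ImageNet-1K recipe; key names are assumptions,
# values are taken from the recipe described in this section.
imagenet_1k_recipe = {
    "epochs": 300,
    "optimizer": "LAMB",
    "base_lr": 5e-3,
    "weight_decay": 0.05,
    "batch_size": 4096,
    "lr_schedule": "cosine",
    "warmup_epochs": 20,
    "cooldown_epochs": 20,
    "input_size": (224, 224),
    "hardware": "32x A100",
}
```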
5. Empirical Performance on Benchmarks
Across classification and dense vision tasks, MambaVision surpasses comparably-sized ViT, Swin, ConvNeXt, and VMamba backbones.
ImageNet-1K Classification (224×224 crops)
| Model | Params (M) | FLOPs (G) | Throughput (img/s) | Top-1 (%) |
|---|---|---|---|---|
| ConvNeXt-B | 88.6 | 15.4 | 1485 | 83.8 |
| Swin-B | 88.0 | 15.4 | 1245 | 83.5 |
| VMamba-B | 89.0 | 15.4 | 645 | 83.9 |
| MambaVision-B | 97.7 | 15.0 | 3670 | 84.2 |
- MambaVision-S (50.1M, 7.5G): 83.3% @ 4700 img/s
- MambaVision-T (31.8M, 4.4G): 82.3% @ 6298 img/s
COCO Detection / Instance Segmentation (Cascade Mask R-CNN, 3× schedule)
| Backbone | Box AP | Mask AP |
|---|---|---|
| Swin-T | 50.4 | 43.7 |
| ConvNeXt-T | 50.4 | 43.7 |
| MambaVision-T/S | 51.0–52.8 | 44.3–45.7 |
ADE20K Semantic Segmentation (UPerNet)
| Backbone | Params | FLOPs | mIoU (%) |
|---|---|---|---|
| Swin-T | 60M | 945G | 44.5 |
| MambaVision-T | 55M | 945G | 46.6 |
| Swin-S | 81M | 1038G | 47.6 |
| MambaVision-S | 84M | 1135G | 48.2 |
| Swin-B | 121M | 1188G | 48.1 |
| MambaVision-B | 126M | 1342G | 49.1 |
Notably, MambaVision achieves a new Pareto frontier in accuracy vs. throughput across tasks and scales.
6. Ablation Studies: Token Mixer and Hybrid Block Placement
Key ablations on token mixer design (the mixer formalism above vs. a causal-only baseline):
- Adding the symmetric conv branch and the concatenation step boosts Top-1, AP, and mIoU by 1.8–2.4 points.
- Gating (instead of concat) is inferior.
- Replacing causal with regular conv alone is modestly helpful.
Hybrid pattern (placement of self-attention within the block sequence):
- Random mixer/attention: 81.3%
- Last attention: 82.3%
- The best strategy is to place self-attention blocks in the latter half of the block sequence, rather than at the front or alternating with mixer blocks.
7. Architectural and Scaling Implications
MambaVision illustrates the efficacy of:
- Vision-tailored selective SSM mixers for high throughput,
- Hierarchical stacking (CNN→SSM+Transformer stages) for architectural depth,
- Hybrid blocks (Mamba mixer + multi-head attention) preserving both local and global spatial modeling,
- Hardware-aware, scalable recipes for practical deployment across classification, detection, and segmentation.
This hybridization sets a high-water mark for linear-complexity visual modeling, offering tunable trade-offs between throughput and accuracy across application domains. Direct integration of selective scan Mamba operators with classic Transformer blocks leverages both the efficient global mixing of SSMs and the fine-grained context sensitivity of attention.
A plausible implication is that future state-space vision architectures will pursue increasingly fine-grained mixing between hardware-optimized SSM blocks and windowed or localized attention, adapting block types and placements dynamically to task, resolution, and computational budget.