
Vision Mamba: Efficient Visual Backbone

Updated 17 November 2025
  • The Vision Mamba model is a visual backbone architecture based on structured state space models that leverages input-dependent selective scanning to capture long-range dependencies efficiently.
  • It integrates vision-friendly mixer blocks and hybrid token mixing with self-attention to effectively handle spatial and multimodal data in tasks like classification, detection, and segmentation.
  • Empirical results indicate that Vision Mamba achieves a new Pareto frontier in accuracy versus throughput, outperforming conventional transformers and convolutional networks.

The Vision Mamba family encompasses a class of visual backbone architectures rooted in Structured State Space Models (SSMs), specifically optimized for image, video, and multidimensional data by leveraging linear-time “selective scanning” to capture long-range dependencies. Unlike Transformers with quadratic-complexity attention or convolutional networks with limited receptive field, Vision Mamba architectures use token-dependent SSMs and hardware-aware scan algorithms to deliver a new efficiency frontier in computer vision tasks including classification, detection, segmentation, and multimodal fusion.

1. Mathematical Foundations: Structured State Space and Selective Scan

At the core of Vision Mamba is the continuous-time linear SSM:

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$$

where $A \in \mathbb{R}^{M \times M}$, $B \in \mathbb{R}^{1 \times M}$, and $C \in \mathbb{R}^{1 \times M}$ are learnable parameters, $h(t) \in \mathbb{R}^{M}$ is the hidden state, and $x(t)$ is the input.

Discretization by zero-order hold yields:

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B, \qquad \bar{C} = C$$

$$h[t] = \bar{A}\,h[t-1] + \bar{B}\,x[t], \qquad y[t] = \bar{C}\,h[t]$$

This discrete recurrence is equivalent to a 1D convolution:

$$\overline{K} = \left[C\bar{B},\; C\bar{A}\bar{B},\; \dots,\; C\bar{A}^{T-1}\bar{B}\right], \qquad y = x * \overline{K}$$
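A minimal NumPy sketch can check this equivalence numerically. The shapes and values below are illustrative only: $A$ is taken diagonal so $\exp(\Delta A)$ is elementwise, and $B$ is treated as a column vector so the recurrence type-checks; this is not MambaVision's actual code.

```python
# Verify that the ZOH recurrence and the convolution with kernel K match.
import numpy as np

M, T, dt = 4, 16, 0.1                          # state size, sequence length, step Δ
rng = np.random.default_rng(0)
a = -rng.uniform(0.5, 1.5, M)                  # negative diagonal of A (stable)
A = np.diag(a)
B = rng.standard_normal((M, 1))
C = rng.standard_normal((1, M))
x = rng.standard_normal(T)                     # scalar input sequence

# Zero-order hold: A_bar = exp(ΔA), B_bar = (ΔA)^{-1}(exp(ΔA) - I) ΔB
A_bar = np.diag(np.exp(dt * a))
B_bar = np.linalg.inv(dt * A) @ (A_bar - np.eye(M)) @ (dt * B)

# Recurrence: h[t] = A_bar h[t-1] + B_bar x[t],  y[t] = C h[t]
h, y_rec = np.zeros((M, 1)), []
for t in range(T):
    h = A_bar @ h + B_bar * x[t]
    y_rec.append((C @ h).item())

# Convolution view: K = [C B_bar, C A_bar B_bar, ..., C A_bar^{T-1} B_bar]
K = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item() for k in range(T)])
y_conv = [sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(T)]

print(np.allclose(y_rec, y_conv))              # True: both views give the same output
```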

Mamba’s innovation is the “selective scan”, replacing static $B$, $C$, $\Delta$ with input-dependent selectors (small neural networks), yielding dynamic filtering and strictly linear runtime in the sequence length $T$.
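The following hedged NumPy sketch illustrates that idea: $\Delta_t$, $B_t$, and $C_t$ are produced per token by small projections before an otherwise ordinary linear-time scan. The diagonal-$A$ simplification, projection shapes, and random weights are illustrative stand-ins, not the papers' implementation.

```python
# Selective scan: Δ, B, C are functions of the input token instead of fixed parameters.
import numpy as np

def selective_scan(x, A, W_delta, W_B, W_C):
    """Linear-time scan over T tokens with input-dependent Δ_t, B_t, C_t.
    x: (T, D) tokens; h: (D, M) keeps one M-dim state per channel."""
    T, D = x.shape
    M = A.shape[0]
    h = np.zeros((D, M))
    y = np.zeros((T, D))
    for t in range(T):
        delta = np.log1p(np.exp(x[t] @ W_delta))   # (D,) softplus keeps Δ_t > 0
        B_t, C_t = x[t] @ W_B, x[t] @ W_C          # (M,), (M,) token-dependent
        A_bar = np.exp(delta[:, None] * A)         # (D, M) ZOH for diagonal A
        B_bar = (A_bar - 1.0) / A * B_t            # (D, M) ZOH input matrix
        h = A_bar * h + B_bar * x[t][:, None]      # update every channel's state
        y[t] = h @ C_t                             # (D,) readout
    return y

# Toy usage with random stand-in weights
rng = np.random.default_rng(0)
T, D, M = 32, 16, 8
x = rng.standard_normal((T, D))
A = -np.exp(rng.standard_normal(M))               # negative diagonal for stability
y = selective_scan(x, A,
                   rng.standard_normal((D, D)) * 0.1,
                   rng.standard_normal((D, M)) * 0.1,
                   rng.standard_normal((D, M)) * 0.1)
print(y.shape)                                    # (32, 16)
```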

2. Vision-Friendly Redesigns: Mixer Blocks and Hybrid Token Mixing

Direct application of Mamba to vision reveals the inadequacy of 1D causal convolution for 2D spatial structure. The MambaVision architecture (Hatamizadeh et al., 10 Jul 2024) introduces the “Vision-friendly Mixer”:

  1. Replaces causal conv with bidirectional depth-wise conv for spatial symmetry.
  2. Adds a parallel non-SSM symmetric conv branch to recover global spatial content.
  3. Splits outputs into $C/2$ channels per branch, then concatenates and projects back to $C$.

The mixer formalism:

$$\begin{aligned} X_1 &= \mathrm{Scan}\left(\mathrm{SiLU}\left(\mathrm{Conv}_{1d}\left(\mathrm{Lin}_{C \to C/2}(X_{\mathrm{in}})\right)\right)\right) \\ X_2 &= \mathrm{SiLU}\left(\mathrm{Conv}_{1d}\left(\mathrm{Lin}_{C \to C/2}(X_{\mathrm{in}})\right)\right) \\ X_{\mathrm{out}} &= \mathrm{Lin}_{C/2 \to C}\left(\mathrm{Concat}[X_1, X_2]\right) \end{aligned} \tag{1}$$
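For concreteness, the following hedged PyTorch sketch mirrors Eq. (1). The `VisionFriendlyMixer` name, kernel width, and the cumulative-mean `scan` placeholder (standing in for the real hardware-aware selective-scan kernel) are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionFriendlyMixer(nn.Module):
    def __init__(self, dim: int, d_conv: int = 3):
        super().__init__()
        half = dim // 2
        self.in_proj1 = nn.Linear(dim, half)   # Lin_{C -> C/2} for the SSM branch
        self.in_proj2 = nn.Linear(dim, half)   # Lin_{C -> C/2} for the conv branch
        # Non-causal (symmetric) depth-wise convs over the token dimension
        self.conv1 = nn.Conv1d(half, half, d_conv, padding=d_conv // 2, groups=half)
        self.conv2 = nn.Conv1d(half, half, d_conv, padding=d_conv // 2, groups=half)
        self.out_proj = nn.Linear(dim, dim)    # project Concat[X1, X2] back to C

    def scan(self, x: torch.Tensor) -> torch.Tensor:
        # Placeholder for the selective SSM scan over the token axis; a real
        # implementation would call the Mamba selective-scan kernel here.
        return torch.cumsum(x, dim=1) / torch.arange(1, x.size(1) + 1, device=x.device)[:, None]

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (B, N, C) tokens
        b1 = F.silu(self.conv1(self.in_proj1(x).transpose(1, 2)).transpose(1, 2))
        b2 = F.silu(self.conv2(self.in_proj2(x).transpose(1, 2)).transpose(1, 2))
        x1 = self.scan(b1)                                     # SSM branch
        x2 = b2                                                # symmetric conv branch
        return self.out_proj(torch.cat([x1, x2], dim=-1))      # Concat -> Lin -> (B, N, C)

# Toy usage
tokens = torch.randn(2, 196, 128)                              # (batch, 14*14 tokens, C)
print(VisionFriendlyMixer(128)(tokens).shape)                  # torch.Size([2, 196, 128])
```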

Standard vision ablations empirically confirm that the regular (non-causal) convolution, the added symmetric branch, and the concatenation-plus-projection each yield substantial improvements in accuracy and detection/segmentation AP over the causal-only baseline.

3. Architectural Design: Hierarchical, Hybrid, and Efficient

MambaVision presents a canonical 4-stage hierarchy:

  • Stem: two $3\times3$ convs with stride 2, BN + GELU; output: $H/4 \times W/4 \times C_{\mathrm{stem}}$
  • Stages 1–2: Pure residual CNN blocks
  • Stages 3–4: each has $N$ layers; the first $N/2$ use the MambaVision Mixer + MLP, the last $N/2$ use Transformer-style self-attention + MLP

Layer update:

$$\hat{X}^n = \mathrm{Mixer}\left(\mathrm{Norm}(X^{n-1})\right) + X^{n-1}, \qquad X^n = \mathrm{MLP}\left(\mathrm{Norm}(\hat{X}^n)\right) + \hat{X}^n \tag{2}$$
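A minimal PyTorch sketch of Eq. (2) follows, assuming a generic `token_mixer` module (either the SSM mixer sketched above or a self-attention layer); the `MambaVisionBlock` name, LayerNorm choice, and MLP ratio are illustrative assumptions.

```python
import torch.nn as nn

class MambaVisionBlock(nn.Module):
    def __init__(self, dim: int, token_mixer: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mixer = token_mixer
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):
        x = self.mixer(self.norm1(x)) + x   # X_hat^n = Mixer(Norm(X^{n-1})) + X^{n-1}
        x = self.mlp(self.norm2(x)) + x     # X^n     = MLP(Norm(X_hat^n)) + X_hat^n
        return x

# e.g. MambaVisionBlock(320, VisionFriendlyMixer(320)) for the first N/2 layers of a stage,
# or MambaVisionBlock(320, some_attention_module) for the last N/2.
```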

Channel/resolution schedule for the “B” variant: channels $[64,\,128,\,320,\,512]$ at spatial resolutions $[56\times56,\,28\times28,\,14\times14,\,7\times7]$.

Self-attention is introduced only in the final $N/2$ layers of each stage, a choice confirmed by ablation (“best to put self-attention blocks in the final half”). Multi-head self-attention uses

$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d_{\text{head}}}}\right)V \tag{3}$$

At high resolution, windowed or shifted self-attention mitigates quadratic cost.
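As a concrete illustration of that windowing, here is a hedged Swin-style sketch (not MambaVision's exact implementation): the `window_partition`/`window_reverse` helpers and toy shapes are illustrative, and `nn.MultiheadAttention` stands in for the attention block of Eq. (3). Restricting attention to non-overlapping $ws \times ws$ windows reduces the cost from $O((HW)^2)$ to $O(HW \cdot ws^2)$.

```python
import torch

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """(B, H, W, C) -> (B * num_windows, ws*ws, C) for windowed attention."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(win: torch.Tensor, ws: int, H: int, W: int) -> torch.Tensor:
    """Inverse of window_partition: back to (B, H, W, C)."""
    B = win.shape[0] // ((H // ws) * (W // ws))
    x = win.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

# Attention is then applied per window (toy shapes):
x = torch.randn(2, 56, 56, 64)                        # (batch, H, W, C)
windows = window_partition(x, ws=7)                   # (2 * 64 windows, 49 tokens, 64)
attn = torch.nn.MultiheadAttention(64, num_heads=4, batch_first=True)
out, _ = attn(windows, windows, windows)              # Eq. (3) within each window
x_out = window_reverse(out, ws=7, H=56, W=56)         # back to (2, 56, 56, 64)
```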

4. Training Methodology and Optimization Schedules

MambaVision employs large-scale training recipes (collected in the config sketch after this list):

  • ImageNet-1K: 300 epochs, cosine-decay LR (20 warmup + 20 cooldown epochs), LAMB optimizer (batch size 4096, LR 0.005, weight decay 0.05), standard augmentations; hardware: 32×A100.
  • COCO: Mask R-CNN / Cascade Mask R-CNN, 3× LR schedule, LR 1e-4, batch size 16, weight decay 0.05 (8×A100).
  • ADE20K: UPerNet head, AdamW, LR 6e-5, batch size 16 (8×A100).
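For reference, the reported hyperparameters can be collected into a plain config sketch; the dictionary keys below are illustrative and do not correspond to any particular training framework's schema.

```python
# Summary of the reported training recipes as plain Python dicts (illustrative keys only).
RECIPES = {
    "imagenet1k": {
        "epochs": 300, "optimizer": "LAMB", "batch_size": 4096,
        "lr": 5e-3, "weight_decay": 0.05,
        "lr_schedule": {"type": "cosine", "warmup_epochs": 20, "cooldown_epochs": 20},
        "hardware": "32x A100",
    },
    "coco_mask_rcnn": {
        "schedule": "3x", "lr": 1e-4, "batch_size": 16,
        "weight_decay": 0.05, "hardware": "8x A100",
    },
    "ade20k_upernet": {
        "optimizer": "AdamW", "lr": 6e-5, "batch_size": 16, "hardware": "8x A100",
    },
}
```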

This recipe is crucial for achieving both high throughput and state-of-the-art accuracy.

5. Empirical Performance on Benchmarks

Across classification and dense vision tasks, MambaVision surpasses comparably-sized ViT, Swin, ConvNeXt, and VMamba backbones.

ImageNet-1K Classification (224×224 crops)

Model          Params (M)  FLOPs (G)  Throughput (img/s)  Top-1 (%)
ConvNeXt-B     88.6        15.4       1485                83.8
Swin-B         88.0        15.4       1245                83.5
VMamba-B       89.0        15.4       645                 83.9
MambaVision-B  97.7        15.0       3670                84.2
  • MambaVision-S (50.1M, 7.5G): 83.3% @ 4700 img/s
  • MambaVision-T (31.8M, 4.4G): 82.3% @ 6298 img/s

COCO Detection / Instance Segmentation (Mask R-CNN 3×)

Backbone         Box AP     Mask AP
Swin-T           50.4       43.7
ConvNeXt-T       50.4       43.7
MambaVision-T/S  51.0–52.8  44.3–45.7

ADE20K Semantic Segmentation (UPerNet)

Backbone       Params (M)  FLOPs (G)  mIoU (%)
Swin-T         60          945        44.5
MambaVision-T  55          945        46.6
Swin-S         81          1038       47.6
MambaVision-S  84          1135       48.2
Swin-B         121         1188       48.1
MambaVision-B  126         1342       49.1

Notably, MambaVision achieves a new Pareto frontier in accuracy vs. throughput across tasks and scales.

6. Ablation Studies: Token Mixer and Hybrid Block Placement

Key ablations on token mixer design [Eq. (1) vs. causal-only]:

  • Adding the symmetric conv branch ($X_2$) and concatenating the two branches boosts Top-1, AP, and mIoU by 1.8–2.4 points.
  • Gating (instead of concatenation) is inferior.
  • Replacing the causal conv with a regular conv alone is modestly helpful.

Hybrid pattern (placement of self-attention within the block sequence):

  • Random mixer/attention placement: 81.3%
  • Attention in the last $N/2$ layers: 82.3%
  • The best strategy is to place self-attention blocks in the latter half of the block sequence, rather than at the front or alternating.

7. Architectural and Scaling Implications

MambaVision illustrates the efficacy of:

  • Vision-tailored selective SSM mixers for high throughput,
  • Hierarchical stacking (CNN→SSM+Transformer stages) for architectural depth,
  • Hybrid blocks (Mamba mixer + multi-head attention) preserving both local and global spatial modeling,
  • Hardware-aware, scalable recipes for practical deployment across classification, detection, and segmentation.

This hybridization sets a high-water mark for linear-complexity visual modeling, offering tunable trade-offs between throughput and accuracy across application domains. Direct integration of selective scan Mamba operators with classic Transformer blocks leverages both the efficient global mixing of SSMs and the fine-grained context sensitivity of attention.

A plausible implication is that future state-space vision architectures will pursue increasingly fine-grained mixing between hardware-optimized SSM blocks and windowed or localized attention, adapting block types and placements dynamically to task, resolution, and computational budget.
