
Vision Mamba Architectures

Updated 16 December 2025
  • Vision Mamba architectures are an emerging class of deep neural networks that replace quadratic self-attention with structured state-space models, enabling linear scalability and hardware-aware efficiency.
  • They integrate bidirectional and multi-scale scanning with hybrid modules to achieve robust global context modeling, efficient spatial resolution handling, and competitive performance across vision tasks.
  • Applications span remote sensing, 3D medical imaging, video perception, and multimodal fusion, demonstrating potential for state-of-the-art outcomes in high-resolution vision domains.

Vision Mamba architectures are a class of deep neural network models that replace the quadratic-complexity self-attention of transformers with linearly scalable, structured state-space models (SSMs) augmented by selective parameterization and hardware-aware design. These frameworks generalize the successful language modeling capabilities of SSM-based models (notably Mamba/S4) to vision, offering global context modeling, efficient scaling to high spatial or spatiotemporal resolution, and high throughput across diverse computer vision domains (Zhang et al., 24 Apr 2024, Liu et al., 7 May 2024, Li et al., 20 May 2025). Recent advancements include bidirectional and multi-scale scanning, hybrid SSM-convolutional/token-mixing modules, hierarchical and multi-stage backbones, and fusion with transformer blocks. The Vision Mamba family also encompasses configurational innovations for remote sensing, 3D medical imaging, video perception, and multimodal fusion.

1. Mathematical Foundations and Model Formulation

The mathematical core of Vision Mamba architectures is the structured state-space model (SSM), originally established in control theory and signal processing and adapted to the deep learning paradigm. The continuous-time SSM is formalized as

$$\frac{d}{dt}h(t) = A h(t) + B x(t), \qquad y(t) = C h(t) + D x(t),$$

where $h(t)$ is the hidden state, $x(t)$ the input, and $y(t)$ the output. For sequence modeling, discretization via zero-order hold yields

$$\overline{A} = \exp(\Delta A), \qquad \overline{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B,$$
$$h_t = \overline{A} h_{t-1} + \overline{B} x_t, \qquad y_t = C h_t.$$

Mamba generalizes this recurrence by making $B$, $C$, and $\Delta$ input-dependent via selection networks:

$$B_t = S_B(x_t), \quad C_t = S_C(x_t), \quad \Delta_t = \tau_\Delta(\Delta_0 + S_\Delta(x_t)),$$

with $\tau_\Delta$ typically a softplus. This permits per-token, per-batch, and per-channel dynamic adaptation, overcoming the rigidity of time-invariant SSMs (Liu et al., 7 May 2024, Zhang et al., 24 Apr 2024).
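As a concrete illustration, below is a minimal NumPy sketch of the discretized selective recurrence for a single channel with a diagonal state matrix; the projection vectors `W_B`, `W_C`, `W_delta` and all sizes are hypothetical placeholders, not the parameterization of any particular released model.

```python
import numpy as np

def selective_ssm_scan(x, A, W_B, W_C, W_delta, delta0):
    """Minimal selective SSM recurrence for one channel (illustrative only).

    x        : (L,) input sequence for a single channel
    A        : (N,) diagonal continuous-time state matrix (negative entries)
    W_B, W_C : (N,) projections producing the input-dependent B_t, C_t
    W_delta  : scalar projection producing the input-dependent step size
    delta0   : scalar bias for the step size
    """
    L, N = x.shape[0], A.shape[0]
    h = np.zeros(N)
    y = np.empty(L)
    for t in range(L):
        # Selection networks: B, C, Delta depend on the current input token.
        B_t = W_B * x[t]
        C_t = W_C * x[t]
        delta_t = np.log1p(np.exp(delta0 + W_delta * x[t]))  # softplus
        # Zero-order-hold discretization of the diagonal SSM.
        A_bar = np.exp(delta_t * A)
        B_bar = (A_bar - 1.0) / A * B_t   # diagonal case of (Delta A)^{-1}(exp(Delta A)-I) Delta B
        # Recurrence and readout.
        h = A_bar * h + B_bar * x[t]
        y[t] = C_t @ h
    return y

# Toy usage with random parameters.
rng = np.random.default_rng(0)
L, N = 16, 8
y = selective_ssm_scan(rng.normal(size=L), -np.abs(rng.normal(size=N)),
                       rng.normal(size=N), rng.normal(size=N), 0.1, 0.0)
```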

For time-invariant parameters, the discrete SSM is equivalent to a 1D convolution with kernel

$$K = \left[\, C\overline{B},\; C\overline{A}\,\overline{B},\; \ldots,\; C\overline{A}^{L-1}\overline{B} \,\right].$$

The output sequence is then $y = x * K$, ensuring a parallelizable $O(L)$ cost per layer for sequence length $L$.
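To make the recurrence–convolution equivalence concrete, the short NumPy check below builds the kernel $K$ for a fixed (non-selective) diagonal SSM and verifies that causal convolution with it reproduces the step-by-step recurrence; all sizes and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 32, 4
A_bar = rng.uniform(0.5, 0.95, size=N)   # discrete diagonal transition
B_bar = rng.normal(size=N)
C = rng.normal(size=N)
x = rng.normal(size=L)

# Kernel K = [C B_bar, C A_bar B_bar, ..., C A_bar^{L-1} B_bar]
K = np.array([C @ (A_bar ** k * B_bar) for k in range(L)])

# Causal convolution y_t = sum_{k<=t} K_k x_{t-k}
y_conv = np.array([sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(L)])

# Equivalent recurrence h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t
h = np.zeros(N)
y_rec = np.empty(L)
for t in range(L):
    h = A_bar * h + B_bar * x[t]
    y_rec[t] = C @ h

assert np.allclose(y_conv, y_rec)
```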

2. Scanning Strategies and Spatialization

Transitioning SSMs from 1D sequence modeling to vision necessitates specific scan strategies to accommodate multidimensional locality and global feature propagation.

  • Bi-directional and Multi-directional Scans: Bidirectional (forward and backward) selective scans mitigate the causality mismatch between SSMs and non-sequential 2D input, improving information integration (Zhang et al., 24 Apr 2024, Liu et al., 7 May 2024). VMamba and related architectures implement a four-directional 2D Selective Scan (SS2D), traversing feature grids in up/down/left/right or diagonal patterns (Liu et al., 18 Jan 2024); a minimal flattening sketch appears after this list.
  • Fractal/Space-Filling Scans: FractalMamba++ leverages Hilbert curve traversal to preserve spatial locality under 2D→1D flattening, combating long-range dependency fading and enabling robust scaling to very high resolutions. Cross-State Routing (CSR) supplements the scan order with nonlocal skip connections for enhanced global context (Li et al., 20 May 2025).
  • Patchification and Hierarchies: Vision Mamba encoders often use convolutional stems for local inductive bias, followed by hierarchical downsampling (patch merging) and multi-stage SSM blocks for efficient scaling and token mixing (Liu et al., 18 Jan 2024, Zhang et al., 24 Apr 2024). Multi-scale scanning is further refined in MSVMamba by combining full- and downsampled-resolution feature propagation to maximize receptive field coverage under limited parameters (Shi et al., 23 May 2024).
  • Scan Taxonomy: Scans are categorized by direction (uni- or bidirectional), axis (rows, columns, diagonals), continuity (raster, zigzag, fractal), and sampling (global, local window, atrous/dilated) (Xu et al., 29 Apr 2024, Liu et al., 7 May 2024). Across vision tasks, the optimal choice remains task- and domain-dependent.
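As a concrete example of multi-directional scanning, the sketch below flattens a 2D feature map into four 1D token orderings in the spirit of SS2D (row-major forward/backward and column-major forward/backward); the exact traversals used by VMamba may differ, so treat this as an illustrative assumption.

```python
import numpy as np

def cross_scan_2d(feat):
    """Flatten an (H, W, C) feature map into four 1D scan orders.

    Returns an array of shape (4, H*W, C): row-major forward, row-major
    backward, column-major forward, and column-major backward. Each order
    would be fed to its own selective-scan branch, and the four outputs
    merged (e.g. summed) after being mapped back to the original layout.
    """
    H, W, C = feat.shape
    row_fwd = feat.reshape(H * W, C)                     # left-to-right, top-to-bottom
    row_bwd = row_fwd[::-1]                              # reversed raster order
    col_fwd = feat.transpose(1, 0, 2).reshape(H * W, C)  # columns read top-to-bottom, left-to-right
    col_bwd = col_fwd[::-1]
    return np.stack([row_fwd, row_bwd, col_fwd, col_bwd])

# Toy usage on a random 8x8 feature map with 16 channels.
scans = cross_scan_2d(np.random.rand(8, 8, 16))
print(scans.shape)  # (4, 64, 16)
```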

3. Canonical Vision Mamba Modules

A standard Vision Mamba backbone consists of:

  • State-Space Core Block: The main token-mixer is a block that fuses an SSM core (for global mixing) with local convolutions or MLPs for spatial/channelwise mixing. The SSM core implements one or more parallel scan directions over the token sequence, each realized as an input-adaptive discrete SSM (Liu et al., 18 Jan 2024, Liu et al., 7 May 2024).
  • Feed-Forward and Gating Paths: Many architectures employ dual branches, e.g., (i) a depthwise convolution/MLP path with activation and (ii) a visual SSM (selective scan) path, whose outputs are multiplied and merged through linear projection and residual addition (Liu et al., 18 Jan 2024, Chen et al., 24 Jun 2024); a schematic sketch of such a block follows this list. EfficientViM reduces computational cost by bottlenecking the expensive channel mixing to a small compressed hidden state before reconstruction (Lee et al., 22 Nov 2024).
  • Normalization and Staging: LayerNorm/batch normalization ensure stability. Hierarchical design, with channel width doubling and spatial resolution halving at each stage, aligns with practices in hierarchical transformers and CNNs (Liu et al., 18 Jan 2024, Shi et al., 23 May 2024).
  • Selective Gating and Register Augmentation: Mamba-R introduces register tokens, interleaved and recycled for output aggregation, countering high-norm background artifacts and enhancing discriminative capacity (Wang et al., 23 May 2024).

4. Computational Complexity and Hardware Adaptation

Vision Mamba architectures achieve linear time and space complexity in token count, in contrast to the quadratic $O(N^2)$ cost of transformer self-attention. For a vision sequence of N tokens and channel dimension D, the per-block costs are roughly as follows (a worked comparison follows the list):

  • SSM/Mamba Block: $O(ND)$ for token mixing + $O(ND^2)$ for the MLP.
  • Self-Attention Block: $O(N^2D)$ for token mixing + $O(ND^2)$ for the MLP.
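As a rough back-of-the-envelope comparison (token-mixing terms only, with constants and SSM state size ignored), the snippet below evaluates these costs for a hypothetical 1024x1024 image patchified at stride 16 with an assumed channel width of 192.

```python
# Back-of-the-envelope token-mixing cost, ignoring constants and SSM state size.
H = W = 1024 // 16          # 64x64 patches for a 1024x1024 image at stride 16
N = H * W                   # 4096 tokens
D = 192                     # hypothetical channel width

ssm_mixing = N * D          # linear in token count
attn_mixing = N ** 2 * D    # quadratic in token count
mlp = N * D ** 2            # shared by both block types

print(f"tokens: {N}")
print(f"SSM mixing      : {ssm_mixing:.3e}")
print(f"attention mixing: {attn_mixing:.3e}")
print(f"MLP (both)      : {mlp:.3e}")
print(f"attention / SSM mixing ratio: {attn_mixing / ssm_mixing:.0f}x")
```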

Linear scaling enables practical training/inference on high-resolution and volumetric (3D) data, sustaining high throughput versus attention-based models (Liu et al., 18 Jan 2024, A et al., 9 Jun 2024, Dai et al., 19 Feb 2025, Lee et al., 22 Nov 2024).

Hardware-aware implementations exploit fused scan algorithms, efficient SRAM usage, and quantization. Mamba-X leverages systolic scan arrays and mixed-precision quantization to accelerate SSM recurrences and minimize memory traffic on edge devices, offering 2×–10× speedups and dramatic energy savings (Yoon et al., 5 Aug 2025). EfficientViM and MobileViM exploit channel/hidden-state compression and axis-wise factorization for further cost reduction (Lee et al., 22 Nov 2024, Dai et al., 19 Feb 2025).
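Fused scan kernels typically rely on the fact that the recurrence $h_t = \overline{A}_t h_{t-1} + \overline{B}_t x_t$ is associative and can therefore be evaluated with a parallel prefix scan. The NumPy sketch below checks the associative combine rule sequentially; an actual fused kernel would apply it in a blocked, tree-structured fashion on-chip, which this illustration does not attempt to reproduce.

```python
import numpy as np

def combine(e1, e2):
    """Associative combine for the affine recurrence h -> a*h + b.

    Applying e1 = (a1, b1) then e2 = (a2, b2) equals the single element
    (a1*a2, a2*b1 + b2), which is what a parallel prefix scan exploits.
    """
    a1, b1 = e1
    a2, b2 = e2
    return a1 * a2, a2 * b1 + b2

rng = np.random.default_rng(0)
L, N = 64, 8
A_bar = rng.uniform(0.5, 0.95, size=(L, N))   # per-step diagonal transitions
Bx = rng.normal(size=(L, N))                  # per-step inputs B_bar_t * x_t

# Reference: plain sequential recurrence.
h = np.zeros(N)
ref = []
for t in range(L):
    h = A_bar[t] * h + Bx[t]
    ref.append(h.copy())

# Same states via cumulative application of the associative combine.
acc = (np.ones(N), np.zeros(N))
scan = []
for t in range(L):
    acc = combine(acc, (A_bar[t], Bx[t]))
    scan.append(acc[1])                       # h_t, since h_0 = 0

assert np.allclose(ref, scan)
```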

5. Vision Mamba Variants and Hybrid Backbones

Vision Mamba encompasses a spectrum of architectural variants, each tailored for domain-specific requirements.

  • Generic Backbones: Vim (ViT-style, bidirectional SSM on 1D patch sequence), VMamba (hierarchical, 2D SS2D cross-scan, patch merging), LocalMamba (windowed scan + local attention), EfficientVMamba (dilated/atrous scans plus squeeze-excitation), PlainMamba (snake/zigzag non-hierarchical scan) (Zhang et al., 24 Apr 2024, Xu et al., 29 Apr 2024, Liu et al., 18 Jan 2024).
  • Multi-Scale and Multi-Stage Models: MSVMamba employs multi-scale scanning within each block (“hierarchy in hierarchy”), trading redundancy for parameter and FLOP efficiency, and introducing ConvFFN modules for enhanced channel mixing (Shi et al., 23 May 2024).
  • Hybrid Mamba-Transformer and Mamba-Convolution Models: MambaVision and HybridMH interleave or stack SSM token mixing with windowed/self-attention, leveraging the efficiency of Mamba at high resolution and the global context of transformers in later stages (Hatamizadeh et al., 10 Jul 2024, Liu et al., 1 Oct 2024); a stage-schedule sketch follows this list. Hybrid pretraining strategies (MAP) optimally exploit both local and sequential cues (Liu et al., 1 Oct 2024).
  • Cross-Modal and 3D Extensions: UAVD-Mamba fuses SSMs with deformable convolutional tokens for IR/RGB UAV detection (Li et al., 1 Jul 2025). MobileViM and 3D MRI classifiers deploy dimension-independent, axis-parallel SSM mixing for volumetric segmentation and classification (Dai et al., 19 Feb 2025, A et al., 9 Jun 2024).
  • Register-Augmented and Specialized Variants: Mamba-R inserts register tokens to suppress background artifacts and enhance scaling (Wang et al., 23 May 2024); see also numerous specialized variants in remote sensing and point cloud processing (Bao et al., 1 May 2025, Xu et al., 29 Apr 2024).
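To illustrate how such hybrid backbones are typically laid out, the configuration below sketches a hypothetical four-stage schedule in which the early, high-resolution stages use only SSM token mixers and the final low-resolution stage mixes in self-attention blocks. The block counts, widths, and late placement of attention are assumptions for illustration, not the published MambaVision configuration.

```python
# Hypothetical stage schedule for a hybrid Mamba-Transformer backbone.
# Early (high-resolution) stages rely on linear-cost SSM mixing; attention
# is reserved for the final stage, where the token count is smallest.
hybrid_stages = [
    {"stage": 1, "resolution": "H/4 x W/4",   "dim": 96,  "blocks": ["ssm"] * 2},
    {"stage": 2, "resolution": "H/8 x W/8",   "dim": 192, "blocks": ["ssm"] * 2},
    {"stage": 3, "resolution": "H/16 x W/16", "dim": 384, "blocks": ["ssm"] * 6},
    {"stage": 4, "resolution": "H/32 x W/32", "dim": 768, "blocks": ["ssm"] * 2 + ["attention"] * 2},
]

for s in hybrid_stages:
    n_attn = s["blocks"].count("attention")
    print(f"stage {s['stage']}: {s['resolution']}, dim {s['dim']}, "
          f"{len(s['blocks'])} blocks ({n_attn} attention)")
```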

6. Empirical Performance and Scaling Analyses

Vision Mamba backbones achieve accuracy and throughput competitive with or superior to leading CNN and transformer models across classification, detection, semantic segmentation, and specialized tasks.

7. Challenges, Comparative Analysis, and Future Directions

While Vision Mamba backbones solve the quadratic bottleneck of attention and enable hardware-efficient linear scaling, critical challenges remain:

  • Scan Mechanisms and Spatial Inductive Bias: Selecting or learning optimal scan orders for 2D/3D remains context-dependent, and suboptimal spatialization can harm locality or global context (Xu et al., 29 Apr 2024, Li et al., 20 May 2025).
  • Causality Mismatch and Non-Causal SSM Design: 1D (causal) recurrences do not natively match non-causal 2D/3D vision tasks; bidirectional scan, multi-branch, and skip-routing correct this heuristically, but a well-principled non-causal SSM formulation is open (Xu et al., 29 Apr 2024, Liu et al., 7 May 2024).
  • Redundancy and Parameter Efficiency: Multi-directional/multi-scale scanning introduces redundancy; modules such as MSVMamba and EfficientViM reduce this via hidden state compression and scan-sharing, but optimal trade-offs are still under investigation (Shi et al., 23 May 2024, Lee et al., 22 Nov 2024).
  • Interpretability, Stability, and Generalization: The black-box behavior of input-dependent SSMs complicates mechanistic understanding, while very deep SSM stacks can suffer from training instabilities. Enhanced frequency-domain modeling (e.g., EinFFT, SiMBA) and combined convolutional/attention modules are emerging remedies (Liu et al., 7 May 2024, Xu et al., 29 Apr 2024).
  • Hybridization and Multimodal Fusion: Research is rapidly advancing in hybrid backbones (Mamba-Transformer, Mamba-Convolution, Mamba-Multimodal), foundation pretraining regimes, and unified SSMs for large, multimodal vision-LLMs (Liu et al., 1 Oct 2024, Bao et al., 1 May 2025, Li et al., 1 Jul 2025).

In summary, Vision Mamba architectures establish a hardware-friendly, theoretically grounded, and empirically validated alternative to attention-based models for a broad range of vision tasks, supporting high-resolution and long-sequence domains previously intractable for transformers. Continued progress in scan-order discovery, non-causal SSM foundation, hybrid modeling, and hardware–algorithm co-design is essential for realizing the full potential of the Vision Mamba paradigm (Zhang et al., 24 Apr 2024, Liu et al., 7 May 2024, Xu et al., 29 Apr 2024, Yoon et al., 5 Aug 2025, Li et al., 20 May 2025, Lee et al., 22 Nov 2024).
