Mamba Block for Efficient Vision Modeling

Updated 1 January 2026
  • Mamba Block is a hardware-oriented state-space module that employs input-dependent selection and fused parallel scanning for efficient long-sequence modeling.
  • It integrates discretized state-space recurrence, per-step gating, and GPU optimizations to mitigate the quadratic complexity of traditional self-attention.
  • In vision tasks, Mamba Block achieves up to a 2× speedup over Transformer baselines, enabling real-time deployment on high-resolution data.

The Mamba Block is a hardware-oriented state space model (SSM) module featuring input-dependent selection mechanisms, fused parallel scanning, and linear-time complexity. As a foundational building block introduced by Gu & Dao (2023), and adapted in visual architectures by Zhu et al. (2024) and Liu et al. (2024), the Mamba Block enables efficient long-sequence modeling in computer vision, substantially reducing the quadratic compute bottleneck associated with self-attention. Its design integrates discretized state-space recurrence, per-step gating, and hardware-aware optimizations to allow real-time deployment at large sequence lengths, including flattened image patch streams and video frames. The following sections delineate its architecture, mathematical formulations, hardware optimizations, vision backbone variants, complexity benchmarks, and current research trajectories (Zhang et al., 2024).

1. Block Architecture and Data Flow

The Mamba Block consists of sequential modules arranged as follows:

  • Input feature $x_t \in \mathbb{R}^d$ is transformed by a gating linear projection:

$$u_t = \mathrm{SiLU}(W_u x_t + b_u)$$

  • The gated output $u_t$ feeds into a selective, discretized SSM layer that computes the hidden state:

$$h_t = \bar{A}_t h_{t-1} + \bar{B}_t u_t$$

  • Output projection through a second gate:

$$y_t = \mathrm{SiLU}(W_v h_t + b_v)$$

  • The block adds the output back to the original input via a residual connection, giving the block output $\tilde{x}_t$:

$$\tilde{x}_t = x_t + y_t$$

The architectural schematic supports a pure 1D flow for sequence modeling and can be extended to multi-dimensional variants by flattening images and applying the scan in different orders (Zhang et al., 2024). All interactions are in the projected dimension $\mathbb{R}^d$.
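A minimal sketch of this data flow, for a single sequence in NumPy, is shown below. The weight names $W_u, b_u, W_v, b_v$ follow the equations above, while `selective_ssm` is an assumed placeholder for the SSM layer formalized in Section 2 and is treated here as a simple sequence-to-sequence map in the model dimension; this is an illustration, not the reference implementation.

```python
# Minimal sketch of the Mamba Block data flow (single sequence, NumPy).
# `selective_ssm` is a stand-in for the selective SSM layer of Section 2,
# treated as an (L, d) -> (L, d) sequence map for simplicity.
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def mamba_block(x, W_u, b_u, W_v, b_v, selective_ssm):
    """x: (L, d) input features; W_u, W_v: (d, d); returns the (L, d) block output."""
    u = silu(x @ W_u.T + b_u)   # u_t = SiLU(W_u x_t + b_u), input gating projection
    s = selective_ssm(u)        # selective, discretized SSM recurrence over t
    y = silu(s @ W_v.T + b_v)   # second gated projection on the SSM output
    return x + y                # residual connection back to the block input
```

Stacking several such blocks, each with its own projections, yields the 1D backbone that the vision variants in Section 4 build upon.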

2. Selective State Space Model Formalism

The core recurrence is derived from the continuous-time SSM:

$$h'(t) = A h(t) + B x(t), \quad y(t) = C h(t)$$

Discretization using a zero-order hold with step size $\Delta$ yields:

$$\bar{A} = \exp(\Delta A), \quad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B$$

$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \quad y_t = C h_t$$

Input-dependent selection enriches the parameterization by allowing $B_t = S_B(x_{1:t})$, $C_t = S_C(x_{1:t})$, and $\Delta_t = \operatorname{softplus}(\Delta_0 + S_\Delta(x_{1:t}))$. These selection projections are typically lightweight and adapt the SSM dynamics as a function of the input history. The functional update within the block becomes:

$$\begin{aligned}
\bar{A}_t &= \exp(\Delta_t A) \\
\bar{B}_t &= (\Delta_t A)^{-1}\bigl(\exp(\Delta_t A) - I\bigr)\,\Delta_t B_t \\
h_t &= \bar{A}_t h_{t-1} + \bar{B}_t x_t \\
y_t &= C_t h_t
\end{aligned}$$

Surrounding the SSM are two SiLU-gated linear projections—first on input, second on output—to confer nonlinearity and further modulate memory flow (Zhang et al., 2024).
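As a concrete reference for these updates, the sketch below runs the selective recurrence sequentially in NumPy. For brevity it assumes a diagonal $A$, per-token (rather than history-wide) selection projections, and a single scalar step $\Delta_t$ per token; it deliberately ignores the fused-kernel machinery discussed in Section 3.

```python
# Sequential reference for the selective SSM recurrence (NumPy).
# Simplifying assumptions: diagonal A, per-token selection projections,
# and one scalar step Delta_t per token.
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_ssm_scan(x, A_diag, W_B, W_C, w_delta, delta0):
    """
    x:        (L, d) inputs to the SSM (the gated u_t of Section 1)
    A_diag:   (N,)   diagonal of A (negative entries give a stable recurrence)
    W_B, W_C: (N, d) selection projections S_B and S_C
    w_delta:  (d,)   selection projection S_Delta; delta0 is a learned bias
    Returns the SSM outputs y of shape (L, d).
    """
    L, d = x.shape
    N = A_diag.shape[0]
    h = np.zeros((d, N))                             # one N-dim state per channel
    y = np.zeros((L, d))
    for t in range(L):
        delta_t = softplus(delta0 + w_delta @ x[t])  # Delta_t > 0
        B_t = W_B @ x[t]                             # input-dependent B_t
        C_t = W_C @ x[t]                             # input-dependent C_t
        A_bar = np.exp(delta_t * A_diag)             # ZOH: exp(Delta_t A)
        B_bar = (A_bar - 1.0) / A_diag * B_t         # (Delta_t A)^{-1}(exp(Delta_t A)-I) Delta_t B_t, diagonal case
        h = A_bar * h + B_bar * x[t][:, None]        # h_t = A_bar_t h_{t-1} + B_bar_t x_t
        y[t] = h @ C_t                               # y_t = C_t h_t
    return y
```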

3. Hardware-Aware Optimizations and GPU Parallelism

The Mamba Block achieves competitive wall-clock performance through:

  • Kernel fusion: All gating, recurrence, and output projection steps are merged in a single GPU kernel, minimizing memory I/O and accelerating data movement.
  • Parallel prefix scan: The Mamba scan leverages a parallelized variant of the Blelloch scan to compute state updates in $O(L)$ time per block (where $L$ is the sequence length), with $O(\log L)$ parallel depth. Intermediate products are re-computed in the fused kernel, and states $h_t$ are laid out in coalesced memory (a simplified sketch of this associative structure appears at the end of this section).
  • Quantization and pruning: INT8 quantization of model parameters reduces inference time with negligible accuracy loss (<1%). Sparsity can be imposed on $A$ and $B$ for further footprint reduction (a generic quantization sketch follows this list).
  • Batch layout: Forward and backward passes operate on a single contiguous tensor of shape $[\text{batch}, L, N]$, suitable for large-scale vision tasks.
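The quantization bullet above refers to standard post-training INT8 schemes. The sketch below shows a generic symmetric per-tensor variant, given purely as an assumed illustration; it is not the specific recipe of any published Mamba deployment.

```python
# Generic symmetric per-tensor INT8 quantization sketch (NumPy).
import numpy as np

def quantize_int8(w):
    """Map a float tensor to INT8 codes plus a single scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover an approximate float tensor for use inside the fused kernel."""
    return q.astype(np.float32) * scale

# Round-trip error on a random weight matrix stays on the order of scale / 2.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(384, 384)).astype(np.float32)
W_hat = dequantize_int8(*quantize_int8(W))
print(np.abs(W - W_hat).max())
```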

These innovations yield up to a $2\times$ speedup over Transformer baselines in wall-clock GPU runtime (A100, $L = 2048$, batch size $32$) (Zhang et al., 2024).
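To illustrate why the recurrence parallelizes, note that each step $h_t = a_t h_{t-1} + b_t$ is an affine map and that composing affine maps is associative, so all prefixes can be computed by a parallel scan. The sketch below uses a simple Hillis-Steele doubling scan with scalar coefficients as an assumed stand-in for the work-efficient Blelloch formulation, and omits all kernel-fusion and memory-layout details.

```python
# Scalar sketch of the associative structure behind the parallel scan (NumPy).
import numpy as np

def combine(first, second):
    """Compose two affine steps h -> a*h + b: apply `first`, then `second`."""
    a1, b1 = first
    a2, b2 = second
    return a2 * a1, a2 * b1 + b2

def prefix_scan(a, b):
    """Inclusive doubling scan over the affine steps (a_t, b_t); O(log L) sweeps."""
    a = np.array(a, dtype=float)
    b = np.array(b, dtype=float)
    L = len(a)
    shift = 1
    while shift < L:
        a_prev, b_prev = a[:-shift].copy(), b[:-shift].copy()
        a[shift:], b[shift:] = combine((a_prev, b_prev), (a[shift:], b[shift:]))
        shift *= 2
    return b   # with h_0 = 0, the accumulated offset b_t equals h_t

# Check against the sequential recurrence h_t = a_t h_{t-1} + b_t with h_0 = 0.
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 0.99, size=16)
b = rng.normal(size=16)
h, ref = 0.0, []
for t in range(16):
    h = a[t] * h + b[t]
    ref.append(h)
assert np.allclose(prefix_scan(a, b), ref)
```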

4. Vision-Specific Extensions: Convolution, Recurrence, Attention

Modern vision backbones rarely use stand-alone Mamba Blocks; instead, they incorporate additional modules to model local and multi-dimensional structure:

  • Local convolution: 1D/2D depth-wise convolutions may precede the gating layer to encode local spatial features.
  • Bi-directional scanning (e.g. ViM block): Pairs of scans in opposite directions across the feature sequence are computed and fused via additional gating layers.
  • Cross-scan (e.g. VSS block): Scans are run over multiple traversal orders (row, column, etc.), and merged for enhanced spatial coverage.
  • Hybrid blocks (e.g. MMA block): Mamba streams are run in parallel with attention-based modules (e.g. channel-attention), with outputs merged prior to nonlinearity and output projection.

All variants maintain the SSM core but modify the input projection, scan directionality, and stream merging to improve spatial, channel, and recurrence modeling depth in vision tasks (Zhang et al., 2024). A toy sketch of the bi-directional and cross-scan traversals follows.
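The traversal patterns can be made concrete with a few lines wrapped around any 1D scan function (for instance the `selective_ssm_scan` sketched in Section 2). The additive fusion below is an assumed simplification of the learned gating and merge layers used by actual ViM- and VSS-style blocks.

```python
# Toy bi-directional and cross-scan wrappers around a generic 1D scan (NumPy).
import numpy as np

def bidirectional_scan(u, scan_fn):
    """u: (L, d) flattened patch features; scan_fn maps (L, d) -> (L, d)."""
    forward = scan_fn(u)                  # scan in the given traversal order
    backward = scan_fn(u[::-1])[::-1]     # scan the reversed sequence, flip back
    return forward + backward             # fuse the two directional streams

def cross_scan_2d(feature_map, scan_fn):
    """feature_map: (H, W, d); scan along rows and columns, then merge."""
    H, W, d = feature_map.shape
    row_seq = feature_map.reshape(H * W, d)                      # row-major order
    col_seq = feature_map.transpose(1, 0, 2).reshape(H * W, d)   # column-major order
    y_rows = bidirectional_scan(row_seq, scan_fn).reshape(H, W, d)
    y_cols = bidirectional_scan(col_seq, scan_fn).reshape(W, H, d).transpose(1, 0, 2)
    return y_rows + y_cols
```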

5. Computational Complexity, Benchmarks, and Efficiency

Mamba Block exhibits the following performance characteristics:

  • Time complexity: $O(NL)$ per block for state dimension $N$ and sequence length $L$, versus $O(L^2 d)$ for self-attention (see the back-of-the-envelope comparison after this list).
  • Memory: $O(NL)$ state storage; no attention matrix is materialized.
  • Wall-clock: On an A100 GPU, fused Mamba runs $1.5$–$2\times$ faster than corresponding Transformer attention blocks at mainstream $L$ and batch sizes.
  • Vision benchmarks:
    • ViM-Tiny (~7M params, $4.5$ GFLOPs) matches or exceeds lightweight DeiT-Tiny, at lower peak memory.
    • VMamba-Small ($9.1$ GFLOPs) is within $1$–$2\%$ of Swin-Tiny accuracy on ImageNet-1K, with $20$–$30\%$ faster inference.
  • Scalability: Mamba Blocks are “drop-in” linear-time alternatives to attention layers for sequence and image patch streams of arbitrary length.
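A back-of-the-envelope count makes the asymptotic gap concrete; the state and channel dimensions below are assumed illustrative values and all constant factors are ignored, so only the trend with $L$ is meaningful.

```python
# Rough operation counts for the O(NL) vs O(L^2 d) comparison above.
N, d = 16, 384                       # assumed state and channel dimensions
for L in (1024, 2048, 8192):
    ssm_ops = N * L                  # O(N L) selective-scan updates
    attn_ops = L * L * d             # O(L^2 d) attention interactions
    print(f"L={L:5d}  NL={ssm_ops:>10,}  L^2*d={attn_ops:>15,}  ratio={attn_ops // ssm_ops:,}x")
```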

6. Significance, Limitations, and Current Directions

The Mamba Block addresses the prohibitive quadratic complexity of vision Transformers, enabling real-time global modeling on high-resolution data. Its input-dependent state-space design and hardware-oriented scan unlock linear scaling and broad extensibility (2D, 3D, multi-modal, temporal, etc.). Significant applications include object detection, segmentation, medical imaging, remote sensing, and video analysis, with numerous benchmarks demonstrating superior throughput and competitive accuracy.

Current research explores integrating convolutional, attention, and multi-directional recurrence variants; refining quantization; and augmenting block-level selection and gating for richer expressivity. Limitations include the inability to inherently model local structure (prompting conv/recurrence grafts), and reliance on proper scan order and selection for maximal effectiveness.

Future work is focused on further hardware specialization, dynamic state-space structure, and theoretical characterizations of input-dependent discretization (Zhang et al., 2024).
