Mamba Blocks: Scalable SSM Modules
- Mamba blocks are computational modules built on state-space models that offer scalable, efficient long-range dependency modeling with linear time and memory complexity.
- They replace quadratic self-attention with dynamic, input-dependent parameterization, enabling robust global context capture and efficient parallelism.
- Architectural variants and hybrid compositions across tasks like image, video, and sequence modeling lead to significant speed, memory, and accuracy improvements.
Mamba blocks are computational modules engineered around state-space models (SSMs) with selective, input-dependent parameters. They constitute the principal building blocks in the Mamba architecture, offering a scalable mechanism for long-range dependency modeling with linear complexity in both time and memory. The design of Mamba blocks allows for direct replacement of quadratic-cost self-attention operations in deep networks while guaranteeing global receptive fields and efficient parallelism. Their mathematical foundations, architectural variants, and hybridization strategies position Mamba blocks as foundational components across a wide spectrum of image, video, multimodal, and sequence modeling tasks.
1. Mathematical Formulation of the Mamba Block
The canonical Mamba block is built upon a linear, time-invariant SSM, typically expressed in continuous form as

$$\dot{h}(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t) + D\,x(t),$$

where $A$ is the transition matrix, $B$ and $C$ are the input/output mappings, and $D$ is a direct (skip) term.
Discretization with step $\Delta$ (zero-order hold) gives

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B,$$

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t + D\,x_t.$$

This structure enables linear-time computation in sequence length $L$, requiring only rank-$N$ matrix-vector multiplies per step, resulting in $O(L \cdot N \cdot D)$ total computational cost. Modern Mamba blocks introduce "selective" mechanisms: $\Delta$, $B$, and $C$ are token-wise functions of the input $x_t$, implemented via small learned projections or convolutions, ensuring input-adaptive dynamics.
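To ground the recurrence, the following is a minimal sequential reference in Python (PyTorch), not the hardware-aware parallel scan used in practice; all tensor names, shapes, and the simplified Euler discretization of $\bar{B}$ are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def selective_ssm_scan(x, A, W_dt, W_B, W_C, D):
    """Sequential reference of the discretized selective SSM recurrence.

    x    : (L, d) input sequence (L tokens, d channels)
    A    : (d, n) diagonal state matrix per channel (typically negative)
    W_dt : (d, d) projection producing the token-wise step size Δ_t
    W_B  : (d, n) projection producing the input-dependent B_t
    W_C  : (d, n) projection producing the input-dependent C_t
    D    : (d,)   direct (skip) term
    """
    L, d = x.shape
    n = A.shape[1]
    h = x.new_zeros(d, n)                         # one n-dim state per channel
    y = torch.empty_like(x)
    for t in range(L):
        delta = F.softplus(x[t] @ W_dt)           # (d,) token-wise step size
        B_t, C_t = x[t] @ W_B, x[t] @ W_C         # (n,) selective input/output maps
        A_bar = torch.exp(delta[:, None] * A)     # zero-order-hold discretization of A
        B_bar = delta[:, None] * B_t[None, :]     # simplified (Euler) discretization of B
        h = A_bar * h + B_bar * x[t][:, None]     # O(n*d) work per token -> O(L*n*d) total
        y[t] = (h * C_t[None, :]).sum(-1) + D * x[t]
    return y

# Example: 64 tokens, 8 channels, state size 16; cost grows linearly with L.
L, d, n = 64, 8, 16
y = selective_ssm_scan(torch.randn(L, d), -torch.rand(d, n),
                       0.1 * torch.randn(d, d), 0.1 * torch.randn(d, n),
                       0.1 * torch.randn(d, n), torch.randn(d))
```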
2. Architectural Variants and Nesting Strategies
Vision: MiM-ISTD and Hierarchical Models
MiM-ISTD (Chen et al., 4 Mar 2024) exemplifies a nested block architecture. The Outer Mamba block operates at the macro-level ("visual sentences" via image patches), providing global modeling, while the Inner Mamba block acts on "visual words" (sub-patches within a patch) to recover local detail. Pseudocode illustrates the layered update:
```
for i in 1…n_s:
    W^i_ℓ = W^i_{ℓ-1} + Mamba(LayerNorm(W^i_{ℓ-1}))   # Inner Mamba on visual words
for i in 1…n_s:
    S^i_{ℓ-½} = S^i_{ℓ-1} + FC(Vec(W^i_ℓ))            # Word-to-sentence aggregation
S_ℓ = S_{ℓ-½} + Mamba(LayerNorm(S_{ℓ-½}))             # Outer Mamba on visual sentences
```
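As a hedged, runnable illustration of the same nested residual structure (not the authors' implementation), the PyTorch sketch below treats the inner and outer Mamba operators as interchangeable sequence mixers supplied by the caller; `NestedMambaLayer` and the identity stand-ins are illustrative names.

```python
import torch
import torch.nn as nn

class NestedMambaLayer(nn.Module):
    """Inner mixer on visual words, outer mixer on visual sentences (residual updates).

    `inner_mixer` / `outer_mixer` stand in for Mamba blocks: any (B, L, D) -> (B, L, D) module.
    """
    def __init__(self, word_dim, sent_dim, words_per_sentence, inner_mixer, outer_mixer):
        super().__init__()
        self.inner_norm = nn.LayerNorm(word_dim)
        self.outer_norm = nn.LayerNorm(sent_dim)
        self.inner_mixer = inner_mixer
        self.outer_mixer = outer_mixer
        # FC(Vec(W)): flatten each sentence's words and project into the sentence embedding.
        self.word_to_sentence = nn.Linear(words_per_sentence * word_dim, sent_dim)

    def forward(self, words, sentences):
        # words:     (B, n_s, n_w, word_dim)  visual words grouped per sentence
        # sentences: (B, n_s, sent_dim)       visual sentences (patch-level tokens)
        B, n_s, n_w, d_w = words.shape
        w = words.reshape(B * n_s, n_w, d_w)
        w = w + self.inner_mixer(self.inner_norm(w))          # Inner Mamba (local detail)
        words_out = w.reshape(B, n_s, n_w, d_w)
        s = sentences + self.word_to_sentence(words_out.reshape(B, n_s, n_w * d_w))
        s = s + self.outer_mixer(self.outer_norm(s))          # Outer Mamba (global context)
        return words_out, s

# Shape check with identity mixers standing in for real Mamba blocks.
layer = NestedMambaLayer(word_dim=32, sent_dim=128, words_per_sentence=4,
                         inner_mixer=nn.Identity(), outer_mixer=nn.Identity())
w_out, s_out = layer(torch.randn(2, 16, 4, 32), torch.randn(2, 16, 128))
```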
Multimodal, Point Clouds, and 3D
Pamba (Li et al., 25 Jun 2024) employs ConvMamba blocks, which serialize unordered point clouds via multiple space-filling curves (Hilbert, Morton) and apply bidirectional SSM scans. Local geometric context is aggregated by sparse convolution, while global context arises from the Mamba scan. MambaFusion (Wang et al., 6 Jul 2025) extends this model, adding height-fidelity encoding and Hybrid Mamba Blocks that alternate local and global mixing in both raw and bird’s-eye view spaces.
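To make the serialization step concrete, the sketch below quantizes raw points into voxels and orders them by Morton (z-order) keys; the Hilbert variant is analogous. The helper names (`morton_code`, `serialize_points`) and the voxel size are illustrative assumptions, not Pamba's exact pipeline.

```python
import numpy as np

def morton_code(coords, bits=10):
    """Interleave the bits of quantized (x, y, z) voxel indices into Morton (z-order) keys.

    coords: (N, 3) int64 voxel indices in [0, 2**bits). Sorting by the returned keys
    yields the 1-D serialization that a Mamba scan consumes.
    """
    codes = np.zeros(len(coords), dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            codes |= ((coords[:, axis] >> b) & 1) << (3 * b + axis)
    return codes

def serialize_points(points, voxel=0.1, bits=10):
    """Quantize raw points to voxels and return the scan order along the Morton curve."""
    mins = points.min(axis=0, keepdims=True)
    vox = np.clip(((points - mins) / voxel).astype(np.int64), 0, 2**bits - 1)
    return np.argsort(morton_code(vox, bits))

# A bidirectional scan simply visits `order` and then `order[::-1]`.
pts = np.random.rand(2048, 3) * 10.0
order = serialize_points(pts)
```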
Super-Resolution and Hierarchical Blocks
Hi-Mamba (Qiao et al., 14 Oct 2024) introduces Hierarchical Mamba Blocks (HMBs) combining Local SSMs and Region SSMs with single-direction scanning. Direction alternation (DA-HMG) cascades blocks with cycling scan directions, efficiently covering full 2D spatial context without the overhead of traditional 2D SSM. Experimental results demonstrate up to $0.29$ dB PSNR gains and reduced runtime compared with multi-directional approaches.
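A minimal sketch of the direction-alternation idea (illustrative helper names, not Hi-Mamba's exact Local/Region SSM code): each cascaded block flattens the feature map along a different single-direction scan, so the stack as a whole covers 2D context without multi-directional scanning inside any one block.

```python
import torch

def directional_flatten(x, direction):
    """Flatten a (B, C, H, W) map into a (B, L, C) sequence along one single-direction scan."""
    B, C, H, W = x.shape
    if direction.startswith("row"):
        seq = x.permute(0, 2, 3, 1).reshape(B, H * W, C)   # row-major order
    else:
        seq = x.permute(0, 3, 2, 1).reshape(B, H * W, C)   # column-major order
    return torch.flip(seq, dims=[1]) if direction.endswith("_rev") else seq

def cascade(x, blocks, directions=("row", "row_rev", "col", "col_rev")):
    """Apply sequence mixers in a cascade, cycling the scan direction per block."""
    B, C, H, W = x.shape
    for i, block in enumerate(blocks):
        d = directions[i % len(directions)]
        seq = directional_flatten(x, d)
        seq = seq + block(seq)                             # residual 1-D scan
        # Undo the flattening so the next block sees a (B, C, H, W) map again.
        if d.endswith("_rev"):
            seq = torch.flip(seq, dims=[1])
        if d.startswith("row"):
            x = seq.reshape(B, H, W, C).permute(0, 3, 1, 2)
        else:
            x = seq.reshape(B, W, H, C).permute(0, 3, 2, 1)
    return x

# Shape check with identity mixers standing in for Local/Region SSM blocks.
out = cascade(torch.randn(2, 64, 32, 32), [torch.nn.Identity() for _ in range(4)])
```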
3. Selectivity, Attention Analogy, and Input-Dependent Dynamics
The defining feature of Mamba blocks is the selection of state-space parameters as input-dependent functions, drawing a direct analogy to the "attention maps" in transformers. In CrackMamba (He et al., 22 Jul 2024), selection mechanisms generate per-position parameters $B_t$, $C_t$, and $\Delta_t$ conditioned on the input $x_t$, akin to how self-attention computes adaptive weights. Unrolling the selective recurrence makes the analogy explicit:

$$y_t = \sum_{s \le t} C_t \Bigl(\prod_{k=s+1}^{t} \bar{A}_k\Bigr) \bar{B}_s\, x_s + D\,x_t.$$

In the attention perspective, these token-wise kernels replace the explicit pairwise attention computations with a scan that computes dynamic context weighting. This design, when fused with convolutional attention maps, yields robust global receptive fields and parameter/memory savings.
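The analogy can be inspected directly by materializing the causal mixing matrix that a selective scan implicitly applies; the sketch below does this for a single scalar channel purely for illustration. The real block never forms this $O(L^2)$ matrix; it computes the same output with a linear-time scan. Names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def implicit_attention_matrix(x, a, W_dt, W_B, W_C):
    """Materialize the causal mixing matrix M implied by a one-channel selective SSM.

    M[t, s] = C_t * (prod_{k=s+1..t} A_bar_k) * B_bar_s acts like a causal attention
    weight, but the block itself obtains y = M @ x via a scan, never forming M.

    x : (L,) scalar input channel; a : (n,) diagonal state matrix (negative values);
    W_dt : scalar Δ projection; W_B, W_C : (n,) input/output projections.
    """
    L, n = x.shape[0], a.shape[0]
    delta = F.softplus(x * W_dt)                 # (L,) token-wise step size
    A_bar = torch.exp(delta[:, None] * a)        # (L, n) discretized transitions
    B_bar = delta[:, None] * (x[:, None] * W_B)  # (L, n) input-dependent B_bar_s
    C = x[:, None] * W_C                         # (L, n) input-dependent C_t
    M = torch.zeros(L, L)
    for t in range(L):
        prod = torch.ones(n)                     # running product of A_bar_{s+1..t}
        for s in range(t, -1, -1):
            M[t, s] = (C[t] * prod * B_bar[s]).sum()
            prod = prod * A_bar[s]
    return M                                     # lower-triangular "attention map"

# Rows of M show how each output token weights past tokens.
L, n = 16, 8
M = implicit_attention_matrix(torch.randn(L), -torch.rand(n),
                              torch.tensor(0.5), 0.1 * torch.randn(n), 0.1 * torch.randn(n))
```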
4. Computational Complexity: Linear-Time Scaling and Pragmatic Implementation
Across domains, Mamba block complexity is $O(L \cdot N \cdot D)$ (with $L$ the token length, $N$ the SSM hidden size, and $D$ the feature dimension), compared to $O(L^2 \cdot D)$ for full self-attention. For models with nested blocks or multi-scale flows (e.g., MiM-ISTD, ms-Mamba (Karadag et al., 10 Apr 2025)), the additional cost from inner blocks or multiple scales is negligible relative to the outer scan, adding only minor overhead for inner word modeling and multi-scale fusion. Parallel scan implementations, hardware-aware dataflow tricks, and bidirectional sweeps further enable linear-time inference even for very long sequences, allowing entire state-of-the-art 3D fusion pipelines to fit in single-GPU memory (MambaFusion, Wang et al., 6 Jul 2025).
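A back-of-the-envelope comparison (constants, projections, and normalization ignored; the state size $N=16$ and the hidden width are assumptions) illustrates the scaling gap:

```python
def mixing_flops(L, D, N=16):
    """Rough per-layer mixing cost in multiply-adds: selective scan vs. full self-attention.
    Only the dominant terms are counted; the point is the L vs. L**2 scaling."""
    ssm = L * N * D         # one N-dimensional state update per token and channel
    attn = 2 * L * L * D    # QK^T scores plus the attention-weighted value sum
    return ssm, attn

for L in (1_024, 16_384, 262_144):   # e.g. up to Jamba-scale 256K contexts
    ssm, attn = mixing_flops(L, D=4_096)
    print(f"L={L:>7,}  SSM~{ssm:.2e}  attention~{attn:.2e}  ratio~{attn / ssm:,.0f}x")
```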
5. Hybrid Compositions and Integration with Other Paradigms
Hybrid architectures interleave Mamba blocks with attention, CNNs, or Mixture-of-Experts (MoE) layers for architectural flexibility; a minimal layer-stacking sketch follows the list below.
- MambaVision (Hatamizadeh et al., 10 Jul 2024) employs Mamba blocks for early multi-scale mixing and switches to MHSA in deeper layers for global relational modeling, with ablations favoring late-stage attention for best overall accuracy.
- Jamba (Lieber et al., 28 Mar 2024) uses heavy interleaving (ratio 1:7 of attention:Mamba) with MoE on selected Mamba layers. This approach reduces KV cache memory from $32$ GB to $4$ GB for $256$K contexts, achieves inference throughput gains, and matches or surpasses Llama-2-70B and Mixtral across benchmarks.
- PKD (Medina et al., 3 Mar 2025): Mamba blocks in student models enable aggressive model compression (down to a small fraction of teacher FLOPs for weak students), yielding ensembles that closely track teacher accuracy with major resource savings.
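A minimal layer-stacking sketch of such interleaving, assuming generic layer factories rather than any specific paper's modules, is given below; the 1-in-8 attention placement mirrors Jamba's 1:7 attention:Mamba ratio.

```python
import torch.nn as nn

def build_hybrid_stack(depth, d_model, mamba_layer, attention_layer, moe_layer=None,
                       attn_every=8, moe_every=2):
    """Interleave sequence mixers: one attention layer per `attn_every` layers
    (a 1:7 attention:Mamba ratio when attn_every=8), optionally pairing every
    `moe_every`-th Mamba layer with a Mixture-of-Experts feed-forward.
    The *_layer arguments are factories returning (B, L, D) -> (B, L, D) modules."""
    layers = []
    for i in range(depth):
        if (i + 1) % attn_every == 0:
            layers.append(attention_layer(d_model))                 # periodic global attention
        elif moe_layer is not None and (i + 1) % moe_every == 0:
            layers.append(nn.Sequential(mamba_layer(d_model), moe_layer(d_model)))
        else:
            layers.append(mamba_layer(d_model))                     # linear-time Mamba mixing
    return nn.Sequential(*layers)

# Shape-level check with stand-ins; real models would plug in Mamba, MHSA, and MoE blocks.
stack = build_hybrid_stack(depth=16, d_model=512,
                           mamba_layer=lambda d: nn.Identity(),
                           attention_layer=lambda d: nn.Identity())
```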
6. Empirical Impact and Performance Benchmarks
Empirical evaluations demonstrate consistently competitive or SOTA performance in diverse contexts:
- MiM-ISTD (Chen et al., 4 Mar 2024): faster inference and lower GPU memory, with matched or better IoU and nIoU on NUAA-SIRST/IRSTD-1k.
- PKD (Medina et al., 3 Mar 2025): MNIST and CIFAR-10 ensembles approach teacher accuracy at a fraction of teacher FLOPs; weak learners trained at a small share of the compute maintain competitive accuracy.
- Hi-Mamba (Qiao et al., 14 Oct 2024): up to $0.29$ dB PSNR gain for SR with FLOP savings, achieving multi-directional coverage with single-direction scans.
- MambaFusion (Wang et al., 6 Jul 2025): $75.0$ NDS/$72.7$ mAP at 4.7 FPS, exceeding UniTR in both NDS and mAP while running faster than alternative global fusion approaches.
- CrackMamba (He et al., 22 Jul 2024): mIoU/mDice gains on Deepcrack and $5.84$/$4.38$ gains on Steelcrack, all with reduced parameters and MACs.
7. Design Choices, Hyperparameters, and Limitations
Key block parameters include the SSM hidden size $N$, directionality (single vs. multi), scan pattern, convolution/gating factors, block depth, bidirectionality (enabling both forward and reverse passes), and fusion strategies (e.g., attention, channel blending).
Hardware-aware hyperparameter selection (e.g., block count, kernel size, expansion factor) allows efficient deployment within FLOPs and latency constraints. Limitations include increased hyperparameter and parameter count for multi-scale or nested blocks, and potential modeling deficits if single-scan directionality fails to capture spatial relationships (addressed by DA-HMG and hybrid compositions).
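A compact way to see these knobs together is an illustrative configuration object; field names are generic and not tied to any particular codebase.

```python
from dataclasses import dataclass

@dataclass
class MambaBlockConfig:
    """Illustrative grouping of the block-level knobs discussed above."""
    d_model: int = 256          # token embedding width D
    d_state: int = 16           # SSM hidden (state) size N
    d_conv: int = 4             # depthwise convolution kernel size before the scan
    expand: int = 2             # channel expansion factor inside the block
    n_blocks: int = 12          # block depth
    bidirectional: bool = True  # run forward and reverse scans and fuse them
    scan_pattern: str = "row"   # e.g. "row", "col", "hilbert", "morton"
    fusion: str = "add"         # how directions/branches are merged: "add", "concat", "attn"
```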
A plausible implication is that future extensions could leverage token pruning within SSM scans, adapt block types to multimodal fusion with conditional SSM kernels, blend SSMs with stronger spatial convolutions for low-level vision tasks, or integrate block selection/dynamic routing for further acceleration.
Summary Table: Computational Complexity in Representative Mamba Block Variants
| Model/Task | Mamba Block Complexity | Transformer Complexity | Empirical Speed / Memory Gains |
|---|---|---|---|
| MiM-ISTD (Chen et al., 4 Mar 2024) | $O(L \cdot N \cdot D)$ (linear in $L$) | $O(L^2 \cdot D)$ | Faster inference, lower GPU memory |
| Hi-Mamba (Qiao et al., 14 Oct 2024) | $O(L \cdot N \cdot D)$ per block | $O(L^2 \cdot D)$ | $0.29$ dB PSNR gain, reduced runtime |
| MambaFusion (Wang et al., 6 Jul 2025) | Linear in token count | $O(L^2 \cdot D)$ | Faster global fusion, higher mAP |
| Pamba (Li et al., 25 Jun 2024) | Linear in point count | $O(L^2 \cdot D)$ | Faster, $1/3$ GPU memory |
| PKD (Medina et al., 3 Mar 2025) | $O(L \cdot N \cdot D)$ | – | Near-teacher accuracy at reduced FLOPs |
Mamba blocks thus serve as modular, domain-adaptive, and highly efficient alternatives to attention for global modeling, enabling a new generation of scalable deep architectures across modalities, sequence lengths, and application domains.