
Mamba Blocks: Scalable SSM Modules

Updated 11 November 2025
  • Mamba blocks are computational modules built on state-space models that offer scalable, efficient long-range dependency modeling with linear time and memory complexity.
  • They replace quadratic self-attention with dynamic, input-dependent parameterization, enabling robust global context capture and efficient parallelism.
  • Architectural variants and hybrid compositions across tasks like image, video, and sequence modeling lead to significant speed, memory, and accuracy improvements.

Mamba blocks are computational modules engineered around state-space models (SSMs) with selective, input-dependent parameters. They constitute the principal building blocks in the Mamba architecture, offering a scalable mechanism for long-range dependency modeling with linear complexity in both time and memory. The design of Mamba blocks allows for direct replacement of quadratic-cost self-attention operations in deep networks while guaranteeing global receptive fields and efficient parallelism. Their mathematical foundations, architectural variants, and hybridization strategies position Mamba blocks as foundational components across a wide spectrum of image, video, multimodal, and sequence modeling tasks.

1. Mathematical Formulation of the Mamba Block

The canonical Mamba block is built upon a linear, time-invariant SSM, typically expressed in continuous form as

$$h'(t) = A h(t) + B x(t), \qquad y(t) = C h(t) + D x(t),$$

where $A \in \mathbb{C}^{N \times N}$ is the transition matrix, $B, C \in \mathbb{C}^{N}$ are the input and output mappings, and $D \in \mathbb{C}$ is a direct feedthrough term.

Discretization with step $\Delta$ (zero-order hold) gives

$$h_k = \bar{A} h_{k-1} + \bar{B} x_k, \qquad y_k = C h_k + D x_k, \qquad \bar{A} = e^{\Delta A}, \quad \bar{B} \approx \Delta B.$$

This structure enables linear-time computation in sequence length $L$, requiring only rank-$N$ matrix-vector multiplies per step, for a total computational cost of $O(LNd)$. Modern Mamba blocks introduce "selective" mechanisms: $\bar{A}_k$, $\bar{B}_k$, and $C_k$ are token-wise functions of $x_k$, implemented via small learned projections or convolutions, ensuring input-adaptive dynamics.
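For concreteness, the recurrence above can be written as a naive sequential scan. The NumPy snippet below is a minimal sketch only, assuming a diagonal transition matrix $A$ (as in typical S4/Mamba-style parameterizations) and a single input channel; names and sizes are illustrative, not an optimized or official implementation.

import numpy as np

def ssm_scan(x, A, B, C, D, delta):
    """Naive O(L*N) recurrence for one channel.
    x: (L,) inputs; A: (N,) diagonal transition; B, C: (N,); D, delta: scalars."""
    A_bar = np.exp(delta * A)        # zero-order-hold discretization of the diagonal A
    B_bar = delta * B                # first-order approximation, matching the text
    h = np.zeros_like(A)             # hidden state h_0 = 0
    y = np.empty_like(x)
    for k, xk in enumerate(x):       # one rank-N update per token -> linear in L
        h = A_bar * h + B_bar * xk
        y[k] = C @ h + D * xk
    return y

rng = np.random.default_rng(0)
N = 16
y = ssm_scan(rng.normal(size=1024), A=-np.abs(rng.normal(size=N)),
             B=rng.normal(size=N), C=rng.normal(size=N), D=1.0, delta=0.1)

In a selective block, A_bar, B_bar, and C would additionally be recomputed per token, as discussed in Section 3.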

2. Architectural Variants and Nesting Strategies

Vision: MiM-ISTD and Hierarchical Models

MiM-ISTD (Chen et al., 4 Mar 2024) exemplifies a nested block architecture. The Outer Mamba block operates at the macro-level ("visual sentences" via image patches), providing global modeling, while the Inner Mamba block acts on "visual words" (sub-patches within a patch) to recover local detail. Pseudocode illustrates the layered update:

for i in 1 … n_s:
    W^i_ℓ = W^i_{ℓ-1} + Mamba( LayerNorm(W^i_{ℓ-1}) )      # Inner Mamba on visual words
for i in 1 … n_s:
    S^i_{ℓ-½} = S^i_{ℓ-1} + FC( Vec(W^i_ℓ) )               # Word-to-sentence projection
S_ℓ = S_{ℓ-½} + Mamba( LayerNorm(S_{ℓ-½}) )                # Outer Mamba on visual sentences
This hierarchy achieves $O(n)$ time per block, outperforming self-attention-based SOTA by up to 10× in speed while reducing peak memory by more than 70%.

Multimodal, Point Clouds, and 3D

Pamba (Li et al., 25 Jun 2024) employs ConvMamba blocks, which serialize unordered point clouds via multiple space-filling curves (Hilbert, Morton) and apply bidirectional SSM scans. Local geometric context is aggregated by sparse convolution, while global context arises from the Mamba scan. MambaFusion (Wang et al., 6 Jul 2025) extends this model, adding height-fidelity encoding and Hybrid Mamba Blocks that alternate local and global mixing in both raw and bird’s-eye view spaces.
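As a rough illustration of the serialization step (not Pamba's actual code, which also uses Hilbert curves and bidirectional scans), the snippet below orders a toy point cloud along a Morton (Z-order) curve before feeding it to an SSM scan; all names and sizes are illustrative.

import numpy as np

def morton_key(x, y, z, bits=10):
    """Interleave the bits of three voxel coordinates into one Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((int(x) >> b) & 1) << (3 * b)
        key |= ((int(y) >> b) & 1) << (3 * b + 1)
        key |= ((int(z) >> b) & 1) << (3 * b + 2)
    return key

rng = np.random.default_rng(0)
pts = rng.random((2048, 3))                        # unordered point cloud in [0, 1)^3
vox = (pts * 1023).astype(int)                     # quantize to a 1024^3 voxel grid
order = np.argsort([morton_key(*v) for v in vox])  # serialization order along the curve
serialized = pts[order]                            # 1D token sequence for the SSM scan

A Hilbert-curve ordering or a reversed sweep would simply supply a different order array to the same scan.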

Super-Resolution and Hierarchical Blocks

Hi-Mamba (Qiao et al., 14 Oct 2024) introduces Hierarchical Mamba Blocks (HMBs) combining Local SSMs and Region SSMs with single-direction scanning. Direction alternation (DA-HMG) cascades blocks with cycling scan directions, efficiently covering the full 2D spatial context without the 4× overhead of traditional 2D SSMs. Experimental results demonstrate up to 0.29 dB PSNR gains and a 61% runtime reduction compared to multi-directional approaches.
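The direction-alternation idea can be sketched as follows: each cascaded block performs a single 1D scan, but the flattening order of the H×W grid cycles across depth. This is an illustrative sketch with a hypothetical mamba_block placeholder, not the Hi-Mamba implementation.

import numpy as np

def scan_order(H, W, direction):
    """Flattening order of an H x W grid for one single-direction scan."""
    idx = np.arange(H * W).reshape(H, W)
    if direction == "row":                  # left-to-right, top-to-bottom
        return idx.reshape(-1)
    if direction == "row_rev":              # right-to-left
        return idx[:, ::-1].reshape(-1)
    if direction == "col":                  # column-major, top-to-bottom
        return idx.T.reshape(-1)
    return idx.T[:, ::-1].reshape(-1)       # "col_rev": column-major, bottom-to-top

H, W, C = 64, 64, 32
x = np.random.randn(H * W, C)               # flattened feature map: H*W tokens, C channels
for direction in ["row", "row_rev", "col", "col_rev"] * 2:   # 8 cascaded blocks
    order = scan_order(H, W, direction)
    x = x[order]                             # reorder tokens for this block's single scan
    # x = mamba_block(x)                     # hypothetical single-direction SSM block
    x = x[np.argsort(order)]                 # restore the spatial layout afterwards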

3. Selectivity, Attention Analogy, and Input-Dependent Dynamics

The defining feature of Mamba blocks is the selection of state-space parameters as input-dependent functions, drawing a direct analogy to the "attention maps" in transformers. In CrackMamba (He et al., 22 Jul 2024), selection mechanisms generate per-position matrices $A_k$, $B_k$, $C_k$ conditioned on $x_k$, akin to how self-attention computes adaptive weights:

$$B_k = W_B x_k + b_B, \qquad C_k = W_C x_k + b_C, \qquad A_k = \exp(A_0 + W_A x_k + b_A).$$

In the attention perspective, these token-wise kernels replace explicit pairwise attention computations with a scan that computes dynamic context weighting. This design, when fused with convolutional attention maps, yields robust global receptive fields together with parameter and memory savings.
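A minimal sketch of this selection step is given below, assuming simple linear projections and illustrative shapes; the weight names mirror the equations above but are otherwise hypothetical.

import numpy as np

def selective_params(x, W_A, b_A, A0, W_B, b_B, W_C, b_C):
    """Token-wise SSM parameters conditioned on the input x of shape (L, d)."""
    B = x @ W_B + b_B                  # (L, N): input-dependent B_k
    C = x @ W_C + b_C                  # (L, N): input-dependent C_k
    A = np.exp(A0 + x @ W_A + b_A)     # (L, N): per-token transition A_k
    return A, B, C

rng = np.random.default_rng(0)
L, d, N = 128, 64, 16
x = rng.normal(size=(L, d))
W_A, W_B, W_C = (0.01 * rng.normal(size=(d, N)) for _ in range(3))
b_A = b_B = b_C = np.zeros(N)
A0 = -np.ones(N)                       # baseline log-decay so A_k stays below 1 on average
A, B, C = selective_params(x, W_A, b_A, A0, W_B, b_B, W_C, b_C)
# These per-token A_k, B_k, C_k then drive the recurrence h_k = A_k * h_{k-1} + B_k * x_k.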

4. Computational Complexity: Linear-Time Scaling and Pragmatic Implementation

Across domains, Mamba block complexity is $O(LNd)$ (with $L$ the token length, $N$ the hidden state size, and $d$ the feature dimension), compared to $O(L^2 d)$ for full self-attention. For models with nested blocks or multi-scale flows (e.g., MiM-ISTD, ms-Mamba (Karadag et al., 10 Apr 2025)), the additional cost from inner blocks or multiple scales is negligible, with $m \ll n$ and $c \ll D$ for inner word modeling and $S \ll L$ for multi-scale fusion. Parallel scan implementations, hardware-aware dataflow tricks, and bidirectional sweeps further enable linear-time inference even for very long sequences, allowing entire SOTA 3D fusion pipelines to fit in single-GPU memory (MambaFusion, Wang et al., 6 Jul 2025).
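The scaling difference is easy to quantify with a back-of-the-envelope count; the snippet below uses illustrative sizes and omits constants and projection costs, so the numbers are not measurements from any cited paper.

# Rough per-layer FLOP comparison for the mixing operation only.
L, N, d = 100_000, 16, 1024           # sequence length, SSM state size, feature dim
ssm_flops = L * N * d                 # O(L N d): one rank-N state update per token
attn_flops = L * L * d                # O(L^2 d): pairwise scores plus value mixing
print(f"SSM ~{ssm_flops:.2e} FLOPs vs. attention ~{attn_flops:.2e} FLOPs "
      f"(~{attn_flops / ssm_flops:.0f}x)")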

5. Hybrid Compositions and Integration with Other Paradigms

Hybrid architectures interleave Mamba blocks with attention, CNNs, or Mixture-of-Experts (MoE) for architectural flexibility; a configuration-level sketch of such interleaving follows the list below.

  • MambaVision (Hatamizadeh et al., 10 Jul 2024) employs Mamba blocks for early multi-scale mixing, substituting MHSA in deeper layers for global relational modeling, with ablations favoring late-stage attention for best overall accuracy.
  • Jamba (Lieber et al., 28 Mar 2024) uses heavy interleaving (a 1:7 attention:Mamba layer ratio) with MoE on selected layers. This approach reduces KV-cache memory from 32 GB to 4 GB for 256K-token contexts, achieves 3× inference throughput gains, and matches or surpasses Llama-2-70B and Mixtral across benchmarks.
  • PKD (Medina et al., 3 Mar 2025): Mamba blocks in student models enable aggressive model compression (down to 1% of teacher FLOPs for weak students), yielding ensembles that closely track teacher accuracy with major resource savings.
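As referenced above, the schedule-style sketch below illustrates such interleaving at the configuration level; the layer placement and MoE frequency are illustrative assumptions rather than the released Jamba layout.

def hybrid_schedule(n_layers=32, attn_every=8, moe_every=2):
    """One attention layer per `attn_every` layers (1:7 attention:Mamba for 8),
    with an MoE feed-forward on every `moe_every`-th layer."""
    layers = []
    for i in range(n_layers):
        mixer = "attention" if i % attn_every == attn_every - 1 else "mamba"
        ffn = "moe" if i % moe_every == moe_every - 1 else "dense"
        layers.append((mixer, ffn))
    return layers

for i, (mixer, ffn) in enumerate(hybrid_schedule()):
    print(f"layer {i:2d}: {mixer:9s} mixer + {ffn} feed-forward")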

6. Empirical Impact and Performance Benchmarks

Empirical evaluations demonstrate consistently competitive or SOTA performance in diverse contexts:

  • MiM-ISTD (Chen et al., 4 Mar 2024): 10× speed, 73% lower GPU memory, matched or better IoU and nIoU on NUAA-SIRST/IRSTD-1k.
  • PKD (Medina et al., 3 Mar 2025): MNIST ensemble (98% accuracy at 63% of teacher FLOPs), CIFAR-10 ensemble (86% at 20% of FLOPs). Weak learners at 1–5% compute maintain 50–72% accuracy.
  • Hi-Mamba (Qiao et al., 14 Oct 2024): +0.29 dB PSNR gain for SR, >50% FLOP savings, multi-directional coverage with single scans.
  • MambaFusion (Wang et al., 6 Jul 2025): 75.0 NDS / 72.7 mAP at 4.7 FPS, exceeding UniTR by +1.7 NDS / +2.2 mAP and running 1.5× faster than alternative global fusion.
  • CrackMamba (He et al., 22 Jul 2024): +1.41 mIoU / +1.10 mDice on Deepcrack, 5.84/4.38 gains on Steelcrack, all at >40% parameter/MAC reduction.

7. Design Choices, Hyperparameters, and Limitations

Key block parameters include the SSM hidden size $N$, directionality (single vs. multi), scan pattern, convolution/gating factors, block depth, bidirectionality (enabling both forward and reverse passes), and fusion strategies (e.g., attention, channel blending).

Hardware-aware hyperparameter selection (e.g., block count, kernel size, expansion factor) allows efficient deployment within FLOP and latency budgets. Limitations include increased hyperparameter and parameter counts for multi-scale or nested blocks, and potential modeling deficits if single-scan directionality fails to capture spatial relationships (addressed by DA-HMG and hybrid compositions).
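These knobs can be collected into a single configuration record; the dataclass below is a hypothetical illustration of the typical hyperparameters, not an official Mamba API.

from dataclasses import dataclass

@dataclass
class MambaBlockConfig:
    d_model: int = 256            # feature dimension d
    d_state: int = 16             # SSM hidden size N
    expand: int = 2               # channel expansion factor inside the block
    conv_kernel: int = 4          # width of the depthwise convolution before the scan
    bidirectional: bool = True    # enable forward and reverse scans
    scan_pattern: str = "row"     # e.g. "row", "col", "hilbert", "morton"
    n_blocks: int = 12            # block depth of the stack
    fusion: str = "gate"          # e.g. "gate", "attention", "channel_blend"

cfg = MambaBlockConfig(d_state=32, scan_pattern="hilbert", bidirectional=False)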

A plausible implication is that future extensions could leverage token pruning within SSM scans, adapt blocks to multimodal fusion with conditional SSM kernels, blend SSMs with stronger spatial convolutions for low-level vision tasks, or integrate block selection and dynamic routing for further acceleration.

Summary Table: Computational Complexity in Representative Mamba Block Variants

| Model/Task | Mamba Block Complexity | Transformer Complexity | Empirical Speed / Memory Gains |
|---|---|---|---|
| MiM-ISTD (Chen et al., 4 Mar 2024) | $O(nD^2)$ (linear in $n$) | $O(n^2 D)$ | 10× faster, 73% less memory |
| Hi-Mamba (Qiao et al., 14 Oct 2024) | $O(HWC)$ per block | $O((HW)^2 C)$ | +0.29 dB PSNR, 61% less runtime |
| MambaFusion (Wang et al., 6 Jul 2025) | $O(NC^2)$ ($N$ tokens) | $O(N^2)$ | 1.5× faster, +2.2 mAP |
| Pamba (Li et al., 25 Jun 2024) | $O(Nd)$ | $O(N^2 d)$ | 2× faster, 1/3 GPU memory |
| PKD (Medina et al., 3 Mar 2025) | $O(Ld^2 + L\,\mathrm{SSM}^2)$ | – | up to 99% of teacher accuracy at <63% FLOPs |

Mamba blocks thus serve as modular, domain-adaptive, and highly efficient alternatives to attention for global modeling, enabling a new generation of scalable deep architectures across modalities, sequence lengths, and application domains.
