
Spatial-Mamba: Efficient Spatial SSMs

Updated 19 February 2026
  • Spatial-Mamba is a state space model that integrates explicit spatial structure through selective state fusion, capturing both local and global dependencies in 2D/3D data.
  • It employs mechanisms like clustering-guided token selection and multi-dilated convolution to achieve linear complexity and reduce computational load compared to traditional self-attention.
  • Empirical results demonstrate that Spatial-Mamba achieves competitive performance across vision benchmarks, medical imaging, and point cloud tasks, often with significant efficiency gains.

Spatial-Mamba

Spatial-Mamba refers to a class of state space model (SSM) architectures, centered on the Mamba operator, that directly and efficiently model spatial dependencies in 2D (images), 3D (volumes or point clouds), or multi-dimensional signal domains. Unlike conventional transformer-based techniques that rely on self-attention and incur quadratic complexity in sequence length, Spatial-Mamba designs inject explicit spatial structure into the state-space recurrence—often via neighborhood-aware fusion, clustering, or multi-scale decomposition—yielding highly expressive yet computationally efficient models suitable for large-scale visual and spatio-temporal tasks.

1. Theoretical Foundations and Spatial SSM Formulation

Spatial-Mamba generalizes the selective SSM formalism to multidimensional arrays by augmenting the classical 1D state recurrence with mechanisms that capture local spatial dependencies. The discretized linear SSM recurrence is defined by:

$$x_t = \bar{A}_t x_{t-1} + \bar{B}_t u_t, \qquad y_t = C_t x_t + u_t$$

where, in standard Mamba, the transition matrices are made input-dependent ("selective") and the input $u_t$ may be the embedding of a pixel, voxel, patch, or token. For spatial tasks, directly applying a flattened scan order is suboptimal: it ignores adjacency and neighborhood structure, distorting both local and global context (Xiao et al., 2024).
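
The recurrence unrolls as a simple sequential scan. Below is a minimal NumPy sketch; the shapes and the way $\bar{A}_t$, $\bar{B}_t$, $C_t$ are supplied are illustrative assumptions rather than Mamba's exact discretization and hardware-aware implementation.

```python
import numpy as np

def selective_scan(u, A_bar, B_bar, C):
    """u: (T, D) inputs; A_bar: (T, N, N) and B_bar: (T, N, D) per-step
    transition/input maps; C: (T, D, N) readout. Returns y: (T, D)."""
    T, D = u.shape
    N = A_bar.shape[1]
    x = np.zeros(N)                          # hidden state x_t
    y = np.empty((T, D))
    for t in range(T):
        x = A_bar[t] @ x + B_bar[t] @ u[t]   # x_t = A_bar_t x_{t-1} + B_bar_t u_t
        y[t] = C[t] @ x + u[t]               # y_t = C_t x_t + u_t (skip term)
    return y
```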

Spatial-Mamba innovates by introducing structure-aware state fusion (SASF):

$$x_t = \bar{A}_t x_{t-1} + \bar{B}_t u_t, \qquad h_t = \sum_{k \in \Omega} \alpha_k\, x_{\rho_k(t)}, \qquad y_t = C_t h_t + u_t$$

where $\Omega$ indexes the spatial neighborhood around $t$, $\rho_k(t)$ maps to a local neighbor of $t$, and the $\alpha_k$ are learnable weights. This architecture can be realized via multi-dilated depth-wise convolution over the 2D array of hidden states. The fusion step enables a single scan to capture both long-range dependencies and fine-grained local context in the image grid (Xiao et al., 2024).
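
A hedged PyTorch sketch of this fusion step follows: hidden states from the scan, reshaped to a 2D grid, are mixed over the neighborhood $\Omega$ by multi-dilated depth-wise convolutions whose taps play the role of the $\alpha_k$. The kernel size and dilation set are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SASF(nn.Module):
    """Structure-aware state fusion over a (B, C, H, W) grid of scanned states."""
    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        # One depth-wise 3x3 branch per dilation; each branch realizes one
        # ring of neighborhood taps rho_k with learnable weights alpha_k.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d,
                      dilation=d, groups=channels, bias=False)
            for d in dilations
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h_t = x_t + sum_k alpha_k x_{rho_k(t)}, realized by summed branches.
        return x + sum(branch(x) for branch in self.branches)

# Usage: h = SASF(96)(torch.randn(2, 96, 56, 56))  # output keeps the input shape
```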

2. Architectural Variants: Clustering, Decomposition, and Attention Mechanisms

Spatial-Mamba has seen extensive architectural evolution, including:

A. Clustering-Guided Spatial Mamba (CSpaMamba)

CSpaMamba adaptively decomposes the image into $K$ clusters using a learnable clustering module, with the following workflow (Dewis et al., 22 Jan 2026); a minimal routing sketch follows the list:

  1. Learnable Clustering Module: Prototypes $c_k$ are initialized and updated by exponential moving average. Each token $x_i$ is assigned a soft membership $p_{i,k}$ (via a Gaussian kernel) and a hard cluster index for routing.
  2. Attention-Driven Token Selection: Within each cluster, a hybrid importance score ranks tokens by a weighted combination of (i) dynamic attention and (ii) static cluster membership. Only the top-$\hat{K}$ tokens are retained.
  3. Two-Stage Local-Global Decomposition:
    • Each cluster’s reduced sequence is processed by a local Mamba block (per-cluster).
    • Cluster outputs are scattered back, summed, and passed through a global Mamba block over the full spatial grid.
    • Sequence length is reduced from $L$ to $C\hat{K}$, giving an $L/(C\hat{K})$-fold complexity reduction.
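
A hedged PyTorch sketch of steps 1–2 and the per-cluster gather: the Gaussian bandwidth `sigma`, the mixing weight `lam`, and the function name `route_and_select` are illustrative assumptions; EMA prototype updates and the paper's local/global Mamba blocks are only indicated in comments.

```python
import torch

def route_and_select(tokens, prototypes, attn_score, k_hat, lam=0.5, sigma=1.0):
    """tokens: (L, D); prototypes: (K, D); attn_score: (L,) dynamic
    importance. Returns, per cluster, indices of the retained tokens."""
    # Step 1: soft membership p_{i,k} via a Gaussian kernel on distances.
    d2 = torch.cdist(tokens, prototypes).pow(2)        # (L, K) squared distances
    p = torch.softmax(-d2 / (2 * sigma ** 2), dim=1)   # soft memberships
    hard = p.argmax(dim=1)                             # hard routing index
    # Step 2: hybrid importance = dynamic attention + static membership.
    score = lam * attn_score + (1 - lam) * p.max(dim=1).values
    keep = []
    for k in range(prototypes.shape[0]):
        idx = (hard == k).nonzero(as_tuple=True)[0]
        top = idx[score[idx].topk(min(k_hat, idx.numel())).indices]
        keep.append(top)                               # top-k_hat per cluster
    return keep

# Step 3 (indicated only): gather tokens[keep[k]] through a per-cluster
# local Mamba block, scatter back to the full grid, sum, then run a
# global Mamba block over all L positions.
```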

B. Structure-Aware State Fusion and Multi-Scale Extensions

Spatial-Mamba introduces explicit spatial convolution in the state-fusion step, typically through multi-dilated depth-wise kernels, boosting local structure awareness (Xiao et al., 2024). Extensions include channel grouping, multi-directional or multi-axis scans (across x/y/z axes or planes), and multi-scale decomposition through down/up-sampling in neural encoders (Jin et al., 15 Mar 2025, Sun et al., 10 Nov 2025).

C. Integrative Mechanisms

Modern designs often fuse spatial Mamba with spectral (or channel/temporal) Mamba blocks, clustering, multi-head attention, and/or auxiliary modules (e.g., frequency-domain processing, cross-modal fusion) (Dewis et al., 22 Jan 2026, He et al., 2024, Sun et al., 10 Nov 2025). Integration is commonly performed by summing or concatenating the outputs of spatial and spectral branches, or by adaptive, attention-driven gating.
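
As one concrete instance of attention-driven gating, the sketch below blends spatial- and spectral-branch outputs with a learned per-feature gate. The single-layer gate is an illustrative assumption; the cited papers also use plain summation or concatenation.

```python
import torch
import torch.nn as nn

class GatedBranchFusion(nn.Module):
    """Adaptive gate between two branch outputs of shape (B, L, D)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, spatial: torch.Tensor, spectral: torch.Tensor):
        g = self.gate(torch.cat([spatial, spectral], dim=-1))  # (B, L, D) in [0, 1]
        return g * spatial + (1 - g) * spectral                # convex per-feature blend
```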

3. Integration in Spatial-Spectral and Spatio-Temporal Frameworks

The spatial Mamba block constitutes the spatial branch in advanced spatial-spectral and spatio-temporal networks:

  • Spatial-Spectral Mamba: Separate Mamba blocks process spatial tokens (across 2D grids) and spectral tokens (across frequency bands), followed by residual fusion. In the MambaHSI architecture, SpaMB flattens pixels into a grid and models long-range spatial interactions at pixel level; SpeMB splits the embedding into spectral groups for cross-band processing. A fusion module integrates both (Li et al., 9 Jan 2025).
  • Clustering-Guided Approaches: Clustering (e.g., CSSMamba) or deformable sparse selection (e.g., SDSpaM) further restrict the spatial Mamba’s attention to the most informative token subsets, often guided by data-driven criteria (proximity to a learned prototype, hybrid attention/static importance), drastically reducing computation while focusing on boundary or class-critical regions (Dewis et al., 22 Jan 2026, Dewis et al., 29 Jul 2025).
  • Spatio-Temporal Extensions: In spatial-temporal (e.g., EEG, traffic, video) or spatial-temporal-graph models, the spatial Mamba encodes across spatial channels/locations at each time step (or frame), generating features that are fused or recursively processed by temporal Mamba layers (Yuan et al., 2024, Yang et al., 2024, Tang et al., 9 Jul 2025).

4. Computational Complexity and Efficiency Considerations

Spatial-Mamba attains marked efficiency advances over self-attention-based architectures:

  • Complexity: For $L$ tokens of dimension $N$, spatial Mamba attains $O(LN)$ time and memory per block, versus global attention's $O(L^2 N)$ scaling (a worked example follows this list).
  • Sequence Reduction: Adaptive token selection, clustering, and pruning (e.g., top-$\hat{K}$ per cluster, sparse deformable sampling) further cut compute, yielding up to a $42\times$ reduction in spatial-block FLOPs (SDSpaM) or more (Dewis et al., 22 Jan 2026, Dewis et al., 29 Jul 2025).
  • Memory: Absence of global pairwise attention allows linear memory scaling even for full-sized images or 3D volumes.
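
A back-of-envelope comparison of these scalings, for an illustrative setting (all numbers are assumptions chosen for round arithmetic):

```python
L, N = 56 * 56, 96         # tokens in a 56x56 grid, state/feature width
C, k_hat = 8, 49           # clusters and retained tokens per cluster

attention = L ** 2 * N     # O(L^2 N): global self-attention
mamba     = L * N          # O(L N):   one spatial Mamba block
clustered = C * k_hat * N  # after top-k_hat selection per cluster

print(f"attention / mamba = {attention / mamba:.0f}x")  # = L = 3136x
print(f"mamba / clustered = {mamba / clustered:.0f}x")  # = L/(C*k_hat) = 8x
```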

Ablation studies consistently confirm that spatial Mamba, and especially its clustering/sparse variants, achieve substantial savings over dense spatial processing, often—counterintuitively—improving classification performance by focusing on truly discriminative cues (Dewis et al., 22 Jan 2026, Dewis et al., 29 Jul 2025).

5. Empirical Performance in Vision, Remote Sensing, and Medical Imaging

Spatial-Mamba has demonstrated state-of-the-art or competitive performance across a diverse range of domains:

  • Standard Vision Benchmarks:
    • On ImageNet-1K: Spatial-Mamba-T/S/B deliver top-1 accuracy of 83.5%/84.6%/85.3% at 4.5G/7.1G/15.8G FLOPs, outperforming previous SSM-based models (Xiao et al., 2024).
    • On COCO (detection/segmentation): box AP (AP$^b$) up to 50.4 and mask AP (AP$^m$) up to 45.1.
  • Hyperspectral Image Classification:
    • CSSMamba with CSpaMamba achieves higher accuracy and improved boundary preservation over CNN, Transformer, and Mamba-only baselines on Pavia University, Indian Pines, and Liao-Ning 01 (Dewis et al., 22 Jan 2026).
    • IGroupSS-Mamba and 3DSS-Mamba attain OA > 98% across multiple benchmarks with model sizes orders of magnitude smaller than transformer counterparts (He et al., 2024).
  • Medical Imaging and MRI Reconstruction:
    • SRMA-Mamba's volumetric multi-plane scanning improves Dice by +1.15% at roughly 4× lower computation for pathological liver segmentation in MRI volumes (Zeng et al., 17 Aug 2025).
    • MMR-Mamba couples Mamba with spatial-frequency information fusion for multi-modal MRI reconstruction (Zou et al., 2024).
  • Point Clouds and 3D Object Detection:
    • ZigzagPointMamba, integrating spatial Mamba with semantic masking, improves ShapeNetPart mIoU by +1.59% over prior PointMamba (Diao et al., 27 May 2025).
    • UniMamba for LiDAR 3D detection achieves 70.2 mAP on nuScenes, setting new state-of-the-art (Jin et al., 15 Mar 2025).
  • Brain-Computer Interface (EEG decoding):
    • STMambaNet, employing spatial Mamba encoders, achieves 82.4% mean accuracy on BCI-IV-2a, exceeding all baselines (Yang et al., 2024).

6. Limitations, Design Trade-offs, and Future Directions

While the Spatial-Mamba paradigm offers substantial advances, several limitations and opportunities are evident:

  • Static vs. Adaptive Locality: Most implementations still use fixed neighborhoods (e.g., dilated convolutions); dynamic or content-adaptive neighborhood selection is a known area for improvement (Xiao et al., 2024).
  • Clustering and Token Routing: The quality of local partitioning (e.g., clustering) is sensitive to hyperparameters, and poor initialization or adaptation can degrade performance; modern variants stabilize training via EMA updates and regularization (Dewis et al., 22 Jan 2026).
  • Unidirectional Scanning and Information Limits: Standard spatial Mamba typically uses a single scan direction; bidirectional or multi-path strategies (e.g., complementary Z-order, multi-axis cross-scan) show further gains, especially in volumetric or 3D contexts (Jin et al., 15 Mar 2025, Zeng et al., 17 Aug 2025).
  • Configurability and Generalization: Spatial-Mamba modules are often composed with other modules (temporal, spectral, frequency), and optimal integration/fusion strategies remain under investigation.
  • Open Questions: How to best learn spatial neighborhoods, exploit domain structure (e.g., manifold topology in medical or remote-sensing data), and balance sparsity with expressivity.

Ongoing research explores: dynamic graph-based state fusion, spatial Mamba for video and long-sequence applications, adaptive cluster allocation, and integration with hierarchical SSM stacks (Xiao et al., 2024, Dewis et al., 22 Jan 2026).

7. Representative Architectures and Comparative Table

| Framework | Spatial Mamba Variant / Mechanism | Domain | Efficiency / OA | Key Innovations |
|---|---|---|---|---|
| CSSMamba (Dewis et al., 22 Jan 2026) | Cluster-guided, sparse token selection | HSI | OA ↑, sequence ↓ by $L/(C\hat{K})$ | Learnable clustering, hybrid attention |
| Spatial-Mamba (Xiao et al., 2024) | SASF via multi-dilated convolution | Vision | 83.5–85.3% (ImageNet) | Structure-aware state fusion |
| SDSpaM (STSMamba) (Dewis et al., 29 Jul 2025) | Sparse deformable selection | MODIS time series | 42× FLOP drop, OA ↑ | Cosine-angle scoring, K=4 |
| SRMA-Mamba (Zeng et al., 17 Aug 2025) | Volumetric multi-plane scan | MRI volume segmentation | Dice +1.15%, 4× compute drop | 3D state scans, reverse attention |
| MambaHSI (Li et al., 9 Jan 2025) | Full-image, pixel-level flattening | HSI | Linear $O(LD)$ compute | Global spatial mixing, SSFM |
| UniMamba (Jin et al., 15 Mar 2025) | SubConv3D, complementary Z-order, local/global | LiDAR 3D detection | 70.2 mAP, SOTA | Locality & global context fusion |
| ZigzagPointMamba (Diao et al., 27 May 2025) | Zigzag scan, semantic masking | Point cloud | mIoU +1.59% | Spatial continuity, hybrid mask |
| SFMFusion (Sun et al., 10 Nov 2025) | Multi-scale & frequency-enhanced block | MMIF | Ranked 1st or 2nd on 6 datasets | MMB, CEB, FEB + dynamic fusion |

Note: OA = Overall Accuracy; MMIF = Multi-modal image fusion; SOTA = state of the art.

References

  • "Clustering-Guided Spatial-Spectral Mamba for Hyperspectral Image Classification" (Dewis et al., 22 Jan 2026)
  • "Spatial-Mamba: Effective Visual State Space Models via Structure-aware State Fusion" (Xiao et al., 2024)
  • "MMR-Mamba: Multi-Modal MRI Reconstruction with Mamba and Spatial-Frequency Information Fusion" (Zou et al., 2024)
  • "Spatial-Temporal-Spectral Mamba with Sparse Deformable Token Sequence for Enhanced MODIS Time Series Classification" (Dewis et al., 29 Jul 2025)
  • "MambaHSI: Spatial-Spectral Mamba for Hyperspectral Image Classification" (Li et al., 9 Jan 2025)
  • "SRMA-Mamba: Spatial Reverse Mamba Attention Network for Pathological Liver Segmentation in MRI Volumes" (Zeng et al., 17 Aug 2025)
  • "UniMamba: Unified Spatial-Channel Representation Learning with Group-Efficient Mamba for LiDAR-based 3D Object Detection" (Jin et al., 15 Mar 2025)
  • "ZigzagPointMamba: Spatial-Semantic Mamba for Point Cloud Understanding" (Diao et al., 27 May 2025)
  • "Spatial-Frequency Enhanced Mamba for Multi-Modal Image Fusion" (Sun et al., 10 Nov 2025)
  • Additional sources as referenced within each section.