Spatial-Mamba: Efficient Spatial SSMs
- Spatial-Mamba is a state space model that integrates explicit spatial structure through selective state fusion, capturing both local and global dependencies in 2D/3D data.
- It employs mechanisms like clustering-guided token selection and multi-dilated convolution to achieve linear complexity and reduce computational load compared to traditional self-attention.
- Empirical results demonstrate that Spatial-Mamba achieves competitive performance across vision benchmarks, medical imaging, and point cloud tasks, often with significant efficiency gains.
Spatial-Mamba
Spatial-Mamba refers to a class of state space model (SSM) architectures, centered on the Mamba operator, that directly and efficiently model spatial dependencies in 2D (images), 3D (volumes or point clouds), or multi-dimensional signal domains. Unlike conventional transformer-based techniques that rely on self-attention and incur quadratic complexity in sequence length, Spatial-Mamba designs inject explicit spatial structure into the state-space recurrence—often via neighborhood-aware fusion, clustering, or multi-scale decomposition—yielding highly expressive yet computationally efficient models suitable for large-scale visual and spatio-temporal tasks.
1. Theoretical Foundations and Spatial SSM Formulation
Spatial-Mamba generalizes the selective SSM formalism to multidimensional arrays by augmenting the classical 1D state recurrence with mechanisms that capture local spatial dependencies. The canonical time-invariant linear SSM is defined by the discrete recurrence

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t,$$

where, in standard Mamba, the transition matrices $(\bar{A}, \bar{B}, C)$ are made input-dependent ("selective") and the input $x_t$ may be the embedding of a pixel, voxel, patch, or token. For spatial tasks, directly applying a flattened scan order is suboptimal, as it ignores adjacency and neighborhood structure, thereby distorting both local and global context (Xiao et al., 2024).
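For concreteness, the sketch below implements this recurrence as a naive per-token loop with input-dependent $\bar{B}_t$, $C_t$, and step size $\Delta_t$; the projection names, shapes, and state size are illustrative assumptions, and production Mamba kernels replace the loop with a fused parallel scan.

```python
import torch
import torch.nn.functional as F

def selective_scan(x, A, b_proj, c_proj, dt_proj):
    """Minimal, unoptimized selective-SSM scan over one token sequence.

    x: (L, D) token embeddings (e.g., pixels or patches flattened to 1D).
    A: (D, N) state-transition parameters (kept negative for stability).
    b_proj, c_proj, dt_proj: torch.nn.Linear layers producing the
    input-dependent B_t (N,), C_t (N,), and step size Delta_t (D,).
    """
    L, D = x.shape
    h = torch.zeros(D, A.shape[1])              # per-channel hidden state
    ys = []
    for t in range(L):
        dt = F.softplus(dt_proj(x[t]))          # Delta_t: (D,), positive
        B_t, C_t = b_proj(x[t]), c_proj(x[t])   # input-dependent B, C
        A_bar = torch.exp(dt.unsqueeze(1) * A)  # discretized A: (D, N)
        # h_t = A_bar * h_{t-1} + (Delta_t B_t) x_t
        h = A_bar * h + dt.unsqueeze(1) * B_t * x[t].unsqueeze(1)
        ys.append(h @ C_t)                      # y_t = C_t h_t: (D,)
    return torch.stack(ys)                      # (L, D)
```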
Spatial-Mamba innovates by introducing structure-aware state fusion (SASF):

$$\tilde{h}_t = \sum_{k \in \mathcal{N}} \alpha_k\, h_{\nu_k(t)},$$

where $\mathcal{N}$ indexes the spatial neighborhood around position $t$, $\nu_k(t)$ maps $t$ to a local neighbor of $t$, and the $\alpha_k$ are learnable weights. This fusion can be realized via multi-dilated depth-wise convolution over the 2D array of hidden states, enabling a single scan to capture both long-range dependencies and fine-grained local context in the image grid (Xiao et al., 2024).
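As a sketch of how SASF can be realized with multi-dilated depth-wise convolution (per Xiao et al., 2024), the module below fuses each hidden-state map with its neighbors at several dilation rates; the kernel size, dilation set, and residual summation are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class StructureAwareStateFusion(nn.Module):
    """Fuse 2D hidden-state maps with their spatial neighborhoods via
    parallel depth-wise convolutions at multiple dilation rates."""

    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d,
                      dilation=d, groups=channels, bias=False)
            for d in dilations
        ])

    def forward(self, h):  # h: (B, C, H, W) hidden states on the grid
        # Identity term keeps each state; dilated branches add local and
        # mid-range neighborhood context in a single pass.
        return h + sum(branch(h) for branch in self.branches)
```

For example, `StructureAwareStateFusion(96)(torch.randn(1, 96, 56, 56))` returns a fused state map of the same shape.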
2. Architectural Variants: Clustering, Decomposition, and Attention Mechanisms
Spatial-Mamba has seen extensive architectural evolution, including:
A. Clustering-Guided Spatial Mamba (CSpaMamba)
CSpaMamba adaptively decomposes the image into clusters using a learnable clustering module, with the following workflow (Dewis et al., 22 Jan 2026):
- Learnable Clustering Module: Prototypes are initialized and then updated by exponential moving average (EMA). Each token receives a soft membership (via a Gaussian kernel) and a hard cluster index for routing.
- Attention-Driven Token Selection: In each cluster, a hybrid importance score ranks tokens by a weighted combination of (i) dynamic attention and (ii) static cluster membership, and only the top-$k$ tokens are retained (see the sketch after this list).
- Two-Stage Local-Global Decomposition:
- Each cluster’s reduced sequence is processed by a local Mamba block (per-cluster).
- Cluster outputs are scattered back, summed, and passed through a global Mamba block over the full spatial grid.
- Sequence length is reduced from the full token count $N$ to the much smaller number of retained tokens, with a corresponding reduction in scan complexity.
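A minimal sketch of the routing-and-selection step is given below, assuming a Gaussian kernel with unit bandwidth and a simple convex combination of the two scores; the helper name and the scoring weights are hypothetical, and CSpaMamba's exact formulation is given in Dewis et al. (22 Jan 2026).

```python
import torch

def cluster_topk_select(tokens, prototypes, attn_scores, k, beta=0.5):
    """Route tokens to hard clusters, then keep the top-k per cluster
    by a hybrid (dynamic attention + static membership) score.

    tokens: (N, D); prototypes: (C, D); attn_scores: (N,).
    Returns a list with the kept token indices for each cluster.
    """
    d2 = torch.cdist(tokens, prototypes).pow(2)  # squared distances (N, C)
    member = torch.softmax(-d2, dim=1)           # soft memberships (N, C)
    hard = member.argmax(dim=1)                  # hard routing index (N,)
    kept = []
    for c in range(prototypes.shape[0]):
        idx = (hard == c).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue                             # empty cluster: skip
        score = beta * attn_scores[idx] + (1 - beta) * member[idx, c]
        kept.append(idx[score.topk(min(k, idx.numel())).indices])
    return kept
```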
B. Structure-Aware State Fusion and Multi-Scale Extensions
Spatial-Mamba introduces explicit spatial convolution in the state-fusion, typically through multi-dilated, depth-wise kernels, thus boosting local structure awareness (Xiao et al., 2024). Extensions also include channel grouping, multi-directional or multi-axis scans (across x/y/z or planes), and multi-scale decomposition through down/up-sampling in neural encoders (Jin et al., 15 Mar 2025, Sun et al., 10 Nov 2025).
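To illustrate what multi-directional scanning means in practice, the snippet below builds four flattening orders over an $H \times W$ grid (row-major, column-major, and their reverses); this four-way cross-scan is one common choice, and the papers above differ in their exact path sets.

```python
import numpy as np

def multi_axis_scan_orders(H, W):
    """Return four 1D visit orders over an H x W grid: row-major,
    column-major, and the reverse of each (a common cross-scan set)."""
    idx = np.arange(H * W).reshape(H, W)
    row = idx.reshape(-1)        # left-to-right, top-to-bottom
    col = idx.T.reshape(-1)      # top-to-bottom, left-to-right
    return [row, row[::-1], col, col[::-1]]
```

Each order flattens the grid into a different 1D sequence before the SSM scan; the per-direction outputs are typically summed or averaged.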
C. Integrative Mechanisms
Modern designs often fuse spatial Mamba with spectral (or channel/temporal) Mamba blocks, clustering, multi-head attention, and/or auxiliary modules (e.g., frequency-domain processing, cross-modal fusion) (Dewis et al., 22 Jan 2026, He et al., 2024, Sun et al., 10 Nov 2025). Integration is commonly performed by summing or concatenating the outputs of spatial and spectral branches, or by adaptive, attention-driven gating.
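The snippet below sketches one such integration pattern, an adaptive gate that mixes spatial- and spectral-branch outputs; the gate layout is an assumption for illustration, not any specific paper's fusion module.

```python
import torch
import torch.nn as nn

class GatedBranchFusion(nn.Module):
    """Adaptively mix two branch outputs with a learned sigmoid gate."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, spatial_out, spectral_out):  # both (B, L, D)
        g = self.gate(torch.cat([spatial_out, spectral_out], dim=-1))
        return g * spatial_out + (1.0 - g) * spectral_out
```

Summation and concatenation, mentioned above, are the simpler non-adaptive alternatives.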
3. Integration in Spatial-Spectral and Spatio-Temporal Frameworks
The spatial Mamba block constitutes the spatial branch in advanced spatial-spectral and spatio-temporal networks:
- Spatial-Spectral Mamba: Separate Mamba blocks process spatial tokens (across the 2D grid) and spectral tokens (across frequency bands), followed by residual fusion. In the MambaHSI architecture, SpaMB flattens the pixel grid into a sequence and models long-range spatial interactions at the pixel level, while SpeMB splits the embedding into spectral groups for cross-band processing; a fusion module integrates both (Li et al., 9 Jan 2025).
- Clustering-Guided Approaches: Clustering (e.g., CSSMamba) or deformable sparse selection (e.g., SDSpaM) further restrict the spatial Mamba’s attention to the most informative token subsets, often guided by data-driven criteria (proximity to a learned prototype, hybrid attention/static importance), drastically reducing computation while focusing on boundary or class-critical regions (Dewis et al., 22 Jan 2026, Dewis et al., 29 Jul 2025).
- Spatio-Temporal Extensions: In spatial-temporal (e.g., EEG, traffic, video) or spatial-temporal-graph models, the spatial Mamba encodes across spatial channels/locations at each time step (or frame), generating features that are fused or recursively processed by temporal Mamba layers (Yuan et al., 2024, Yang et al., 2024, Tang et al., 9 Jul 2025).
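A schematic composition of this spatial-then-temporal pattern follows; the `mixer_factory` placeholder stands in for a real Mamba block (which would mix along the middle, sequence, axis rather than position-wise), so the shapes and factory argument are illustrative only.

```python
import torch
import torch.nn as nn

class SpatialThenTemporal(nn.Module):
    """Run a spatial mixer over locations at each time step, then a
    temporal mixer over time at each location, as described above."""

    def __init__(self, dim, mixer_factory=lambda d: nn.Linear(d, d)):
        super().__init__()
        self.spatial = mixer_factory(dim)   # placeholder for a Mamba block
        self.temporal = mixer_factory(dim)  # placeholder for a Mamba block

    def forward(self, x):                   # x: (B, T, S, D)
        B, T, S, D = x.shape
        x = self.spatial(x.reshape(B * T, S, D)).reshape(B, T, S, D)
        x = x.transpose(1, 2)               # (B, S, T, D): expose time axis
        x = self.temporal(x.reshape(B * S, T, D)).reshape(B, S, T, D)
        return x.transpose(1, 2)            # back to (B, T, S, D)
```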
4. Computational Complexity and Efficiency Considerations
Spatial-Mamba attains marked efficiency advances over self-attention-based architectures:
- Complexity: For $N$ tokens of dimension $D$, spatial Mamba attains $O(ND)$ time and memory per block, unlike global attention's $O(N^2 D)$ scaling (see the back-of-envelope comparison after this list).
- Sequence Reduction: Adaptive token selection, clustering, and pruning (e.g., top-$k$ per cluster, sparse deformable sampling) further cut compute, yielding up to a 42× reduction in spatial-block FLOPs (SDSpaM) or more (Dewis et al., 22 Jan 2026, Dewis et al., 29 Jul 2025).
- Memory: Absence of global pairwise attention allows linear memory scaling even for full-sized images or 3D volumes.
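The back-of-envelope comparison below makes the scaling gap concrete, using illustrative constants (embedding width 96 and an assumed SSM state width of 16):

```python
def ssm_cost(n, d, state=16):
    """~O(N * D * state) multiply-adds for a linear-time SSM block."""
    return n * d * state

def attention_cost(n, d):
    """~O(N^2 * D) multiply-adds for global pairwise attention."""
    return n * n * d

for n in (196, 3136, 50176):  # 14x14, 56x56, and 224x224 token grids
    ratio = attention_cost(n, 96) / ssm_cost(n, 96)
    print(f"N={n}: attention is ~{ratio:.0f}x the SSM cost")
```

Under these assumptions the gap grows linearly in $N$: roughly 12× at 196 tokens but over 3000× at 50,176 tokens.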
Ablation studies consistently confirm that spatial Mamba, and especially its clustering/sparse variants, achieve substantial savings over dense spatial processing, often—counterintuitively—improving classification performance by focusing on truly discriminative cues (Dewis et al., 22 Jan 2026, Dewis et al., 29 Jul 2025).
5. Empirical Performance in Vision, Remote Sensing, and Medical Imaging
Spatial-Mamba has demonstrated state-of-the-art or competitive performance across a diverse range of domains:
- Standard Vision Benchmarks:
- On ImageNet-1K: Spatial-Mamba-T/S/B deliver top-1 accuracy of 83.5%/84.6%/85.3% at 4.5G/7.1G/15.8G FLOPs, outperforming previous SSM-based models (Xiao et al., 2024).
- On COCO (detection/segmentation): box AP up to 50.4 and mask AP up to 45.1 (Xiao et al., 2024).
- Hyperspectral Image Classification:
- CSSMamba with CSpaMamba achieves higher accuracy and improved boundary preservation over CNN, Transformer, and Mamba-only baselines on Pavia University, Indian Pines, and Liao-Ning 01 (Dewis et al., 22 Jan 2026).
- IGroupSS-Mamba and 3DSS-Mamba attain OA >98% (across multiple benchmarks) with model sizes orders of magnitude lower than transformer counterparts (He et al., 2024, He et al., 2024).
- Medical Imaging and MRI Reconstruction:
- Spatial Mamba contributes a 0.6 dB PSNR gain in MRI reconstruction (TCM module in MMR-Mamba) and, in SRMA-Mamba, delivers a +1.15% Dice improvement for liver segmentation while using only ~25% of the parameters of the prior state of the art (Zou et al., 2024, Zeng et al., 17 Aug 2025).
- Medical anomaly detection sees SP-Mamba outperforming MambaAD and SimSID in AUROC and F1, with much lower memory/computation (Pan et al., 25 Jul 2025).
- Point Clouds and 3D Object Detection:
- ZigzagPointMamba, integrating spatial Mamba with semantic masking, improves ShapeNetPart mIoU by +1.59% over prior PointMamba (Diao et al., 27 May 2025).
- UniMamba for LiDAR 3D detection achieves 70.2 mAP on nuScenes, setting new state-of-the-art (Jin et al., 15 Mar 2025).
- Brain-Computer Interface (EEG decoding):
- STMambaNet, employing spatial Mamba encoders, achieves 82.4% mean accuracy on BCI-IV-2a, exceeding all baselines (Yang et al., 2024).
6. Limitations, Design Trade-offs, and Future Directions
While the Spatial-Mamba paradigm offers substantial advances, several limitations and opportunities are evident:
- Static vs. Adaptive Locality: Most implementations still use fixed neighborhoods (e.g., dilated convolutions); dynamic or content-adaptive neighborhood selection is a known area for improvement (Xiao et al., 2024).
- Clustering and Token Routing: Quality of local partitioning (e.g., clustering) is sensitive to hyperparameters, and poor initialization or adaption can degrade performance; stabilization via EMA and regularization is used in modern variants (Dewis et al., 22 Jan 2026).
- Unidirectional Scanning and Information Limits: Standard spatial Mamba typically uses a single scan direction; bidirectional or multi-path strategies (e.g., complementary Z-order, multi-axis cross-scan) show further gains, especially in volumetric or 3D contexts (Jin et al., 15 Mar 2025, Zeng et al., 17 Aug 2025).
- Configurability and Generalization: Spatial-Mamba modules are often composed with other modules (temporal, spectral, frequency), and optimal integration/fusion strategies remain under investigation.
- Open Questions: How to best learn spatial neighborhoods, exploit domain structure (e.g., manifold topology in medical or remote-sensing data), and balance sparsity with expressivity.
Ongoing research explores: dynamic graph-based state fusion, spatial Mamba for video and long-sequence applications, adaptive cluster allocation, and integration with hierarchical SSM stacks (Xiao et al., 2024, Dewis et al., 22 Jan 2026).
7. Representative Architectures and Comparative Table
| Framework | Spatial Mamba Variant / Mechanism | Domain | Efficiency / OA | Key Innovations |
|---|---|---|---|---|
| CSSMamba (Dewis et al., 22 Jan 2026) | Cluster-guided, sparse token selection | HSI | OA ↑, seq. ↓ | Learnable clustering, hybrid attn |
| Spatial-Mamba (Xiao et al., 2024) | SASF via multi-dilated convolution | Vision | 83.5–85.3% (ImageNet) | Structure-aware state fusion |
| SDSpaM (STSMamba) (Dewis et al., 29 Jul 2025) | Sparse deformable selection | MODIS / time series | 42× FLOP drop, OA ↑ | Cosine-angle scoring, K=4 |
| SRMA-Mamba (Zeng et al., 17 Aug 2025) | Volumetric multi-plane scan | MRI vol seg | Dice +1.15%, 4× comp drop | 3D state scans, reverse attention |
| MambaHSI (Li et al., 9 Jan 2025) | Full-image, pixel-level flattening | HSI | Linear (O(LD)) comp. | Global spatial mixing, SSFM |
| UniMamba (Jin et al., 15 Mar 2025) | SubConv3D, comp. Z-order, local/global | LiDAR 3D det | 70.2 mAP, SOTA | Locality & global context fusion |
| ZigzagPointMamba (Diao et al., 27 May 2025) | Zigzag scan, semantic masking | Point cloud | mIoU +1.59% | Spatial continuity, hybrid mask |
| SFMFusion (Sun et al., 10 Nov 2025) | Multi-scale & freq. enhanced block | MMIF | Ranked 1st or 2nd, 6 datasets | MMB, CEB, FEB + dynamic fusion |
Note: OA = Overall Accuracy; MMIF = Multi-modal image fusion; SOTA = state of the art.
References
- "Clustering-Guided Spatial-Spectral Mamba for Hyperspectral Image Classification" (Dewis et al., 22 Jan 2026)
- "Spatial-Mamba: Effective Visual State Space Models via Structure-aware State Fusion" (Xiao et al., 2024)
- "MMR-Mamba: Multi-Modal MRI Reconstruction with Mamba and Spatial-Frequency Information Fusion" (Zou et al., 2024)
- "Spatial-Temporal-Spectral Mamba with Sparse Deformable Token Sequence for Enhanced MODIS Time Series Classification" (Dewis et al., 29 Jul 2025)
- "MambaHSI: Spatial-Spectral Mamba for Hyperspectral Image Classification" (Li et al., 9 Jan 2025)
- "SRMA-Mamba: Spatial Reverse Mamba Attention Network for Pathological Liver Segmentation in MRI Volumes" (Zeng et al., 17 Aug 2025)
- "UniMamba: Unified Spatial-Channel Representation Learning with Group-Efficient Mamba for LiDAR-based 3D Object Detection" (Jin et al., 15 Mar 2025)
- "ZigzagPointMamba: Spatial-Semantic Mamba for Point Cloud Understanding" (Diao et al., 27 May 2025)
- "Spatial-Frequency Enhanced Mamba for Multi-Modal Image Fusion" (Sun et al., 10 Nov 2025)
- Additional sources as referenced within each section.