
Sparse Deformable Mamba Models

Updated 8 February 2026
  • Sparse Deformable Mamba is a class of neural architectures that employs data-driven token relevance scoring for adaptive, sparsity-controlled sequencing across multiple feature domains.
  • It integrates state-space model (SSM)-based Mamba blocks with spatial, spectral, temporal, and geometric modules to efficiently process high-dimensional inputs with linear computational cost.
  • Experimental outcomes in hyperspectral classification, point cloud analysis, and multimodal registration demonstrate significant performance gains and reduced FLOPs.

Sparse Deformable Mamba refers to a class of neural architectures that integrate linear-time state-space models (SSMs)—notably, Mamba blocks—with learned, adaptive token sequencing mechanisms that enforce both sparsity and deformability in the input token order. Unlike traditional SSMs or transformers, these models reduce computational cost by focusing computation on a small, dynamically selected subset of tokens most relevant for the given task and modality. This paradigm has shown significant empirical success in hyperspectral image (HSI) classification, multimodal image registration, point cloud analysis, remote sensing change detection, and large-scale time-series learning (Xu et al., 13 Apr 2025, Dewis et al., 29 Jul 2025, Wen et al., 2024, Liu et al., 3 Dec 2025, Sun et al., 19 Sep 2025).

1. Sparse Deformable Sequencing: Principles and Formulations

Sparse Deformable Sequencing (SDS) is the core mechanism underlying Sparse Deformable Mamba. Rather than consuming static, dense token sequences (e.g., raster-scanned pixels or fixed-band orderings), SDS adaptively constructs token sequences by (a) calculating a data-driven relevance score for each candidate token relative to an anchor (central pixel, spectral band, or point), and (b) selecting only the top-ranked tokens as determined by a user-set sparsity ratio $\lambda \in (0,1)$. The process is repeated independently in various feature domains (spatial, spectral, temporal, geometric), and the resultant sequences are then processed by SSM-based Mamba blocks.

For example, in hyperspectral images, SDS computes for each $i$-th spatial token the angular distance to an anchor token:
$$\mathrm{SparseSpatialAttn}_i = \arccos\left( \frac{z_j^\top z_i}{\|z_j\|\,\|z_i\|} \right),$$
where $z_j$ is the anchor and $z_i$ is a candidate. The $m_{\rm spa} = \lceil \lambda HW \rceil$ most relevant tokens are selected, and their order is deformable (i.e., learned and data-adaptive). Spatial and spectral modules use analogous formulations, employing either cosine similarity or attention matrices for token scoring (Xu et al., 13 Apr 2025).
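The scoring-and-selection step above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' implementation; the function name `sparse_deformable_select` and the anchor-indexing convention are assumptions for the example.

```python
import numpy as np

def sparse_deformable_select(tokens, anchor_idx, lam):
    """Score tokens against an anchor by angular distance, keep the
    ceil(lam * N) most relevant, ordered by relevance (deformable order).
    Illustrative sketch only, not the published implementation."""
    N, d = tokens.shape
    anchor = tokens[anchor_idx]
    # cosine similarity between anchor z_j and each candidate z_i
    sims = tokens @ anchor / (np.linalg.norm(tokens, axis=1)
                              * np.linalg.norm(anchor) + 1e-12)
    scores = np.arccos(np.clip(sims, -1.0, 1.0))  # smaller angle = more relevant
    m = int(np.ceil(lam * N))                     # m_spa = ceil(lambda * HW)
    order = np.argsort(scores)[:m]                # data-adaptive sequence order
    return order, tokens[order]

rng = np.random.default_rng(0)
tokens = rng.standard_normal((100, 8))   # e.g. HW = 100 spatial tokens, d = 8
order, seq = sparse_deformable_select(tokens, anchor_idx=0, lam=0.25)
# keeps ceil(0.25 * 100) = 25 tokens; the anchor itself ranks first (angle ~ 0)
```

In the real architectures the relevance scores are computed on learned features, so the selection co-adapts with training rather than being fixed as here.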

This principle generalizes. In point clouds, offset-guided Gaussian reordering and resampling are employed to construct an optimal point sequence, where points are shifted spatially and/or in serialization order, and sequencing is performed via soft differentiable assignments (Liu et al., 3 Dec 2025). In temporal or multi-modal contexts, token selection operates on temporal slices or modality-specific features (Dewis et al., 29 Jul 2025, Wen et al., 2024). Across applications, SDS/SDMS mechanisms yield sparsity (computing on $K \ll N$ tokens) and deformability (sequences adapt during learning).
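A soft differentiable reordering of the kind used for point serialization can be sketched with a Gaussian soft-assignment matrix: each output slot is a softmax-weighted mixture of input points, which hardens to an exact sort as the bandwidth shrinks. The function name `gaussian_soft_reorder` and the specific kernel form are assumptions for illustration, not the DM3D formulation.

```python
import numpy as np

def gaussian_soft_reorder(points, scores, sigma=0.1):
    """Differentiable resequencing: output slot k is a Gaussian-weighted
    soft mixture of input points, centred on the k-th smallest score.
    Illustrative sketch of soft sorting, not the published method."""
    targets = np.sort(scores)                    # desired serialization order
    diff = targets[:, None] - scores[None, :]    # (N, N) slot-vs-point gaps
    logits = -diff**2 / (2 * sigma**2)
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)            # rows: soft assignments
    return P @ points                            # soft-sorted point sequence

rng = np.random.default_rng(1)
pts = rng.standard_normal((6, 3))                      # 6 points in 3-D
scores = np.array([0.3, -1.2, 0.8, 0.1, -0.5, 2.0])    # serialization keys
seq = gaussian_soft_reorder(pts, scores, sigma=1e-3)
# with a tiny sigma the soft sort coincides with the hard sort of the points
```

Because every step is smooth in `scores`, gradients flow through the ordering, which is what lets the serialization be learned end-to-end.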

2. Architectural Modules: Spatial, Spectral, Temporal, and Geometric Mamba

Sparse Deformable Mamba architectures instantiate the SDS principle in task-specific modules:

  • Sparse Deformable Spatial Mamba Module (SDSpaM): Processes a sparse, learned selection of spatial tokens through a Mamba block, then scatters the outputs back and adds them residually to the spatial feature map (e.g., $H \times W \times d$ for images) (Xu et al., 13 Apr 2025, Dewis et al., 29 Jul 2025).
  • Sparse Deformable Spectral Mamba Module (SDSpeM): Operates identically but along the spectral axis. Only the most relevant bands are processed (Xu et al., 13 Apr 2025).
  • Sparse Deformable Temporal Mamba Module (SDTM): Compresses long time-series by selecting only the most salient temporal slices per pixel or patch (Dewis et al., 29 Jul 2025).
  • Offset-guided Deformable Mamba Blocks in Geometry (DM3D): Utilize spatial offsets and soft sequence reordering to enable structure-adaptive point serialization in point cloud tasks (Liu et al., 3 Dec 2025).

All modules implement their SSM (Mamba) component in a way that FLOPs and memory scale only linearly with sequence length. The critical enabling mechanism is that the selection and reordering functions (whether based on cosine similarity, softmax attention, or Gaussian weights) are differentiable and trained end-to-end with the rest of the network.
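The gather/process/scatter pattern shared by these modules can be sketched as follows. The `ssm_scan` stand-in below is a trivial linear recurrence used only to show the linear-cost, stateful scan; a real Mamba block uses selective, input-dependent state dynamics. Function names and the residual convention are assumptions for the example.

```python
import numpy as np

def ssm_scan(seq, a=0.9):
    """Stand-in for a Mamba block: a plain linear state-space recurrence
    h_t = a * h_{t-1} + x_t, with cost linear in sequence length."""
    h = np.zeros(seq.shape[1])
    out = np.empty_like(seq)
    for t, x in enumerate(seq):
        h = a * h + x
        out[t] = h
    return out

def sdspam_block(feat, order):
    """Gather the selected tokens, run the (stand-in) SSM over the
    deformable sequence, scatter outputs back, add residually."""
    seq = feat[order]                        # gather: sparse, reordered tokens
    processed = ssm_scan(seq)
    out = feat.copy()
    out[order] = feat[order] + processed     # scatter + residual add
    return out

feat = np.ones((10, 4))                      # N = 10 tokens, d = 4
out = sdspam_block(feat, order=np.array([2, 0, 7]))
# unselected tokens pass through unchanged; selected ones get a residual update
```

Only the `len(order)` selected tokens pay the scan cost, which is where the $\mathcal O(\lambda N)$ savings come from.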

3. Attention-Based Feature Fusion and Task-Specific Integration

Fusion of modality-specific outputs is addressed via attention-based fusion layers that learn dynamic weighting between spatial, spectral, and/or temporal branches. For example, feature maps from SDSpaM and SDSpeM are fused via multi-head attention:
$$Q = Z'_j W_Q, \quad K = A'_j W_K, \quad V = A'_j W_V, \quad \mathrm{Attn}(Q,K,V) = \mathrm{Softmax}\left( \frac{Q K^\top}{\sqrt{d}} \right)V.$$
This produces a fused feature map for downstream classification or segmentation (Xu et al., 13 Apr 2025). The integration flexibility of SDS modules allows their combination with convolutional feature extractors (for local context), U-Nets (for global structure), or frequency-domain fusion (for complementary feature modulation).
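The cross-branch attention above, with queries from one branch and keys/values from the other, can be sketched single-headed in NumPy (the multi-head extension just splits $d$ into head-sized chunks). The function name `attention_fuse` and the random-weight setup are assumptions for the example.

```python
import numpy as np

def attention_fuse(z_spa, a_spe, W_q, W_k, W_v):
    """Fuse two branch feature maps: queries from the spatial branch,
    keys/values from the spectral branch (single head for clarity)."""
    Q, K, V = z_spa @ W_q, a_spe @ W_k, a_spe @ W_v
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)                  # scaled dot-product scores
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)       # row-wise softmax
    return attn @ V                                # fused feature map

rng = np.random.default_rng(2)
n, d = 16, 8
z_spa = rng.standard_normal((n, d))    # SDSpaM branch output (stand-in)
a_spe = rng.standard_normal((n, d))    # SDSpeM branch output (stand-in)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
fused = attention_fuse(z_spa, a_spe, W_q, W_k, W_v)   # shape (16, 8)
```

Because the softmax weights are input-dependent, the fusion layer can shift emphasis between branches per token rather than using a fixed mixing ratio.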

In multi-temporal remote sensing, deformable alignment modules (e.g., Bi-Temporal Deformable Alignment, BTDA) further ensure spatial/temporal consistency by predicting bounded offsets and modulation gates, followed by scale-sparse feature amplification to enhance change signals while suppressing noise (Sun et al., 19 Sep 2025).

4. Computational Efficiency and Theoretical Properties

Sparse Deformable Mamba yields significant computational advantages. The cost of SSM modules drops from $\mathcal O(N d^2)$ to $\mathcal O(\lambda N d^2)$ for sparsity ratio $\lambda$, with negligible impact on parameter count (Xu et al., 13 Apr 2025, Dewis et al., 29 Jul 2025). For instance, in HSI, a reduction from 416.23 M to 172.41 M FLOPs (roughly 60%) is observed, with accuracy improving rather than degrading (Xu et al., 13 Apr 2025). In time-series applications, sparsifying simultaneously along spatial, spectral, and temporal axes permits modeling of long sequences ($T \to K_T$, $M \to K_S$, $HW \to K_P$) with a 4–5× speedup and memory savings (Dewis et al., 29 Jul 2025).
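As a quick sanity check, the reported HSI figures are consistent with cost scaling as $\mathcal O(\lambda N d^2)$: the FLOPs ratio directly implies the effective sparsity.

```python
# FLOPs figures reported for the HSI setting (Xu et al., 13 Apr 2025)
dense_mflops, sparse_mflops = 416.23, 172.41

reduction = 1 - sparse_mflops / dense_mflops      # fraction of FLOPs removed
effective_lambda = sparse_mflops / dense_mflops   # implied sparsity ratio

# reduction ~ 0.586 (roughly the quoted 60%); effective_lambda ~ 0.414
```

Since parameter count is unchanged, the savings come entirely from processing fewer tokens per scan, not from a smaller model.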

Ablation studies show that the effective sparsity and deformability achieved by SDS are crucial—static or dense sequencing yields both higher computation and lower accuracy or worse detail preservation, particularly for rare or small classes.

5. Experimental Outcomes Across Domains

Sparse Deformable Mamba architectures have demonstrated superior empirical results across diverse tasks:

| Setting | Method | OA / F1 / mIoU | Key notes |
| --- | --- | --- | --- |
| HSI classification | SDMamba (Xu et al., 13 Apr 2025) | 99.44% OA | 100% accuracy for tiny classes |
| MODIS land cover | STSMamba (Dewis et al., 29 Jul 2025) | 97.59% OA | 16× sequence reduction |
| Point cloud | DM3D (Liu et al., 3 Dec 2025) | 93.76% OA (ModelNet40) | State-of-the-art; GKR + GDR crucial |
| Change detection | DC-Mamba (Sun et al., 19 Sep 2025) | F1: 0.5903; IoU: 0.4187 | Outperforms ChangeMamba baseline |
| Multi-modal registration | MambaReg (Wen et al., 2024) | Dice: 83.44 | Superior to prior SSM-based methods |

These gains derive from the synergy of reduced redundancy, adaptive tokenization, and stateful aggregation via SSMs. Notably, SDMamba and STSMamba models, when applied to hyperspectral and land cover datasets, surpass Transformer, CNN, and previous Mamba baselines, especially in preserving fine-grained or rare-class details and maintaining low computational cost (Xu et al., 13 Apr 2025, Dewis et al., 29 Jul 2025).

6. Extensions and Variants: Geometry, Registration, and Beyond

The sparse deformable sequencing paradigm has been extended to irregular domains (e.g., 3D point clouds), where serialization order is non-canonical and spatial neighborhoods are non-uniform (Liu et al., 3 Dec 2025). Techniques such as offset-guided KNN resampling and differentiable sorting (Gaussian-based) permit learned, data-adaptive serialization for SSM modules. Variants such as DC-Mamba introduce explicit alignment operations to handle geometric misalignment in temporal or multimodal imagery. Hybrid sparsity mechanisms (e.g., convolutional sparse coding with Mamba-based state recurrences) further extend applicability to medical image registration and related tasks (Wen et al., 2024).

7. Outlook and Theoretical Considerations

The demonstrated successes of Sparse Deformable Mamba models suggest a plausible broader trend: selective, adaptive sequencing provides a scalable alternative to dense, fixed, or heuristic tokenization in both regular and irregular data domains. The combination of differentiable sparsity and deformable ordering naturally complements the linear-time modeling of SSMs, avoids the quadratic bottlenecks of attention, and provides task- and context-sensitive feature selection. Current work has illuminated key gains in remote sensing, scientific imaging, point cloud understanding, and multimodal alignment.

Further theoretical analysis of the expressive power, sparsity–accuracy tradeoffs, and application-dependent requirements for sequence adaptivity remains an open and active area of research, as does integration with generative modeling and unsupervised learning frameworks (Xu et al., 13 Apr 2025, Dewis et al., 29 Jul 2025, Liu et al., 3 Dec 2025, Wen et al., 2024, Sun et al., 19 Sep 2025).
