
Deformable Token Mamba Block (DTMB)

Updated 25 August 2025
  • DTMB is an advanced module in state space models that forms and adapts token sequences for complex spatial and sequential data.
  • It integrates deformable token formation, centralized multi-directional scanning, and Gaussian decay weighting to enhance feature aggregation.
  • The design improves hyperspectral image classification and remote sensing tasks by offering efficient, center-focused multi-scale fusion.

The Deformable Token Mamba Block (DTMB) represents an advanced module within state space model (SSM)-based deep learning architectures—specifically within Mamba-based models—designed to produce dynamically structured and information-adaptive token sequences for spatial, spectral, and sequential data. Its principal objective is to improve feature aggregation, context adaptivity, and model efficiency by integrating deformable token formation, centralized and multi-directional scanning, adaptive weighting, and multi-stage feature fusion, with demonstrable benefits in hyperspectral image (HSI) classification and other remote sensing and vision tasks (Zhou et al., 20 May 2024).

1. Architectural Foundations and Core Components

At the heart of the DTMB is the Tokenized Mamba (T-Mamba) encoder, which operates on image-sequence inputs generated by the centralized Mamba-Cross-Scan (MCS) mechanism. The encoder incorporates the following distinct stages:

  • Input projection and directional splitting: Each spatial patch, after normalization and linear projection, is split into two directional streams (forward and backward), with both stream endpoints corresponding to the patch center.
  • Convolutional preparation and Mamba (S6) processing: Each direction passes first through 1D convolution and SiLU activation, then into the Mamba S6 operator, whose dynamic parameterization (notably matrices B and C) enables selective receptive field adaptation.
  • Gaussian Decay Mask (GDM): Applied post-Mamba, the GDM implements index- and feature-centric weights (a code sketch follows this list):

w_t = \exp\left( -\frac{1}{2} \left( \frac{t - T}{\sigma_{idx}} \right)^2 \right), \quad v_t = \exp\left( -\frac{1}{2} \left( \frac{\| f_t - f_T \|}{\sigma_{fea}} \right)^2 \right)

After normalization, the Hadamard product yields concentrated, center-weighted features:

\hat{s} = \tilde{z} \odot \operatorname{Norm}(W_{idx} \otimes W_{fea})

  • Semantic Token Learner (STL): Comprises sequential max- and average-pooling, channel attention, and projection-based aggregation:

m = \sigma(\operatorname{Conv1d}(\operatorname{Concat}(s_{max}, s_{ave})))

Semantic tokens are extracted by:

u = \operatorname{Softmax}((\hat{s} \odot U_1)^T) \odot (\hat{s} \odot U_2)

  • Semantic Token Fuser (STF): Fuses semantic tokens with adaptively pooled original features, using a channel-wise sigmoid gate:

\tilde{z}' = \sigma(\operatorname{AdaptivePool}(\operatorname{SiLU}(z)) \odot Z)

The fused representation:

\hat{u} = (\tilde{z}' \odot u) + \operatorname{SeqAttn}(\tilde{z}')

  • Merge operation: Directional sequences are merged to maintain positional information and ensure the prominence of the center pixel, followed by projection to re-align to the requisite feature space.
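
The Gaussian decay weighting above can be made concrete with a minimal PyTorch-style sketch. The tensor layout (batch, length, channels), the placement of the center token at the last position, and the max-based normalization are assumptions made for illustration; the function and parameter names (gaussian_decay_mask, sigma_idx, sigma_fea) are hypothetical rather than taken from the paper.

```python
import torch

def gaussian_decay_mask(z, sigma_idx=2.0, sigma_fea=1.0, eps=1e-6):
    """Center-focused Gaussian decay weighting (illustrative sketch).

    z: (B, T, C) token sequence whose last position is assumed to be the
    patch-center token. Returns the re-weighted sequence of the same shape.
    """
    B, T, C = z.shape
    t = torch.arange(T, device=z.device, dtype=z.dtype)

    # Index-centric weights: w_t = exp(-0.5 * ((t - T) / sigma_idx)^2)
    w_idx = torch.exp(-0.5 * ((t - (T - 1)) / sigma_idx) ** 2)        # (T,)

    # Feature-centric weights: v_t = exp(-0.5 * (||f_t - f_T|| / sigma_fea)^2)
    center = z[:, -1:, :]                                             # (B, 1, C)
    dist = torch.linalg.vector_norm(z - center, dim=-1)               # (B, T)
    w_fea = torch.exp(-0.5 * (dist / sigma_fea) ** 2)                 # (B, T)

    # Combine index and feature weights, normalize, and apply as a
    # Hadamard mask: s_hat = z * Norm(W_idx (x) W_fea)
    w = w_idx.unsqueeze(0) * w_fea                                    # (B, T)
    w = w / (w.max(dim=1, keepdim=True).values + eps)                 # assumed Norm(.)
    return z * w.unsqueeze(-1)                                        # (B, T, C)
```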

This design positions DTMB as a generalization of both deformable convolutions and state space sequence encoders, promoting enhanced adaptability to spatial, spectral, and contextual patterns inherent in complex imagery.
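
In the same spirit, the Semantic Token Learner can be sketched as follows, interpreting the products with U_1 and U_2 as learnable linear projections followed by matrix multiplication (a TokenLearner-style reading that is an assumption, not confirmed by the source); num_tokens is a hypothetical hyperparameter.

```python
import torch
import torch.nn as nn

class SemanticTokenLearner(nn.Module):
    """Illustrative Semantic Token Learner (STL) sketch."""

    def __init__(self, channels, num_tokens):
        super().__init__()
        # m = sigma(Conv1d(Concat(s_max, s_ave))): channel attention from pooled stats.
        self.channel_gate = nn.Sequential(
            nn.Conv1d(2, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        self.u1 = nn.Linear(channels, num_tokens, bias=False)
        self.u2 = nn.Linear(channels, channels, bias=False)

    def forward(self, s_hat):                                   # s_hat: (B, T, C)
        # Max- and average-pooling over the sequence dimension.
        s_max = s_hat.max(dim=1).values                         # (B, C)
        s_ave = s_hat.mean(dim=1)                               # (B, C)
        m = self.channel_gate(torch.stack([s_max, s_ave], dim=1))  # (B, 1, C)
        s = s_hat * m                                           # channel re-weighting

        # u = Softmax((s U1)^T) (s U2): (B, K, T) x (B, T, C) -> (B, K, C)
        attn = torch.softmax(self.u1(s).transpose(1, 2), dim=-1)   # (B, K, T)
        return attn @ self.u2(s)                                # semantic tokens
```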

2. Centralized Mamba-Cross-Scan (MCS) and Sequence Construction

The MCS mechanism transforms input patches into ordered sequences by emulating spatially continuous scan paths. It utilizes snake- or U-Turn scanning, avoiding discontinuities across patch edges:

  • Multi-directional scan types: Four scan directions (e.g., top-left to bottom-right) are defined, each generating forward and backward sub-sequences containing the center pixel as the anchor. The scan strategies are summarized below:

    MCS Type | Start-End Anchors        | Sequence Orientation
    Type 1   | Top-left ↔ Bottom-right  | Forward and reversed
    Type 2   | Top-right ↔ Bottom-left  | Bidirectional, mirrored
    ...      | ...                      | ...
  • Continuity and efficiency: This approach preserves spatial adjacency, reduces redundancy, and mitigates the boundary artifacts encountered in raster scanning—critically important for pixel-level, region-centric prediction tasks.

The MCS-generated sequences are individually processed in the T-Mamba encoder (DTMB), ensuring each spatial context is fully exploited from multiple geometric perspectives.
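
A small sketch of the centralized sequence construction may help: it builds a snake (U-turn) ordering over a square patch and splits it into forward and backward sub-sequences that both terminate at the center pixel. The helper names and the use of horizontal/vertical flips to realize the four scan types are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def u_turn_scan(patch_size, flip_lr=False, flip_ud=False):
    """Snake/U-turn scan indices for a square patch (illustrative sketch).

    Traverses the patch row by row, reversing direction on every other row so
    that consecutive indices stay spatially adjacent. flip_lr / flip_ud select
    the starting corner, giving the four MCS scan types.
    """
    idx = np.arange(patch_size * patch_size).reshape(patch_size, patch_size)
    if flip_lr:
        idx = np.fliplr(idx)
    if flip_ud:
        idx = np.flipud(idx)
    rows = [row if r % 2 == 0 else row[::-1] for r, row in enumerate(idx)]
    return np.concatenate(rows)

def center_split(order, patch_size):
    """Split a scan order into forward/backward sub-sequences that both end
    at the patch-center pixel, as in the centralized MCS."""
    center = (patch_size * patch_size) // 2          # flat index of the center pixel
    pos = int(np.where(order == center)[0][0])
    forward = order[: pos + 1]                       # start ... center
    backward = order[pos:][::-1]                     # end ... center (reversed)
    return forward, backward

# Example: the four scan types on a 5x5 patch.
for flip_lr in (False, True):
    for flip_ud in (False, True):
        order = u_turn_scan(5, flip_lr, flip_ud)
        fwd, bwd = center_split(order, 5)
        assert fwd[-1] == bwd[-1] == 12              # both sub-sequences end at the center
```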

3. Weighted Fusion and Multi-Scale Representation

Following sequence processing in each direction, the Weighted MCS Fusion (WMF) and multi-scale design act as the backbone for ensemble feature synthesis:

  • Weighted aggregation: The outputs from separate T-Mamba encoders (corresponding to each MCS type) are aggregated via learnable weights:

O_p^{(i, j)} = k_1 o_1^{(i, j)} + k_2 o_2^{(i, j)} + k_3 o_3^{(i, j)} + k_4 o_4^{(i, j)}

where \sum_o k_o = 1. This adaptive fusion allows the architecture to emphasize or suppress scan branches based on information utility.

  • Multi-scale loss: Each down-sampling stage contributes a scale-specific classification loss, with overall loss defined as:

\mathcal{L}_{total} = \frac{1}{n} \left( \mathcal{L}_p + \mathcal{L}_{p_2} + \ldots + \mathcal{L}_{p_n} \right)

This stratified supervision encourages progressive feature refinement and enhances local-global context sensitivity.

Such multi-level weighting and error attribution are crucial for robust learning, especially where local context may obscure class boundaries, as in HSI and dense remote sensing data.
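
The weighted fusion and multi-scale supervision can be sketched as below (PyTorch). Normalizing the learnable branch weights with a softmax is one way to satisfy the sum-to-one constraint, and cross-entropy is assumed for the scale-specific losses; neither choice is confirmed by the source.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedMCSFusion(nn.Module):
    """Illustrative sketch of the Weighted MCS Fusion (WMF)."""

    def __init__(self, num_branches=4):
        super().__init__()
        # Learnable logits, softmax-normalized so that k_o >= 0 and sum_o k_o = 1.
        self.logits = nn.Parameter(torch.zeros(num_branches))

    def forward(self, branch_outputs):
        # branch_outputs: list of per-branch feature maps o_1..o_4, same shape.
        k = torch.softmax(self.logits, dim=0)
        return sum(w * o for w, o in zip(k, branch_outputs))

def multi_scale_loss(logits_per_scale, target):
    """Average of scale-specific classification losses, mirroring L_total."""
    losses = [F.cross_entropy(lg, target) for lg in logits_per_scale]
    return sum(losses) / len(losses)
```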

4. Empirical Performance and Ablative Evidence

DTMB, situated within the Mamba-in-Mamba (MiM) architecture, has been rigorously evaluated against state-of-the-art frameworks including Vision Transformer (ViT), SpeFormer, MAEST, and HiT on benchmark HSI datasets (Indian Pines, Pavia University, Houston 2013):

  • Quantitative gains: MiM achieved superior overall accuracy (OA), average accuracy (AA), and Kappa coefficient under both fixed and disjoint train/test protocols.
  • Improved spatial coherence: The design yielded smoother classification maps with sharper land-cover boundaries and significant mitigation of over-smoothing or label confusion, even under training data scarcity.
  • Component ablation: Removal or isolation of STL, GDM, and STF resulted in measurable accuracy declines (as shown in controlled ablation studies), underscoring their necessity for information aggregation and center-focused learning.
  • Efficiency: The method achieves these benefits with reduced computational overhead compared to transformer-based approaches, owing to the linear scaling properties of SSMs and the selective processing of deformable token paths.

5. Relationships to Broader State Space and Transformer-based Methods

DTMB embodies successive advances in the integration of dynamic sequence encoding, attention-based aggregation, and spatial deformability:

  • Contrast with RNNs/Transformers: RNNs aggregate poorly around the center pixel, while transformer models impose heavy compute and data requirements in HSI settings; DTMB circumvents both limitations through efficient, concentrated aggregation on critical regions, guided by both geometric and feature-wise proximity.
  • Deformability as adaptation: The module’s combination of deformable sequence scanning, central weighting, and semantic condensation is directly inspired by the need to move beyond rigid, context-agnostic scanning, aligning with recent trends in content-adaptive representation learning.

This suggests a general strategy for encoding local and non-local dependencies in spatially structured data without incurring the quadratic cost of pure self-attention.

6. Application Domains and Prospects

DTMB is applicable across a range of tasks requiring spatially adaptive sequence modeling:

  • HSI classification and remote sensing: Its focus on centric aggregation and multi-directional scanning is particularly tailored for tasks such as land-cover mapping, agricultural landscape analysis, urban/rural boundary discrimination, and environmental surveillance.
  • Dense prediction and change detection: The multi-scale, deformable token generation framework is also suitable for change detection in time series imagery and fine-grained segmentation tasks.
  • Vision applications beyond remote sensing: The deformable and token-centric scheme is extendable to object detection, scene understanding, and few-shot learning. Its abstraction principles (center-focused Gaussian reweighting, scanning-based sequence formation, semantic fusing) offer a general recipe for efficient representation across vision domains demanding local and global context synergy (Zhou et al., 20 May 2024).

7. Conclusion

The Deformable Token Mamba Block (DTMB) consolidates advances in adaptive sequence construction, deformable contextual weighting, semantic condensation, and dynamic multi-branch fusion for sequence modeling in spatially structured data. Implemented as the T-Mamba encoder within the MiM framework, DTMB provides a principled, empirically validated architecture well suited for data domains where spatial, spectral, and sequential features are coupled and critical information is highly localized. The block’s modularity, efficiency, and performance gains position it as a central component in next-generation remote sensing and computer vision systems.

References (1)
