Spatial-Frequency Enhanced Mamba Fusion
- The paper introduces an innovative fusion strategy that combines state-space Mamba blocks with spatial-frequency augmentation to overcome CNN and Transformer limitations.
- This methodology employs a Vision State-Space Module and Frequency Selection Module to achieve linear-complexity global modeling and enhanced high-frequency detail retention.
- Empirical results across remote sensing, medical imaging, and generative modeling demonstrate superior efficiency and performance, validated by improved PSNR/SSIM metrics and reduced computation.
Spatial-Frequency Enhanced Mamba Fusion (SFMFusion) refers to a class of architectures that combine state-space modeling (Mamba blocks) with explicit spatial-frequency feature augmentation. These architectures address the limitations of traditional CNNs (limited receptive field) and Transformers (quadratic complexity) by providing linear-complexity global modeling and targeted frequency enhancement. SFMFusion has been instantiated for super-resolution, segmentation, multi-modal fusion, motion perception, change detection, and generative modeling across remote sensing, medical imaging, and general computer vision. Its core innovation is the coordinated fusion of spatial and frequency-domain representations—often via custom blocks that mine, refine, and merge spatial structures and frequency cues—within a scalable state-space framework.
1. Architectural Foundations and Key Components
All SFMFusion variants build on the selective state-space model (Mamba) as the primary global feature mixer. The canonical design is a multi-stream network, with each block fusing spatial and frequency (often Fourier) information. The typical computational graph consists of:
- Vision State-Space Module (VSSM): Implements linear-complexity global modeling, replacing self-attention with time/sequence recurrences. In 2D vision, selective scans yield global spatial dependencies efficiently.
- Frequency Selection Module (FSM): Applies an FFT (or DCT/wavelet transform) per channel, followed by small convolutional attention masks that amplify or suppress informative frequency bands. The preferred variant applies two 1×1 convolutions with GELU over the real/imaginary parts of the complex spectrum, outperforming ReLU gating and raw spectrum selection.
- Hybrid Gate Module (HGM): Applies pixel-wise gating; splits post-fusion features into two streams, applies channel and spatial attention, and fuses them with a learned mask for adaptive local bias re-injection.
- Learnable Scaling Adaptors: Per-block scalars (α_l, β_l) that mediate the residual connection strength, addressing conflicts between spatial/frequency streams.
The typical block-level pseudocode (for remote sensing SR) is:
```
y   = α_l * x_{l-1} + VSSM(LN(x_{l-1})) + FSM(LN(x_{l-1}))
x_l = β_l * y + HGM(LN(y)) + FSM(LN(y))
```
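A minimal PyTorch sketch of this block structure, assuming token-sequence inputs and treating VSSM, FSM, and HGM as injected submodules (whether the two FSM calls share weights, and the exact normalization placement, are assumptions here):

```python
import torch
import torch.nn as nn

class FrequencyMambaBlock(nn.Module):
    """Sketch of one SFMFusion block: two residual stages with
    learnable scaling adaptors (alpha_l, beta_l) mediating each sum."""
    def __init__(self, dim, vssm, fsm, hgm):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.vssm, self.fsm, self.hgm = vssm, fsm, hgm
        # per-block learnable scalars for the residual connections
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x):                       # x: (B, L, C) token sequence
        u = self.norm1(x)
        y = self.alpha * x + self.vssm(u) + self.fsm(u)
        v = self.norm2(y)
        return self.beta * y + self.hgm(v) + self.fsm(v)
```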
2. Mathematical Formulations and Fusion Strategies
State-Space Modeling
The core SSM (as realized in Mamba) evolves hidden states as

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),$$

where $A$, $B$, $C$ are learnable and $B$, $C$ (along with the step size $\Delta$) may be input-dependent for "selective" adaptation. Discretization gives the recurrence $h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t$ with $\bar{A} = \exp(\Delta A)$. In 2D vision, selective scanning (along spatial axes or directions) yields linear-time $\mathcal{O}(L)$ mixing over $L$ tokens, compared to $\mathcal{O}(L^2)$ for self-attention.
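A naive sequential realization of this recurrence, shown for a single channel with a diagonal state matrix (illustrative shapes; real implementations replace the Python loop with a hardware-aware parallel scan):

```python
import torch

def selective_scan(x, A, B, C, delta):
    """Discretized SSM recurrence: h_t = exp(delta_t * A) * h_{t-1}
    + delta_t * B_t * x_t, then y_t = <C_t, h_t>.
    x: (L,) input sequence; A: (N,) diagonal transition;
    B, C: (L, N) input-dependent ("selective"); delta: (L,) step sizes."""
    L, N = B.shape
    h = torch.zeros(N)
    ys = []
    for t in range(L):
        a_bar = torch.exp(delta[t] * A)          # (N,) discretized transition
        h = a_bar * h + delta[t] * B[t] * x[t]   # linear-time state update
        ys.append((C[t] * h).sum())              # readout y_t
    return torch.stack(ys)

# toy usage: negative A keeps the recurrence stable
L, N = 16, 4
y = selective_scan(torch.randn(L), -torch.rand(N),
                   torch.randn(L, N), torch.randn(L, N), torch.rand(L))
```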
Frequency Feature Mining
FSM applies a channel-wise 2D FFT, $X_F = \mathcal{F}(X)$. The complex spectrum, with real and imaginary parts treated as channels, is then filtered by the learned two-layer map

$$\hat{X}_F = \mathrm{Conv}_{1\times1}\big(\mathrm{GELU}(\mathrm{Conv}_{1\times1}(X_F))\big).$$

Finally, an inverse FFT returns to the spatial domain: $X' = \mathcal{F}^{-1}(\hat{X}_F)$. The learned convolutions implement soft frequency masks for spectrum selection.
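A sketch of the FSM under this description, using torch.fft; the exact conv widths and spectrum normalization are assumptions:

```python
import torch
import torch.nn as nn

class FrequencySelectionModule(nn.Module):
    """Sketch of an FSM: per-channel 2D FFT, two 1x1 convs with GELU
    over the stacked real/imaginary parts, then inverse FFT."""
    def __init__(self, dim):
        super().__init__()
        self.filter = nn.Sequential(               # learned soft frequency mask
            nn.Conv2d(2 * dim, 2 * dim, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(2 * dim, 2 * dim, kernel_size=1),
        )

    def forward(self, x):                           # x: (B, C, H, W)
        xf = torch.fft.fft2(x, norm="ortho")        # complex spectrum
        z = torch.cat([xf.real, xf.imag], dim=1)    # (B, 2C, H, W)
        z = self.filter(z)                          # amplify/suppress bands
        real, imag = z.chunk(2, dim=1)
        xf = torch.complex(real, imag)
        return torch.fft.ifft2(xf, norm="ortho").real  # back to spatial domain
```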
Hybrid Gating
Given concatenated features $F$, split channel-wise into $F_1, F_2$:
- $F_1$: channel/coordinate features → Conv, DWConv, channel attention
- $F_2$: linear per-pixel mix with GELU, producing a gate mask $M$

Fusion: $M \odot F_1$, so the mask gates local, spatially varying activations.
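A sketch of the HGM along these lines (branch details such as kernel sizes and the channel-attention form are assumptions):

```python
import torch
import torch.nn as nn

class HybridGateModule(nn.Module):
    """Sketch of the HGM: split features into two halves, run a local
    conv/channel-attention branch (F1) and a per-pixel gating branch (F2),
    then fuse by elementwise masking. Requires even `dim`."""
    def __init__(self, dim):
        super().__init__()
        half = dim // 2
        self.local = nn.Sequential(                 # F1: local structure
            nn.Conv2d(half, half, 1),
            nn.Conv2d(half, half, 3, padding=1, groups=half),  # depthwise conv
        )
        self.ca = nn.Sequential(                    # channel attention on F1
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(half, half, 1),
            nn.Sigmoid(),
        )
        self.gate = nn.Sequential(                  # F2: per-pixel gate mask
            nn.Conv2d(half, half, 1),
            nn.GELU(),
        )
        self.proj = nn.Conv2d(half, dim, 1)         # restore channel count

    def forward(self, f):                           # f: (B, C, H, W)
        f1, f2 = f.chunk(2, dim=1)
        f1 = self.local(f1) * self.ca(f1)           # channel-attended local feats
        return self.proj(self.gate(f2) * f1)        # mask gates F1 pixel-wise
```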
Scaling Adaptors
In each FMB, features are multiplied by the block-specific scalars (α_l, β_l) before each residual addition, both pre- and post-merge. Ablations show that directly summing the streams without scaling impairs overall performance.
3. Domain-Specific Instantiations
SFMFusion is adapted across several domains:
- Remote Sensing Super-Resolution: Frequency-assisted blocks yield superior PSNR/SSIM at 19–28% of the memory/FLOPs of Transformer baselines (Xiao et al., 8 May 2024). The model structure accommodates large RSIs (>512×512) with linear scaling.
- Medical Image SR: SFMFusion combines Gated Attention-enhanced SSMs and Pyramid Frequency Fusion. Explicit multi-scale high-frequency reinjection produces sharper anatomical structures (e.g., vessel/tissue boundaries), outperforming CNN+ViT hybrids with only 0.74M params (Huang et al., 31 Oct 2025).
- 3D Medical Segmentation: Symmetry-driven dual-branch blocks exploit the conjugate symmetry of the FFT, reducing frequency-domain overhead by half (see the rfft sketch after this list). Multi-directional 3D scanning (slice, cross-slice, local3D) realizes Mamba's long-range modeling within the frequency domain (Zhang et al., 5 Aug 2025).
- Multi-modal Fusion: Three-branch architectures couple image reconstruction (IR) and fusion simultaneously. SFMB combines multi-scale Mamba, channel attention, and frequency enhancement. Adaptive fusion (DFMB) deploys learned spatial masks to dynamically weigh each IR stream’s contribution (Sun et al., 10 Nov 2025).
- Change Detection and Motion Perception: Joint spatio-frequency fusion blocks merge log-amplitude spectrum and spatial difference channels. In Vcamba, spatial-frequency motion fusion integrates dual-domain cues via sequence concatenation/cross-merging for temporal segmentation (Wijenayake et al., 11 Aug 2025, Li et al., 31 Jul 2025).
- Generative Modeling: DiMSUM uses parallel spatial/wavelet-Mamba streams, cross-attention fusion, and shared transformers. Haar-based multi-level DWT optimizes local/global frequency capture, improving order-aware generation and training convergence (Phung et al., 6 Nov 2024).
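The conjugate-symmetry trick noted in the 3D segmentation bullet can be seen directly with torch.fft.rfft2: for real-valued inputs, only about half of the spectrum needs to be stored or processed, and the round trip is lossless (a minimal illustration, not the paper's code):

```python
import torch

x = torch.randn(1, 8, 64, 64)                # real-valued feature map
full = torch.fft.fft2(x)                     # (1, 8, 64, 64) complex
half = torch.fft.rfft2(x)                    # (1, 8, 64, 33): W//2+1 columns
recon = torch.fft.irfft2(half, s=x.shape[-2:])
assert torch.allclose(recon, x, atol=1e-5)   # lossless for real inputs
```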
4. Computational Complexity and Efficiency
SFMFusion maintains linear complexity in the spatial dimension, realized by selective scan state-space blocks:
- VSSM: $\mathcal{O}(HWC)$ per block for channel dimension $C$ and spatial size $H \times W$
- FSM: $\mathcal{O}(CHW \log(HW))$ for the FFT/IFFT

Reported experimental setups validate small memory footprints:
- RS super-resolution: 11.76M params, 128G FLOPs, 46MB peak GPU memory, 100ms/image (Xiao et al., 8 May 2024)
- Medical SR: 0.72–0.74M params for all five modalities (Huang et al., 31 Oct 2025)
Compared to CNN and Transformer baselines (e.g., HAT-L, LBNET), SFMFusion consistently achieves a superior accuracy/efficiency trade-off (e.g., a 0.11 dB PSNR gain at 19% of the compute and 28% of the memory; Xiao et al., 8 May 2024).
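A back-of-envelope comparison of token-mixing cost makes the linear-vs-quadratic gap concrete (the state size d_state = 16 is an assumed value, not from the sources):

```python
# Token-mixing cost for a 512x512 feature map with C = 64 channels.
H = W = 512; C = 64; d_state = 16
L = H * W                                   # 262,144 tokens
attn = L**2 * C                             # self-attention: O(L^2 * C)
mamba = L * C * d_state                     # selective scan: O(L * C * N)
print(f"attention ~{attn:.2e} ops, mamba ~{mamba:.2e} ops, "
      f"ratio ~{attn / mamba:.0f}x")        # ratio = L / d_state = 16,384x
```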
5. Training Protocols and Loss Functions
Typical loss functions:
- Super-resolution: pixel-wise $L_1$ loss (sometimes MSE), without perceptual or adversarial terms (Xiao et al., 8 May 2024, Huang et al., 31 Oct 2025)
- Segmentation/Classification: Cross-entropy for class labels, sometimes augmented with specialized terms (e.g., the SeK loss; Wijenayake et al., 11 Aug 2025)
- Fusion: Weighted sum of fusion loss and IR branch reconstruction losses, plus gradient-based edge terms (Sun et al., 10 Nov 2025)
- Image Restoration: Frequency loss terms can be included (amplitude/phase spectrum matching) for further enhancement (Zhen et al., 15 Apr 2024)
Optimizers: Adam/AdamW (typically β₁ = 0.9, β₂ = 0.999); learning rates decayed over epochs or iterations. Hardware: PyTorch on a single RTX 3090, with batch sizes of 4–8 for large images.
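A minimal training setup consistent with this protocol (the stand-in model, the 2e-4 starting learning rate, and the cosine schedule are assumptions; the Adam betas, batch size, and L1 loss are from the text):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)  # placeholder for an SFMFusion variant
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=500_000)

for step in range(3):                   # sketch of the update loop
    x = torch.randn(4, 3, 64, 64)       # batch of 4, per the reported setups
    loss = (model(x) - x).abs().mean()  # L1 loss, as used for SR
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```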
6. Experimental Benchmarks and Quantitative Results
Representative metrics (all verbatim from sources):
| Task | Dataset | SFMFusion Metric | Baseline | Params/FLOPs/Memory |
|---|---|---|---|---|
| SR (RSI) | AID/DOTA/DIOR | PSNR 31.98 dB / SSIM 0.83 | HAT-L (−0.11 dB) | 11.76M / 128G / 46MB |
| Medical SR | US/OCT/CT/MRI/ES | PSNR 20.98–38.13 dB | LBNET, LGSR | 0.72–0.74M |
| 3D Seg | BraTS2023 | WT Dice 94.69 / HD 3.41 | SegMamba | — |
| Change Detection | SECOND/Landsat | OA 88.62% / 96.25% | — | — |
| Motion CamO | MoCA-MASK/CAD2016 | mIoU 0.369 / 0.509 | SLT-Net/EMIP | 10.88G MACs |
| Multi-Modal Fusion | MSRS IVIF | MI 3.0 / VIF 0.87 | — | 0.7M / 439G |
| Gen. Model | CelebA-256 | FID 4.62 / recall 0.52 | DiT/LFM | 460M |
Ablation studies support the necessity of frequency streams, learnable scaling/gating, and hybrid blocks. Removing frequency branches or adaptive fusion uniformly degrades performance. For instance, semantic change detection shows a –1.26% and –1.47% SeK drop without FFT features (Wijenayake et al., 11 Aug 2025).
7. Significance, Domain Transfer, and Open Directions
SFMFusion establishes a paradigm for integrating spectral feature augmentation into linearly-scalable state-space frameworks. The combination of frequency selection (via FFT, DCT, or wavelet), spatial modeling (selective scan), and adaptive gating/fusion allows for efficient, interpretable mixing of global and local cues.
Notable advantages:
- Retains global context for large-scale images without quadratic cost.
- Explicitly restores high-frequency content, improving fine detail recovery.
- Adaptive fusion/gating mechanisms reconcile conflicting information sources.
Current research extends SFMFusion patterns to generative diffusion models with parallel wavelet-based and spatial scans, cross-attention fusion, and global transformers (Phung et al., 6 Nov 2024). Further possible directions include generalized SSM backbones, advanced spectral bases (beyond Haar/DCT), and dynamic fusion schemes for multi-modal or multi-scale tasks.
SFMFusion now underpins state-of-the-art methods across remote sensing, medical imaging, multi-modal fusion, motion perception, and image synthesis, setting new Pareto-optimal points in accuracy vs. efficiency. Its modular design facilitates adaptation wherever spatial-frequency complementarity and linear computation are essential.