Spatial-Frequency Enhanced Mamba Fusion
- The paper introduces an innovative fusion strategy that combines state-space Mamba blocks with spatial-frequency augmentation to overcome CNN and Transformer limitations.
- This methodology employs a Vision State-Space Module and Frequency Selection Module to achieve linear-complexity global modeling and enhanced high-frequency detail retention.
- Empirical results across remote sensing, medical imaging, and generative modeling demonstrate superior efficiency and performance, validated by improved PSNR/SSIM metrics and reduced computation.
Spatial-Frequency Enhanced Mamba Fusion (SFMFusion) refers to a class of architectures that combine state-space modeling (Mamba blocks) with explicit spatial-frequency feature augmentation. These architectures address the limitations of traditional CNNs (limited receptive field) and Transformers (quadratic complexity) by providing linear-complexity global modeling and targeted frequency enhancement. SFMFusion has been instantiated for super-resolution, segmentation, multi-modal fusion, motion perception, change detection, and generative modeling across remote sensing, medical imaging, and general computer vision. Its core innovation is the coordinated fusion of spatial and frequency-domain representations—often via custom blocks that mine, refine, and merge spatial structures and frequency cues—within a scalable state-space framework.
1. Architectural Foundations and Key Components
All SFMFusion variants build on the selective state-space model (Mamba) as the primary global feature mixer. The canonical design is a multi-stream network, with each block fusing spatial and frequency (often Fourier) information. The typical computational graph consists of:
- Vision State-Space Module (VSSM): Implements linear-complexity global modeling, replacing self-attention with time/sequence recurrences. In 2D vision, selective scans yield global spatial dependencies efficiently.
- Frequency Selection Module (FSM): Applies an FFT (or DCT/wavelet transform) per channel, followed by small convolutional attention masks that amplify or suppress informative frequency bands. The preferred variant applies two 1×1 convolutions with GELU over the real/imaginary parts of the complex spectrum, outperforming ReLU gating and raw spectrum selection.
- Hybrid Gate Module (HGM): Applies pixel-wise gating; splits post-fusion features into two streams, applies channel and spatial attention, and fuses them with a learned mask for adaptive local bias re-injection.
- Learnable Scaling Adaptors: Per-block scalars (α_l, β_l) that mediate the residual connection strength, addressing conflicts between spatial/frequency streams.
The typical block-level pseudocode (for remote sensing SR) is:
```
y   = α_l * x_{l-1} + VSSM(LN(x_{l-1})) + FSM(LN(x_{l-1}))
x_l = β_l * y + HGM(LN(y)) + FSM(LN(y))
```
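A minimal PyTorch sketch of this block structure, assuming token-sequence inputs and treating VSSM, FSM, and HGM as injected submodules (whether the two FSM calls share weights, and the exact normalization placement, are assumptions here):

```python
import torch
import torch.nn as nn

class FrequencyMambaBlock(nn.Module):
    """Sketch of one SFMFusion block: two residual stages with
    learnable scaling adaptors (alpha_l, beta_l) mediating each sum."""
    def __init__(self, dim, vssm, fsm, hgm):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.vssm, self.fsm, self.hgm = vssm, fsm, hgm
        # per-block learnable scalars for the residual connections
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x):                       # x: (B, L, C) token sequence
        u = self.norm1(x)
        y = self.alpha * x + self.vssm(u) + self.fsm(u)
        v = self.norm2(y)
        return self.beta * y + self.hgm(v) + self.fsm(v)
```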
2. Mathematical Formulations and Fusion Strategies
State-Space Modeling
The core SSM (as realized in Mamba) evolves hidden states as

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),$$

where $A$, $B$, $C$ are learnable and $B$, $C$ (along with the step size $\Delta$) may be input-dependent for "selective" adaptation. Discretization gives the recurrence $h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t$ with $\bar{A} = \exp(\Delta A)$. In 2D vision, selective scanning (along spatial axes or directions) yields linear-time $\mathcal{O}(L)$ mixing over $L$ tokens, compared to $\mathcal{O}(L^2)$ for self-attention.
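A naive sequential realization of this recurrence, shown for a single channel with a diagonal state matrix (illustrative shapes; real implementations replace the Python loop with a hardware-aware parallel scan):

```python
import torch

def selective_scan(x, A, B, C, delta):
    """Discretized SSM recurrence: h_t = exp(delta_t * A) * h_{t-1}
    + delta_t * B_t * x_t, then y_t = <C_t, h_t>.
    x: (L,) input sequence; A: (N,) diagonal transition;
    B, C: (L, N) input-dependent ("selective"); delta: (L,) step sizes."""
    L, N = B.shape
    h = torch.zeros(N)
    ys = []
    for t in range(L):
        a_bar = torch.exp(delta[t] * A)          # (N,) discretized transition
        h = a_bar * h + delta[t] * B[t] * x[t]   # linear-time state update
        ys.append((C[t] * h).sum())              # readout y_t
    return torch.stack(ys)

# toy usage: negative A keeps the recurrence stable
L, N = 16, 4
y = selective_scan(torch.randn(L), -torch.rand(N),
                   torch.randn(L, N), torch.randn(L, N), torch.rand(L))
```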
Frequency Feature Mining
FSM applies a channel-wise 2D FFT, $X_F = \mathcal{F}(X)$. The complex spectrum, with real and imaginary parts treated as channels, is then filtered by the learned two-layer map

$$\hat{X}_F = \mathrm{Conv}_{1\times1}\big(\mathrm{GELU}(\mathrm{Conv}_{1\times1}(X_F))\big).$$

Finally, an inverse FFT returns to the spatial domain: $X' = \mathcal{F}^{-1}(\hat{X}_F)$. The learned convolutions implement soft frequency masks for spectrum selection.
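A sketch of the FSM under this description, using torch.fft; the exact conv widths and spectrum normalization are assumptions:

```python
import torch
import torch.nn as nn

class FrequencySelectionModule(nn.Module):
    """Sketch of an FSM: per-channel 2D FFT, two 1x1 convs with GELU
    over the stacked real/imaginary parts, then inverse FFT."""
    def __init__(self, dim):
        super().__init__()
        self.filter = nn.Sequential(               # learned soft frequency mask
            nn.Conv2d(2 * dim, 2 * dim, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(2 * dim, 2 * dim, kernel_size=1),
        )

    def forward(self, x):                           # x: (B, C, H, W)
        xf = torch.fft.fft2(x, norm="ortho")        # complex spectrum
        z = torch.cat([xf.real, xf.imag], dim=1)    # (B, 2C, H, W)
        z = self.filter(z)                          # amplify/suppress bands
        real, imag = z.chunk(2, dim=1)
        xf = torch.complex(real, imag)
        return torch.fft.ifft2(xf, norm="ortho").real  # back to spatial domain
```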
Hybrid Gating
Given concatenated features $F$, split channel-wise into $F_1, F_2$:
- $F_1$: channel/coordinate features → Conv, DWConv, channel attention
- $F_2$: linear per-pixel mix with GELU, producing a gate mask $M$

Fusion: $M \odot F_1$, so the mask gates local, spatially varying activations.
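A sketch of the HGM along these lines (branch details such as kernel sizes and the channel-attention form are assumptions):

```python
import torch
import torch.nn as nn

class HybridGateModule(nn.Module):
    """Sketch of the HGM: split features into two halves, run a local
    conv/channel-attention branch (F1) and a per-pixel gating branch (F2),
    then fuse by elementwise masking. Requires even `dim`."""
    def __init__(self, dim):
        super().__init__()
        half = dim // 2
        self.local = nn.Sequential(                 # F1: local structure
            nn.Conv2d(half, half, 1),
            nn.Conv2d(half, half, 3, padding=1, groups=half),  # depthwise conv
        )
        self.ca = nn.Sequential(                    # channel attention on F1
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(half, half, 1),
            nn.Sigmoid(),
        )
        self.gate = nn.Sequential(                  # F2: per-pixel gate mask
            nn.Conv2d(half, half, 1),
            nn.GELU(),
        )
        self.proj = nn.Conv2d(half, dim, 1)         # restore channel count

    def forward(self, f):                           # f: (B, C, H, W)
        f1, f2 = f.chunk(2, dim=1)
        f1 = self.local(f1) * self.ca(f1)           # channel-attended local feats
        return self.proj(self.gate(f2) * f1)        # mask gates F1 pixel-wise
```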
Scaling Adaptors
In each FMB, features are multiplied by the block-specific scalars (α_l, β_l) before each residual addition, both pre- and post-merge. Ablations show that directly summing the streams without scaling impairs overall performance.
3. Domain-Specific Instantiations
SFMFusion is adapted across several domains:
- Remote Sensing Super-Resolution: Frequency-assisted blocks yield superior PSNR/SSIM at 19–28% of the memory/FLOPs of Transformer baselines (Xiao et al., 8 May 2024). The model structure accommodates large RSIs (>512×512) with linear scaling.
- Medical Image SR: SFMFusion combines Gated Attention-enhanced SSMs and Pyramid Frequency Fusion. Explicit multi-scale high-frequency reinjection produces sharper anatomical structures (e.g., vessel/tissue boundaries), outperforming CNN+ViT hybrids with only 0.74M params (Huang et al., 31 Oct 2025).
- 3D Medical Segmentation: Symmetry-driven dual-branch blocks exploit the conjugate symmetry of the FFT, reducing frequency-domain overhead by half (see the rfft sketch after this list). Multi-directional 3D scanning (slice, cross-slice, local3D) realizes Mamba's long-range modeling within the frequency domain (Zhang et al., 5 Aug 2025).
- Multi-modal Fusion: Three-branch architectures couple image reconstruction (IR) and fusion simultaneously. SFMB combines multi-scale Mamba, channel attention, and frequency enhancement. Adaptive fusion (DFMB) deploys learned spatial masks to dynamically weigh each IR stream’s contribution (Sun et al., 10 Nov 2025).
- Change Detection and Motion Perception: Joint spatio-frequency fusion blocks merge log-amplitude spectrum and spatial difference channels. In Vcamba, spatial-frequency motion fusion integrates dual-domain cues via sequence concatenation/cross-merging for temporal segmentation (Wijenayake et al., 11 Aug 2025, Li et al., 31 Jul 2025).
- Generative Modeling: DiMSUM uses parallel spatial/wavelet-Mamba streams, cross-attention fusion, and shared transformers. Haar-based multi-level DWT optimizes local/global frequency capture, improving order-aware generation and training convergence (Phung et al., 6 Nov 2024).
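The conjugate-symmetry trick noted in the 3D segmentation bullet can be seen directly with torch.fft.rfft2: for real-valued inputs, only about half of the spectrum needs to be stored or processed, and the round trip is lossless (a minimal illustration, not the paper's code):

```python
import torch

x = torch.randn(1, 8, 64, 64)                # real-valued feature map
full = torch.fft.fft2(x)                     # (1, 8, 64, 64) complex
half = torch.fft.rfft2(x)                    # (1, 8, 64, 33): W//2+1 columns
recon = torch.fft.irfft2(half, s=x.shape[-2:])
assert torch.allclose(recon, x, atol=1e-5)   # lossless for real inputs
```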
4. Computational Complexity and Efficiency
SFMFusion maintains linear complexity in the spatial dimension, realized by selective scan state-space blocks:
- VSSM: $\mathcal{O}(HWC)$ per block for channel dimension $C$ and spatial size $H \times W$
- FSM: $\mathcal{O}(CHW \log(HW))$ for the FFT/IFFT

Reported experimental setups validate small memory footprints:
- RS super-resolution: 11.76M params, 128G FLOPs, 46MB peak GPU memory, 100ms/image (Xiao et al., 8 May 2024)
- Medical SR: 0.72–0.74M params for all five modalities (Huang et al., 31 Oct 2025)
Compared to CNN and Transformer baselines (e.g., HAT-L, LBNET), SFMFusion consistently achieves a superior accuracy/efficiency trade-off (e.g., a 0.11 dB PSNR gain at 19% of the compute and 28% of the memory; Xiao et al., 8 May 2024).
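A back-of-envelope comparison of token-mixing cost makes the linear-vs-quadratic gap concrete (the state size d_state = 16 is an assumed value, not from the sources):

```python
# Token-mixing cost for a 512x512 feature map with C = 64 channels.
H = W = 512; C = 64; d_state = 16
L = H * W                                   # 262,144 tokens
attn = L**2 * C                             # self-attention: O(L^2 * C)
mamba = L * C * d_state                     # selective scan: O(L * C * N)
print(f"attention ~{attn:.2e} ops, mamba ~{mamba:.2e} ops, "
      f"ratio ~{attn / mamba:.0f}x")        # ratio = L / d_state = 16,384x
```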
5. Training Protocols and Loss Functions
Typical loss functions:
- Super-resolution: pixel-wise $L_1$ loss (sometimes MSE), without perceptual or adversarial terms (Xiao et al., 8 May 2024, Huang et al., 31 Oct 2025)
- Segmentation/Classification: Cross-entropy for class labels, sometimes augmented with specialized terms (e.g., the SeK loss; Wijenayake et al., 11 Aug 2025)
- Fusion: Weighted sum of fusion loss and IR branch reconstruction losses, plus gradient-based edge terms (Sun et al., 10 Nov 2025)
- Image Restoration: Frequency loss terms can be included (amplitude/phase spectrum matching) for further enhancement (Zhen et al., 15 Apr 2024)
Optimizers: Adam/AdamW (typically β₁ = 0.9, β₂ = 0.999); learning rates decayed over epochs or iterations. Hardware: PyTorch on a single RTX 3090, with batch sizes of 4–8 for large images.
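A minimal training setup consistent with this protocol (the stand-in model, the 2e-4 starting learning rate, and the cosine schedule are assumptions; the Adam betas, batch size, and L1 loss are from the text):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)  # placeholder for an SFMFusion variant
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=500_000)

for step in range(3):                   # sketch of the update loop
    x = torch.randn(4, 3, 64, 64)       # batch of 4, per the reported setups
    loss = (model(x) - x).abs().mean()  # L1 loss, as used for SR
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```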
6. Experimental Benchmarks and Quantitative Results
Representative metrics (all verbatim from sources):
| Task | Dataset | SFMFusion Metric | Baseline | Params/FLOPs/Memory |
|---|---|---|---|---|
| SR (RSI) | AID/DOTA/DIOR | PSNR 31.98 dB / SSIM 0.83 | HAT-L (−0.11 dB) | 11.76M / 128G / 46MB |
| Medical SR | US/OCT/CT/MRI/ES | PSNR 20.98–38.13 dB | LBNET, LGSR | 0.72–0.74M |
| 3D Seg | BraTS2023 | WT Dice 94.69 / HD 3.41 | SegMamba | — |
| Change Detection | SECOND/Landsat | OA 88.62% / 96.25% | — | — |
| Motion CamO | MoCA-MASK/CAD2016 | mIoU 0.369 / 0.509 | SLT-Net/EMIP | 10.88G MACs |
| Multi-Modal Fusion | MSRS IVIF | MI 3.0 / VIF 0.87 | — | 0.7M / 439G |
| Gen. Model | CelebA-256 | FID 4.62 / recall 0.52 | DiT/LFM | 460M |
Ablation studies support the necessity of frequency streams, learnable scaling/gating, and hybrid blocks. Removing frequency branches or adaptive fusion uniformly degrades performance. For instance, semantic change detection shows a –1.26% and –1.47% SeK drop without FFT features (Wijenayake et al., 11 Aug 2025).
7. Significance, Domain Transfer, and Open Directions
SFMFusion establishes a paradigm for integrating spectral feature augmentation into linearly-scalable state-space frameworks. The combination of frequency selection (via FFT, DCT, or wavelet), spatial modeling (selective scan), and adaptive gating/fusion allows for efficient, interpretable mixing of global and local cues.
Notable advantages:
- Retains global context for large-scale images without quadratic cost.
- Explicitly restores high-frequency content, improving fine detail recovery.
- Adaptive fusion/gating mechanisms reconcile conflicting information sources.
Current research extends SFMFusion patterns to generative diffusion models with parallel wavelet-based and spatial scans, cross-attention fusion, and global transformers (Phung et al., 6 Nov 2024). Further possible directions include generalized SSM backbones, advanced spectral bases (beyond Haar/DCT), and dynamic fusion schemes for multi-modal or multi-scale tasks.
SFMFusion now underpins state-of-the-art methods across remote sensing, medical imaging, multi-modal fusion, motion perception, and image synthesis, setting new Pareto-optimal points in accuracy vs. efficiency. Its modular design facilitates adaptation wherever spatial-frequency complementarity and linear computation are essential.