
Spatial-Frequency Enhanced Mamba Fusion

Updated 17 November 2025
  • SFMFusion denotes a family of fusion strategies that combine state-space Mamba blocks with spatial-frequency augmentation to overcome CNN and Transformer limitations.
  • This methodology employs a Vision State-Space Module and Frequency Selection Module to achieve linear-complexity global modeling and enhanced high-frequency detail retention.
  • Empirical results across remote sensing, medical imaging, and generative modeling demonstrate superior efficiency and performance, validated by improved PSNR/SSIM metrics and reduced computation.

Spatial-Frequency Enhanced Mamba Fusion (SFMFusion) refers to a class of architectures that combine state-space modeling (Mamba blocks) with explicit spatial-frequency feature augmentation. These architectures address the limitations of traditional CNNs (limited receptive field) and Transformers (quadratic complexity) by providing linear-complexity global modeling and targeted frequency enhancement. SFMFusion has been instantiated for super-resolution, segmentation, multi-modal fusion, motion perception, change detection, and generative modeling across remote sensing, medical imaging, and general computer vision. Its core innovation is the coordinated fusion of spatial and frequency-domain representations—often via custom blocks that mine, refine, and merge spatial structures and frequency cues—within a scalable state-space framework.

1. Architectural Foundations and Key Components

All SFMFusion variants build on the selective state-space model (Mamba) as the primary global feature mixer. The canonical design is a multi-stream network, with each block fusing spatial and frequency (often Fourier) information. The typical computational graph consists of:

  • Vision State-Space Module (VSSM): Implements linear-complexity global modeling, replacing self-attention with time/sequence recurrences. In 2D vision, selective scans yield global spatial dependencies efficiently.
  • Frequency Selection Module (FSM): Applies an FFT (or DCT/wavelet) per channel, followed by small convolutional attentional filter masks that amplify or suppress informative frequency bands. The preferred variant uses two 1×1 convolutions with GELU on the complex–real spectrum, outperforming ReLU gating and raw spectrum selection.
  • Hybrid Gate Module (HGM): Applies pixel-wise gating: it splits post-fusion features into two halves, applies channel and spatial attention, and fuses them with a learned mask for adaptive local bias re-injection.
  • Learnable Scaling Adaptors: Per-block scalars (α_l, β_l) that mediate the residual connection strength, addressing conflicts between spatial/frequency streams.

The typical block-level pseudocode (for remote sensing SR) is:

```
y   = α_l * x_{l-1} + VSSM(LN(x_{l-1})) + FSM(LN(x_{l-1}))
x_l = β_l * y + HGM(LN(y)) + FSM(LN(y))
```
This establishes parallel global/local “streams” with distinct frequency and spatial augmentations, subsequently merged through adaptive gating.
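A minimal PyTorch sketch of this block wiring is below. The class name, constructor signature, and nn.Identity placeholders are illustrative assumptions, not the authors' code; in the papers, VSSM wraps a selective-scan (Mamba) layer, FSM an FFT branch, and HGM a gate, and whether the two FSM calls share weights is not specified here.

```python
import torch
import torch.nn as nn

class FusionMambaBlock(nn.Module):
    """Hypothetical wiring of the block pseudocode above."""
    def __init__(self, dim, vssm=None, fsm=None, hgm=None):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)          # pre-norms for the two sub-steps
        self.ln2 = nn.LayerNorm(dim)
        self.vssm = vssm or nn.Identity()     # stand-in for a selective-scan layer
        self.fsm = fsm or nn.Identity()       # stand-in for the frequency branch
        self.hgm = hgm or nn.Identity()       # stand-in for the hybrid gate
        # Learnable scaling adaptors (alpha_l, beta_l) on the residual paths.
        self.alpha = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.ones(dim))

    def forward(self, x):                     # x: (B, H, W, C), channels-last
        y = self.alpha * x + self.vssm(self.ln1(x)) + self.fsm(self.ln1(x))
        return self.beta * y + self.hgm(self.ln2(y)) + self.fsm(self.ln2(y))
```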

2. Mathematical Formulations and Fusion Strategies

State-Space Modeling

The core SSM (as realized in Mamba) evolves hidden states as

$$x_{t+1} = A x_t + B u_t, \qquad y_t = C x_t + D u_t,$$

where $A, B, C, D$ are learnable and $B$, $C$ may be input-dependent for "selective" adaptation. In 2D vision, selective scanning (along spatial axes or directions) yields linear-time mixing, compared to $O(N^2)$ for self-attention.
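A NumPy sketch of this recurrence follows; the fixed (non-selective) parameters and the example dimensions are assumptions for illustration.

```python
import numpy as np

def ssm_scan(A, B, C, D, u):
    """Linear-time scan for x_{t+1} = A x_t + B u_t, y_t = C x_t + D u_t.
    A: (d, d); B, C: (d,); D: scalar; u: (T,). In a selective (Mamba-style)
    SSM, B and C would additionally depend on u_t; this fixed version only
    illustrates the recurrence that replaces quadratic self-attention."""
    x = np.zeros(A.shape[0])
    y = np.empty_like(u, dtype=float)
    for t, u_t in enumerate(u):        # one state update per token: O(T) steps
        y[t] = C @ x + D * u_t
        x = A @ x + B * u_t
    return y

# Example: a stable random SSM over a length-1000 sequence.
rng = np.random.default_rng(0)
d = 8
A = 0.9 * np.eye(d)                    # diagonal A gives the O(T d) Mamba regime
B, C, D = rng.normal(size=d), rng.normal(size=d), 0.5
print(ssm_scan(A, B, C, D, rng.normal(size=1000)).shape)   # (1000,)
```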

Frequency Feature Mining

FSM applies a channel-wise 2D FFT, $F = \mathrm{FFT2}(X)$. Then

$$A = \mathrm{Conv}_{1\times1}\big(\mathrm{GELU}\big(\mathrm{Conv}_{1\times1}(\mathrm{Re}\, F)\big)\big),$$

and an inverse FFT returns to the spatial domain, $Z = \mathrm{IFFT2}(A)$. The learned convolutions implement soft frequency masks for spectrum selection.
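One possible PyTorch realization is sketched below. Treating the real and imaginary spectrum parts as stacked channels is an assumption (the source's exact complex handling may differ), as is the hidden width.

```python
import torch
import torch.nn as nn

class FSM(nn.Module):
    """Sketch of a Frequency Selection Module: learned 1x1-conv soft masks
    applied in the 2D Fourier domain."""
    def __init__(self, dim, hidden=None):
        super().__init__()
        hidden = hidden or dim * 2
        self.net = nn.Sequential(
            nn.Conv2d(dim * 2, hidden, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(hidden, dim * 2, kernel_size=1),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        f = torch.fft.rfft2(x, norm="ortho")    # (B, C, H, W//2+1), complex
        z = torch.cat([f.real, f.imag], dim=1)  # complex -> 2C real channels
        z = self.net(z)                         # learned soft frequency mask
        f = torch.complex(z[:, :C], z[:, C:])
        return torch.fft.irfft2(f, s=(H, W), norm="ortho")
```

For example, `FSM(64)(torch.rand(1, 64, 32, 32))` returns a tensor of the same shape with re-weighted frequency content.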

Hybrid Gating

Given concatenated features $Y \in \mathbb{R}^{H \times W \times 2C}$, split into $Y_1, Y_2$:

  • $Y_1$: channel/coordinate features → Conv, DWConv, channel attention
  • $Y_2$: per-pixel linear mix with GELU, producing the gate

Fusion: $Y_\text{out} = \mathrm{Conv}_{1\times1}(M \odot X_\text{coord})$, where the mask $M$ gates local, spatially varying activations (see the sketch below).
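A PyTorch sketch of this gate follows. The specific layer choices (squeeze-excite channel attention, depthwise kernel size) and the equal channel split are assumptions loosely following the description above, not the papers' exact design.

```python
import torch
import torch.nn as nn

class HGM(nn.Module):
    """Sketch of the Hybrid Gate Module: one split carries channel/spatial
    structure, the other produces a pixel-wise gate."""
    def __init__(self, dim):
        super().__init__()
        self.branch1 = nn.Sequential(                        # Y1: local structure
            nn.Conv2d(dim, dim, 1),
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),   # depthwise conv
        )
        self.ca = nn.Sequential(                             # channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim, 1),
            nn.Sigmoid(),
        )
        self.branch2 = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.GELU())  # Y2: gate
        self.out = nn.Conv2d(dim, dim, 1)

    def forward(self, y):                   # y: (B, 2C, H, W) concatenated features
        y1, y2 = y.chunk(2, dim=1)
        feats = self.branch1(y1)
        feats = feats * self.ca(feats)      # channel-wise re-weighting
        mask = self.branch2(y2)             # learned pixel-wise mask M
        return self.out(mask * feats)       # Y_out = Conv1x1(M ⊙ feats)
```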

Scaling Adaptors

In each fusion Mamba block (FMB), features are multiplied by a block-specific scalar ($\alpha_l$ before the first merge, $\beta_l$ after it) prior to addition, as in the pseudocode of Section 1. Ablations show that directly summing the streams without this scaling impairs overall performance.

3. Domain-Specific Instantiations

SFMFusion is adapted across several domains:

  • Remote Sensing Super-Resolution: Frequency-assisted blocks yield superior PSNR/SSIM at 19–28% of the memory/FLOPs of Transformer baselines (Xiao et al., 8 May 2024). The model structure accommodates large RSIs (>512×512) with linear scaling.
  • Medical Image SR: SFMFusion combines Gated Attention-enhanced SSMs and Pyramid Frequency Fusion. Explicit multi-scale high-frequency reinjection produces sharper anatomical structures (e.g., vessel/tissue boundaries), outperforming CNN+ViT hybrids with only 0.74M params (Huang et al., 31 Oct 2025).
  • 3D Medical Segmentation: Symmetry-driven dual-branch blocks exploit the conjugate symmetry of the FFT of real-valued inputs, roughly halving frequency-domain overhead (see the snippet after this list). Multi-directional 3D scanning (slice, cross-slice, local 3D) realizes Mamba’s long-range modeling within the frequency domain (Zhang et al., 5 Aug 2025).
  • Multi-modal Fusion: Three-branch architectures couple image reconstruction (IR) and fusion simultaneously. SFMB combines multi-scale Mamba, channel attention, and frequency enhancement. Adaptive fusion (DFMB) deploys learned spatial masks to dynamically weigh each IR stream’s contribution (Sun et al., 10 Nov 2025).
  • Change Detection and Motion Perception: Joint spatio-frequency fusion blocks merge log-amplitude spectrum and spatial difference channels. In Vcamba, spatial-frequency motion fusion integrates dual-domain cues via sequence concatenation/cross-merging for temporal segmentation (Wijenayake et al., 11 Aug 2025, Li et al., 31 Jul 2025).
  • Generative Modeling: DiMSUM uses parallel spatial/wavelet-Mamba streams, cross-attention fusion, and shared transformers. Haar-based multi-level DWT optimizes local/global frequency capture, improving order-aware generation and training convergence (Phung et al., 6 Nov 2024).
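The conjugate-symmetry saving mentioned for 3D segmentation is easy to verify: for real inputs, the half-spectrum returned by a real FFT is lossless. The volume shape below is an arbitrary example.

```python
import numpy as np

x = np.random.rand(64, 128, 128)   # one channel of a hypothetical 3D volume
full = np.fft.fftn(x)              # (64, 128, 128) complex, redundant for real input
half = np.fft.rfftn(x)             # (64, 128, 65): symmetry halves the last axis
print(full.size, half.size)        # 1048576 vs 532480 complex entries

# The half-spectrum reconstructs the input exactly, so frequency-domain
# compute and memory drop by roughly half:
assert np.allclose(np.fft.irfftn(half, s=x.shape), x)
```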

4. Computational Complexity and Efficiency

SFMFusion maintains linear complexity in the spatial dimension, realized by selective scan state-space blocks:

  • VSSM: $O(Nd)$ per block for feature size $d$ and spatial size $N$
  • FSM: $O(HW \log HW)$ for the FFT/IFFT

Experimental setups validate small memory footprints:

  • RS super-resolution: 11.76M params, 128G FLOPs, 46MB peak GPU memory, 100ms/image (Xiao et al., 8 May 2024)
  • Medical SR: 0.72–0.74M params for all five modalities (Huang et al., 31 Oct 2025)

Compared to CNN and Transformer baselines (e.g., HAT-L, LBNET), SFMFusion consistently achieves a superior accuracy/efficiency trade-off, e.g., a 0.11 dB PSNR gain at 19% of the compute and 28% of the memory (Xiao et al., 8 May 2024).
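A back-of-envelope comparison makes the asymptotic gap concrete; the feature-map size and channel count below are assumed for illustration, and the counts are proportional operation counts, not exact FLOPs.

```python
import math

# Token-mixing cost for a 512x512 feature map with d = 64 channels.
H = W = 512
N, d = H * W, 64
ssm_cost  = N * d              # selective scan, O(N d):      ~1.7e7
fft_cost  = N * math.log2(N)   # FSM FFT/IFFT, O(HW log HW):  ~4.7e6
attn_cost = N * N * d          # self-attention, O(N^2 d):    ~4.4e12
print(f"attention / scan ratio = {attn_cost / ssm_cost:.0f}x")  # = N = 262144x
```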

5. Training Protocols and Loss Functions

Typical loss functions:

  • Super-resolution: $L_\text{pix} = \|I_\text{SR} - I_\text{HR}\|_1$, sometimes MSE, without perceptual or adversarial terms (Xiao et al., 8 May 2024, Huang et al., 31 Oct 2025)
  • Segmentation/Classification: Cross-entropy for class labels, sometimes augmented with specialized metrics (SeK loss, (Wijenayake et al., 11 Aug 2025))
  • Fusion: Weighted sum of fusion loss and IR branch reconstruction losses, plus gradient-based edge terms (Sun et al., 10 Nov 2025)
  • Image Restoration: Frequency loss terms can be included (amplitude/phase spectrum matching) for further enhancement (Zhen et al., 15 Apr 2024)

Optimizers: Adam/AdamW (typically $\beta_1 = 0.9$, $\beta_2 = 0.999$); learning rates $1\times10^{-4}$ to $3\times10^{-4}$, decayed over epochs or iterations. Hardware: PyTorch on a single RTX 3090, with batch sizes 4–8 for large images.
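A minimal training-step sketch combining these recipes is below: an L1 pixel loss, an optional amplitude/phase spectrum term, and AdamW with the cited hyperparameters. The placeholder model, the loss weight `lam`, and the learning rate chosen from the cited range are assumptions, not values from the papers.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Conv2d(3, 3, 3, padding=1)    # stand-in for an SFMFusion network
opt = torch.optim.AdamW(model.parameters(), lr=2e-4, betas=(0.9, 0.999))

def spectrum_loss(sr, hr):
    """Amplitude/phase spectrum matching, one form of frequency loss."""
    fs, fh = torch.fft.rfft2(sr), torch.fft.rfft2(hr)
    return F.l1_loss(fs.abs(), fh.abs()) + F.l1_loss(torch.angle(fs), torch.angle(fh))

def train_step(lr_img, hr_img, lam=0.1):       # lam: assumed loss weight
    sr = model(lr_img)
    loss = F.l1_loss(sr, hr_img) + lam * spectrum_loss(sr, hr_img)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(train_step(torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)))
```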

6. Experimental Benchmarks and Quantitative Results

Representative metrics (all verbatim from sources):

| Task | Dataset | SFMFusion Metric | Baseline | Params/FLOPs/Memory |
|---|---|---|---|---|
| SR (RSI) | AID/DOTA/DIOR | PSNR 31.98 dB / SSIM 0.83 | HAT-L (−0.11 dB) | 11.76M / 128G / 46MB |
| Medical SR | US/OCT/CT/MRI/ES | PSNR 38.13–20.98 dB | LBNET, LGSR | 0.72–0.74M |
| 3D Segmentation | BraTS2023 | WT Dice 94.69 / HD 3.41 | SegMamba | — |
| Change Detection | SECOND/Landsat | OA 88.62% / 96.25% | — | — |
| Motion (CamO) | MoCA-MASK/CAD2016 | mIoU 0.369 / 0.509 | SLT-Net/EMIP | MACs 10.88G |
| Multi-Modal Fusion | MSRS (IVIF) | MI 3.0, VIF 0.87 | — | 0.7M / 439G |
| Generative Model | CelebA-256 | FID 4.62, recall 0.52 | DiT/LFM | 460M |

Ablation studies support the necessity of frequency streams, learnable scaling/gating, and hybrid blocks. Removing frequency branches or adaptive fusion uniformly degrades performance. For instance, semantic change detection shows a −1.26% $F_\text{scd}$ and −1.47% SeK drop without FFT features (Wijenayake et al., 11 Aug 2025).

7. Significance, Domain Transfer, and Open Directions

SFMFusion establishes a paradigm for integrating spectral feature augmentation into linearly-scalable state-space frameworks. The combination of frequency selection (via FFT, DCT, or wavelet), spatial modeling (selective scan), and adaptive gating/fusion allows for efficient, interpretable mixing of global and local cues.

Notable advantages:

  • Retains global context for large-scale images without quadratic cost.
  • Explicitly restores high-frequency content, improving fine detail recovery.
  • Adaptive fusion/gating mechanisms reconcile conflicting information sources.

Current research extends SFMFusion patterns to generative diffusion models with parallel wavelet-based and spatial scans, cross-attention fusion, and global transformers (Phung et al., 6 Nov 2024). Further possible directions include generalized SSM backbones, advanced spectral bases (beyond Haar/DCT), and dynamic fusion schemes for multi-modal or multi-scale tasks.

SFMFusion now underpins state-of-the-art methods across remote sensing, medical imaging, multi-modal fusion, motion perception, and image synthesis, setting new Pareto-optimal points in accuracy vs. efficiency. Its modular design facilitates adaptation wherever spatial-frequency complementarity and linear computation are essential.
