Spatial-Frequency Enhanced Mamba Block

Updated 17 November 2025

SFMB is a modular neural network construct that fuses spatial details with frequency domain cues using state-space modeling.
It employs dual- or triple-branch architectures that process spatial and frequency features separately before a lightweight residual fusion.
Adaptive frequency processing and multi-directional scanning in SFMB enhance global context recovery and edge preservation in image tasks.

The Spatial-Frequency Enhanced Mamba Block (SFMB) is a modular neural network construct that fuses complementary spatial and frequency domain information using state-space modeling (Mamba). It has been applied to image and video tasks in medical imaging, remote sensing, deraining, image fusion, super-resolution, and generation. SFMB combines spatial detail preservation with global context recovery, using physics-informed frequency processing and directional scanning mechanisms with the aim of improving on conventional convolutional and transformer designs.

1. Core Architectural Principles

The foundational SFMB architecture comprises dual or triple branches specialized for spatial and frequency encodings, fused via lightweight neural blocks. In SSFMamba (Zhang et al., 5 Aug 2025), the architecture replaces conventional transformer or convolution blocks at each encoder stage of a 3D segmentation network with SFMB, which executes:

Spatial Branch: Captures local detail and some global structure via Mamba-based state-space modeling of normalized feature tensors $F_{\text{in}}\in\mathbb{R}^{C\times H\times W\times D}$, i.e., $F_{s} = \text{Mamba}(\text{LN}(F_{\text{in}}))$.
Frequency Branch: Computes the FFT of each channel, extracts magnitude and phase, processes the magnitude with two 1×1×1 convolutions and LeakyReLU, then applies a 3D Multi-Directional Scanning Mechanism (MDSM) to unfold the tensor along the axial, coronal, and sagittal directions. Each direction's feature sequence is enhanced via Mamba and mapped back to the spatial domain by an IFFT that reinstates the original phase.
Fusion: Outputs from both branches are summed and fused via a tiny MLP (two 1×1×1 convolutions), followed by a residual connection: $F_{\text{out}} = F_{\text{in}} + F_{\text{fus}}$.

This dual-path design ensures preservation of fine edges through spatial detail, sharp boundaries via frequency context, and long-range semantic continuity using global order-aware state-space modeling.
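
The three steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the SSFMamba implementation: `toy_scan` is a hypothetical stand-in for Mamba (a fixed linear recurrence), only one scan direction is used in the frequency branch, the conv, LeakyReLU, conv ordering is assumed, and all function and weight names are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the channel axis (axis 0) at every voxel.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def toy_scan(seq, decay=0.9):
    # Hypothetical stand-in for Mamba: h_t = decay * h_{t-1} + (1 - decay) * x_t.
    # seq has shape (C, N); real Mamba uses learned, input-dependent parameters.
    h = np.zeros(seq.shape[0])
    out = np.empty_like(seq)
    for t in range(seq.shape[1]):
        h = decay * h + (1.0 - decay) * seq[:, t]
        out[:, t] = h
    return out

def pointwise_conv(x, weight):
    # A 1x1x1 convolution is a channel-mixing matmul applied at each voxel.
    return np.einsum("oc,chwd->ohwd", weight, x)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def sfmb_forward(f_in, w_freq1, w_freq2, w_fus1, w_fus2):
    c, h, w, d = f_in.shape
    # Spatial branch: F_s = Mamba(LN(F_in)), here with the toy scan.
    f_s = toy_scan(layer_norm(f_in).reshape(c, -1)).reshape(c, h, w, d)
    # Frequency branch: FFT per channel, split magnitude and phase.
    spec = np.fft.fftn(f_in, axes=(1, 2, 3))
    mag, phase = np.abs(spec), np.angle(spec)
    # Only the magnitude is processed (assumed order: conv, LeakyReLU, conv).
    mag = pointwise_conv(leaky_relu(pointwise_conv(mag, w_freq1)), w_freq2)
    mag = toy_scan(mag.reshape(c, -1)).reshape(c, h, w, d)
    # Reinstate the original phase and return to the spatial domain.
    f_f = np.fft.ifftn(mag * np.exp(1j * phase), axes=(1, 2, 3)).real
    # Fusion: sum, tiny pointwise MLP, then the residual F_out = F_in + F_fus.
    f_fus = pointwise_conv(leaky_relu(pointwise_conv(f_s + f_f, w_fus1)), w_fus2)
    return f_in + f_fus
```

Because the fusion output is added to the input, setting the fusion weights to zero makes the block an identity map, which is one reason residual blocks of this kind can be dropped into an existing encoder stage without disturbing it at initialization.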

2. Frequency Domain Integration and Scanning Mechanisms

SFMB leverages frequency domain operations for global modeling and edge preservation:

FFT and Decomposition: Extraction of amplitude and phase from 3D FFT coefficients exploits conjugate symmetry, allowing recovery of bidirectional features and suppression of artifacts.
Multi-Directional or Spiral Frequency Scans: In SSFMamba (Zhang et al., 5 Aug 2025), MDSM unfolds the frequency features into 1D sequences along three orthogonal axes, and each sequence is enhanced by Mamba separately. In Vcamba (Li et al., 31 Jul 2025), the AFE module uses spiral-radial scanning outward from the DC frequency center, processing the low-to-high and high-to-low orderings of the frequency bins separately to maximize semantic continuity and enhance motion cues.
Adaptive Frequency Gating: DemMamba (Xu et al., 2024) introduces a learnable compressor network in the frequency branch to attenuate moiré artifacts, applying a soft mask to FFT coefficients before reconstruction.

The scanning mechanisms in SFMB’s frequency path are critical for maintaining semantic relationships and fully exploiting the structure-rich content in frequency space, unlike naive global self-attention or convolutional filtering.
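
The scan orders and the soft frequency mask described above can be illustrated with a short NumPy sketch. It is written under assumptions: the mapping of tensor axes to axial, coronal, and sagittal depends on data orientation, the radial ordering is a simplified 2D stand-in for the spiral scan in Vcamba, and `mask_logits` is a hypothetical placeholder for the output of a learned compressor network such as the one in DemMamba.

```python
import numpy as np

def directional_sequences(vol):
    # vol: (C, H, W, D). Returns three (C, N) sequences, each flattened so that
    # a different spatial axis varies fastest along the sequence.
    c = vol.shape[0]
    return [
        vol.reshape(c, -1),                        # D varies fastest
        vol.transpose(0, 1, 3, 2).reshape(c, -1),  # W varies fastest
        vol.transpose(0, 3, 2, 1).reshape(c, -1),  # H varies fastest
    ]

def radial_order(h, w):
    # Flat indices of an fftshift-ed (h, w) spectrum sorted by distance from the
    # DC bin at (h // 2, w // 2), with ties broken by angle.
    yy, xx = np.meshgrid(np.arange(h) - h // 2, np.arange(w) - w // 2, indexing="ij")
    radius = np.hypot(yy, xx).ravel()
    angle = np.arctan2(yy, xx).ravel()
    return np.lexsort((angle, radius))  # the last key (radius) is the primary key

def soft_mask_filter(img, mask_logits):
    # Attenuate FFT coefficients with a sigmoid mask, then reconstruct.
    mask = 1.0 / (1.0 + np.exp(-mask_logits))
    return np.fft.ifft2(np.fft.fft2(img) * mask).real
```

With `order = radial_order(h, w)` and `spec = np.fft.fftshift(np.fft.fft2(img))`, the low-to-high sequence is `spec.ravel()[order]` and the high-to-low sequence is its reverse. Each sequence would then be passed to a state-space scan.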

3. Cross-Domain Fusion Strategies

Fusion of spatial and frequency cues in SFMB variants is achieved by:

Elementwise Summation and Channel Mixing: Outputs from spatial and frequency branches are summed and compressed via lightweight MLPs or 1×1 convolutions, enabling efficient channel-wise interaction and residual learning.
Attention-Driven Fusion: Mamba-FCS (Wijenayake et al., 11 Aug 2025) utilizes CBAM attention modules to perform sequential channel and spatial attention, refining concatenated spatial, frequency, and explicit difference features.
Dynamic or Adaptive Gating: Several designs (MMR-Mamba (Zou et al., 2024), FMSR (Xiao et al., 2024)) employ per-channel scaling norms or bottleneck gating networks to selectively transfer or reinforce weak features between domains, suppress redundancy, and increase discriminability of informative patterns.

Ablations reported in the cited works indicate that removing the frequency path, or replacing learned fusion with a plain summation, consistently degrades edge, boundary, and semantic understanding metrics.
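
Two of the fusion strategies above, summation with channel mixing and bottleneck gating, can be sketched as follows. This is an illustrative approximation only: the gate is a generic squeeze-and-excitation style design rather than the mechanism of MMR-Mamba or FMSR, ReLU is chosen for brevity, and all names and weight shapes are assumptions.

```python
import numpy as np

def conv1x1(x, weight):
    # x: (C_in, H, W); weight: (C_out, C_in). Mixes channels at each pixel.
    return np.einsum("oc,chw->ohw", weight, x)

def sum_and_mix(f_spatial, f_freq, w1, w2):
    # Elementwise summation, then a two-layer pointwise MLP with a residual.
    fused = f_spatial + f_freq
    hidden = np.maximum(conv1x1(fused, w1), 0.0)
    return fused + conv1x1(hidden, w2)

def channel_gate(f_spatial, f_freq, w_down, w_up):
    # Bottleneck gate: global average pool, squeeze, expand, sigmoid.
    pooled = (f_spatial + f_freq).mean(axis=(1, 2))  # shape (C,)
    logits = w_up @ np.maximum(w_down @ pooled, 0.0)
    gate = 1.0 / (1.0 + np.exp(-logits))
    g = gate[:, None, None]
    # Per-channel convex blend of the spatial and frequency features.
    return g * f_spatial + (1.0 - g) * f_freq
```

The gated variant lets each channel lean toward whichever domain carries the more informative response, whereas plain summation weights both domains equally everywhere.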

4. State-Space Modeling and Computational Foundations

SFMB relies on Mamba’s state-space framework for efficient sequence modeling:

State-Space Recurrences: At the core, Mamba applies a recurrence $h_t = \overline{A}h_{t-1} + \overline{B}x_t$ over unfolded spatial or frequency sequences.
Directional Scans: Order-sensitive scanning (e.g., sweep-4 in DiMSUM (Phung et al., 2024)) and multi-directional traversals (axial, coronal, sagittal, spiral) enable extraction of both local and global dependencies within reasonable computational envelopes.
Linear Complexity: All state-space operations are linear in sequence length, $O(NC)$, compared with $O(N^2C)$ for self-attention. FFT/IFFT operations cost $O(HWD\log(HWD))$ per channel, which remains far below the quadratic cost of attention at typical volume sizes.

The state-space modeling paradigm underpins SFMB’s scalability and applicability to large volumetric data (medical, remote sensing) and high-resolution imagery.
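
A minimal sketch of the recurrence above, assuming a diagonal state matrix and the standard zero-order-hold discretization. The readout $y_t = Ch_t$ is the usual state-space output equation. Unlike Mamba, the parameters here are fixed rather than input-dependent, and the function names are illustrative.

```python
import numpy as np

def discretize_zoh(a_diag, b, delta):
    # Zero-order hold for a diagonal continuous-time system h' = A h + B x:
    # A_bar = exp(delta * A), B_bar = (A_bar - 1) / A * B (elementwise).
    a_bar = np.exp(delta * a_diag)
    b_bar = (a_bar - 1.0) / a_diag * b
    return a_bar, b_bar

def ssm_scan(x, a_bar, b_bar, c):
    # x: (N,) scalar sequence; the state has S = len(a_bar) entries.
    # One pass over the sequence costs O(N * S).
    h = np.zeros_like(a_bar)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = a_bar * h + b_bar * x_t  # h_t = A_bar h_{t-1} + B_bar x_t
        y[t] = c @ h                 # y_t = C h_t
    return y
```

For a sense of scale, a 64×64×64 volume flattens to N = 262,144 tokens, so attention would involve roughly N² ≈ 6.9×10¹⁰ pairwise interactions, against N log₂ N ≈ 4.7×10⁶ operations for an FFT and N steps for the scan (constants and channel width ignored).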

5. Implementation Details, Resource Profiles, and Performance Impact

Implementation of SFMB across domains shares recurring modules and parameterization.

Convolutional Kernels: Frequency branches commonly apply two stacked 1×1×1 or 3×3 convolutions post-FFT for feature compression and gating; spatial branches utilize depthwise separable convolutions for local enhancement.
Normalization and Activation: LayerNorm (per-voxel or per-channel) precedes state-space or frequency processing; LeakyReLU, GELU, and SiLU are the recurring choices for gating and nonlinearity.
Parameter Counts and FLOPs: Recent SFMB-based networks (e.g., FGMamba (Huang et al., 31 Oct 2025), SFMFusion (Sun et al., 10 Nov 2025)) are reported to operate with <0.75M parameters and 15–439 GFLOPs per full network, substantially lower than transformer-based architectures.
Loss Functions: In segmentation contexts, standard per-voxel cross-entropy is applied; SR/deraining may use hybrid spatial + amplitude/phase losses (e.g., FreqMamba (Zhen et al., 2024)).
Performance Metrics: SFMB variants yield consistent parameter-efficient improvements in Dice score/HD95 (segmentation), F_scd/mIoU/SeK (change detection), PSNR/SSIM (SR/deraining), and FID/Convergence (generation).

Ablation studies across works demonstrate that frequency path, adaptive fusion, and multi-directional scanning individually and synergistically contribute to measurable performance gains.
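
A hybrid spatial plus amplitude/phase loss of the kind mentioned above might look like the following sketch. The L1 form, the single weight `lam`, and the phase wrapping are assumptions chosen for illustration, not the exact loss used in FreqMamba.

```python
import numpy as np

def hybrid_loss(pred, target, lam=0.1):
    # Spatial term: mean absolute error between the two images.
    spatial = np.abs(pred - target).mean()
    p_spec, t_spec = np.fft.fft2(pred), np.fft.fft2(target)
    # Amplitude term: mean absolute difference of the FFT magnitudes.
    amp = np.abs(np.abs(p_spec) - np.abs(t_spec)).mean()
    # Phase term: the angle of p * conj(t) is the phase difference wrapped
    # into (-pi, pi], so phases near +pi and -pi compare as close.
    dphi = np.angle(p_spec * np.conj(t_spec))
    pha = np.abs(dphi).mean()
    return spatial + lam * (amp + pha)
```

Comparing wrapped phase differences rather than raw angles avoids penalizing two spectra whose phases sit on opposite sides of the ±π discontinuity.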

6. Application-Specific Modifications and Extensions

SFMB’s core logic adapts flexibly to diverse tasks:

3D Medical Segmentation (SSFMamba (Zhang et al., 5 Aug 2025)): Full multi-directional frequency scanning, symmetry-aware fusion, and per-voxel classification on BraTS datasets.
Video Demoireing (DemMamba (Xu et al., 2024)): Interleaving SFMB with temporal Mamba blocks and adaptive frequency filtering for artifact suppression.
Super-Resolution (FMSR (Xiao et al., 2024), FGMamba (Huang et al., 31 Oct 2025)): Hybrid fusion of Vision State-Space modeling, frequency selection, and local gating; multi-level fusion strategies.
Multi-Modal Imaging (SFMFusion (Sun et al., 10 Nov 2025), MMR-Mamba (Zou et al., 2024)): Specialized channel-enhanced, phase/amplitude fusion, and dynamic multi-modal gating to integrate structural and textural cues.
Image Generation (DiMSUM (Phung et al., 2024)): Wavelet-Mamba scan and cross-attention fusion with globally shared transformers for low FID and order-aware context mixing.
Video Camouflage Detection (Vcamba (Li et al., 31 Jul 2025)): Sequential motion enhancement, dual-domain feature fusion for optimal motion perception in highly ambiguous scenes.

Application-specific modules (e.g., SFMF, SFF, ASFF, FSM, HGM) reinforce SFMB’s adaptability and domain relevance.

7. Quantitative Benchmarking and Empirical Insights

Empirical evaluations across SFMB-enabled models demonstrate consistent improvements:

Paper	Domain	Key Metric / Gain
SSFMamba (Zhang et al., 5 Aug 2025)	3D Med. Segmentation	↓HD95, ↑Dice (BraTS2020/2023)
DemMamba (Xu et al., 2024)	Video Demoireing	+1.3 dB PSNR vs. SOTAs
DiMSUM (Phung et al., 2024)	Image Generation	FID: 4.65 (vs 6.19 for baseline)
FreqMamba (Zhen et al., 2024)	Image Deraining	+0.11–0.33 dB PSNR over Restormer
Mamba-FCS (Wijenayake et al., 11 Aug 2025)	Change Detection	+1.26% F_scd; ↑Edge/Texture recovery
SFMFusion (Sun et al., 10 Nov 2025)	MM Image Fusion	Top AvgRank (MSRS), ↓FLOPs vs ViT
FGMamba (Huang et al., 31 Oct 2025)	Med. SR	+0.0031 dB PSNR, <0.75M params

These results corroborate SFMB’s superior spatial–frequency fusion capabilities, computational efficiency, and generalization across modalities and problem domains.

SFMB has established itself as an effective block-level paradigm for dual-domain feature extraction and fusion, leveraging advanced state-space modeling, frequency-physics principles, and adaptive attention. Its architectural modularity enables flexible instantiation and scalability, supporting strong empirical results with efficient resource consumption in computational vision systems.