Frequency-Aware Fusion Methods
- Frequency-aware fusion is a paradigm that combines frequency-domain analysis with spatial feature processing to retain global structures and fine-grained details.
- It employs mathematical transforms such as FFT, DWT, and DCT to decompose data into frequency bands, enabling selective fusion of low- and high-frequency components.
- The paradigm has proven effective in applications such as infrared-visible image fusion, biomedical imaging, and deepfake detection, improving both objective metrics and visual fidelity.
Frequency-aware fusion refers to methodologies that explicitly extract, manipulate, and integrate frequency-domain information—often in conjunction with spatial-domain features—within multimodal or multi-representational pipelines. This paradigm leverages the mathematical properties of frequency analysis (e.g., via Fourier or wavelet transforms) to enhance information preservation, improve discriminability, and mitigate domain-specific degradations or artifacts, particularly in tasks where spatial-only fusion is insufficient. Frequency-aware fusion has demonstrated superior empirical effectiveness across diverse domains: infrared-visible image fusion, multi-modal biomedical imaging, pansharpening, detection, segmentation, and multimodal representation learning.
1. Conceptual Foundations and Core Principles
Frequency-aware fusion is based on the idea that different information modalities, or even feature hierarchies within a modality, can be decomposed into distinct frequency bands. Each band offers unique task-relevant cues:
- Low-frequency components: Carry global structural information (e.g., luminance, object layouts, scene context).
- High-frequency components: Encode fine-grained details (e.g., edges, textures, artifacts), often critical for detail preservation and discriminability—especially where local variations or manipulations occur.
Most classical fusion pipelines (e.g., for image, sensor, or representation fusion) rely on spatial-domain feature concatenation, addition, or attention mechanisms, potentially discarding or diluting informative frequency cues. Frequency-aware approaches explicitly transform data or feature maps into the frequency domain (via FFT, DWT, or DCT), perform frequency-sensitive operations (e.g., cross-attention, masking, learnable filtering, or tailored loss constraints), and then recombine the processed frequency and spatial features into a strongly complementary fused output. This strategy underpins recent state-of-the-art results in domains such as infrared-visible image fusion (IVIF), pansharpening, BCI frequency decoding, cross-modal representation learning, and deepfake detection.
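As a concrete illustration of the low-/high-frequency split described above, the following minimal sketch separates an image into the two bands with a radial low-pass mask in the FFT domain. The cutoff fraction and function name are illustrative assumptions, not taken from any cited work.

```python
import torch

def split_frequency_bands(img: torch.Tensor, cutoff: float = 0.1):
    """Split an image into low- and high-frequency parts via a radial FFT mask.

    img:    (B, C, H, W) tensor.
    cutoff: fraction of the spectrum radius kept as "low frequency"
            (illustrative hyperparameter, not from any specific paper).
    """
    B, C, H, W = img.shape
    spec = torch.fft.fftshift(torch.fft.fft2(img, norm="ortho"), dim=(-2, -1))

    # Build a centered radial low-pass mask over the shifted spectrum.
    ys = torch.arange(H, device=img.device, dtype=torch.float32) - H // 2
    xs = torch.arange(W, device=img.device, dtype=torch.float32) - W // 2
    radius = torch.sqrt(ys[:, None] ** 2 + xs[None, :] ** 2)
    low_mask = (radius <= cutoff * min(H, W) / 2).to(spec.dtype)

    low_spec, high_spec = spec * low_mask, spec * (1 - low_mask)
    low = torch.fft.ifft2(torch.fft.ifftshift(low_spec, dim=(-2, -1)), norm="ortho").real
    high = torch.fft.ifft2(torch.fft.ifftshift(high_spec, dim=(-2, -1)), norm="ortho").real
    return low, high  # low: structure/luminance; high: edges/texture
```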
2. Mathematical Tools and Transformations
Frequency-aware fusion frameworks exploit a variety of mathematical transforms and decomposition methods that project spatial or sequential data into the frequency domain:
- Fast Fourier Transform (FFT) decomposes signals into sine and cosine basis functions to extract amplitude and phase spectra, supporting both global and local analysis (Hu et al., 30 Oct 2024, Zheng et al., 9 Jul 2025).
- Discrete Wavelet Transform (DWT) and sub-band decomposition enable multi-scale representation, splitting data into low- and multiple high-frequency bands, often aligning with semantic information (e.g., structure vs. detail in images, signal harmonics in BCI) (Zhang et al., 4 Jun 2025, Xing et al., 2022, Zhang et al., 5 Sep 2025).
- Discrete Cosine Transform (DCT), often used in vision transformers, captures spectral coefficients for spatial feature maps, improving high-frequency detail awareness (Zhang et al., 12 Jun 2025).
- Gaussian/Laplacian Pyramids separate features into distinct frequency bands for multi-resolution processing (Sun et al., 25 Mar 2025).
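The pyramid decomposition listed last can be sketched in a few lines. In this illustrative version, average pooling stands in for Gaussian blurring; the function name and level count are assumptions for exposition.

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(x: torch.Tensor, levels: int = 3):
    """Split (B, C, H, W) features into frequency bands via a Laplacian pyramid.

    Assumes H and W are divisible by 2**levels; avg-pooling approximates
    the Gaussian low-pass step for simplicity.
    """
    bands = []
    cur = x
    for _ in range(levels):
        down = F.avg_pool2d(cur, 2)                      # low-pass + downsample
        up = F.interpolate(down, size=cur.shape[-2:],
                           mode="bilinear", align_corners=False)
        bands.append(cur - up)                           # band-pass residual
        cur = down
    bands.append(cur)                                    # coarsest low-frequency band
    return bands
```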
Transform-based decompositions facilitate selective or learnable manipulation—such as masking, modulation, sub-band attention, or cross-attention between frequency-specific representations—before reintegration with spatial features, typically via inverse transform (IFFT, IDWT, etc.).
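The following minimal sketch shows this decompose-manipulate-recombine pattern for the FFT case: extract amplitude and phase, apply a learnable per-frequency reweighting, and invert back to the spatial domain. The module and its parameterization are illustrative assumptions, not a reimplementation of any cited method.

```python
import torch
import torch.nn as nn

class LearnableSpectralFilter(nn.Module):
    """Generic FFT -> learnable filtering -> IFFT block (illustrative sketch).

    Operates on feature maps of shape (B, C, H, W): the amplitude spectrum
    is reweighted by a learned mask while the phase is kept intact, a common
    pattern in frequency-aware fusion modules.
    """

    def __init__(self, channels: int, h: int, w: int):
        super().__init__()
        # One learnable gain per channel and rFFT frequency bin (assumed design).
        self.gain = nn.Parameter(torch.ones(channels, h, w // 2 + 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        spec = torch.fft.rfft2(x, norm="ortho")   # complex spectrum
        amp, phase = spec.abs(), spec.angle()     # amplitude / phase split
        amp = amp * self.gain                     # frequency-sensitive reweighting
        spec = torch.polar(amp, phase)            # recombine into a complex spectrum
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
```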
3. Architectural Patterns and Exemplary Modules
Recent models operationalize frequency-aware fusion using diverse architectural blocks, including:
| Module Name/Abbreviation | Core Operation | Domain(s) |
|---|---|---|
| DMRM (Dual-Modality Refinement Module) (Hu et al., 30 Oct 2024) | Gradient extraction, attention, complementary spatial paths | Spatial |
| FDFM (Frequency Domain Fusion Module) (Hu et al., 30 Oct 2024, Zheng et al., 9 Jul 2025) | FFT→Conv→Concat→IFFT fusion | Frequency |
| IFSA (Intra-Frequency Self-Attention) (Zhang et al., 4 Jun 2025) | Cross-attention per sub-band post-wavelet | Frequency |
| IFI (Inter-Frequency Interaction) (Zhang et al., 4 Jun 2025) | Channel & spatial attention across bands | Frequency |
| FSAM (Frequency-Spatial Attention Mechanism) (Zhang et al., 12 Jun 2025) | 2D-DCT-based frequency + spatial weighting | Both |
| AFDP (Adaptive Frequency Domain Perceptron) (Liu et al., 30 Jul 2025) | Directional FFTs, band-masking, channel weighting | Frequency |
| FCB (Frequency Compression Block) (Li et al., 1 Apr 2025) | KNN compression/scoring of multi-layer FFT tokens | Frequency |
| Offset/ALPF/AHPF (FreqFusion) (Chen et al., 23 Aug 2024) | Learnable low/high-pass filters, adaptive offsets | Frequency |
These modules typically operate within dual-branch or multi-branch architectures—one branch processes spatial/pixel features, another processes frequency features, and a third may handle cross-modal/contextual fusion.
- Parallel branch fusion: e.g., SFDFusion concatenates spatial (DMRM) and frequency (FDFM) outputs (Hu et al., 30 Oct 2024), while SFAE conducts cross-attention between spatial and frequency branch encodings (Ye et al., 2 Aug 2025); a minimal sketch of the FFT-based fusion pattern appears at the end of this section.
- Attention and gating: FSAM applies DCT across channel subsets, then multiplies attention weights for frequency and spatial saliency (Zhang et al., 12 Jun 2025); SFMFNet uses Haar wavelet outputs and spatial branch attention fused via dynamic gating (Lv et al., 28 Aug 2025).
- Self-/cross-attention in frequency: WIFE-Fusion leverages IFSA and IFI to align and enhance intra-/inter-band frequency cues (Zhang et al., 4 Jun 2025); AdaFuse's CAF block executes cross-modality attention in both spatial and frequency branches (Gu et al., 2023).
- Explicit filtering and masking: FreqFusion employs ALPF and AHPF modules to adaptively smooth or restore features based on learned frequency-aware filters (Chen et al., 23 Aug 2024); the AFDP in LD3CF applies learnable high/low-frequency masks for LiDAR-based crack segmentation (Liu et al., 30 Jul 2025).
The specific mathematical choices (e.g., choice of decomposition, sub-band grouping, pooling strategies, and learnable transformation or fusion rules) are tightly aligned with the physical/statistical properties of the signals or expected distortions.
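As a concrete instance of the FFT→Conv→Concat→IFFT pattern listed for FDFM-style modules, the sketch below fuses two feature maps by convolving their stacked real/imaginary spectra. The 1×1 convolution and layer sizes are illustrative choices, not the published architecture.

```python
import torch
import torch.nn as nn

class FrequencyDomainFusion(nn.Module):
    """Sketch of an FDFM-style FFT -> Conv -> Concat -> IFFT fusion block.

    Fuses two modalities (e.g., infrared and visible features) by convolving
    their stacked real/imaginary spectra; sizes are illustrative only.
    """

    def __init__(self, channels: int):
        super().__init__()
        # 2 modalities x (real, imag) parts in -> fused (real, imag) parts out.
        self.fuse = nn.Conv2d(4 * channels, 2 * channels, kernel_size=1)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        fa = torch.fft.rfft2(a, norm="ortho")
        fb = torch.fft.rfft2(b, norm="ortho")
        # Concatenate real/imaginary parts of both spectra along channels.
        stacked = torch.cat([fa.real, fa.imag, fb.real, fb.imag], dim=1)
        fused = self.fuse(stacked)
        real, imag = fused.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag),
                                s=a.shape[-2:], norm="ortho")
```

In a dual-branch design, the output of such a block would be concatenated with a spatial-branch feature map before decoding, mirroring the parallel-branch pattern above.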
4. Loss Functions and Supervision Strategies
Frequency-aware fusion frameworks often use specialized loss functions targeting frequency-domain and spatial-domain consistency to regularize learning and maximize task relevance:
- Frequency-domain constraints:
- SFDFusion introduces a frequency fusion loss that maximizes correlation between IFFT-reconstructed output and original sources in frequency-salient/spatially-masked regions (Hu et al., 30 Oct 2024).
- FAFNet's HFS loss aligns high-frequency detail injection from PAN to MS bands using cross-correlation, avoiding spectral distortion (Xing et al., 2022).
- RPFNet applies a frequency contrastive loss to force fused images to match IR frequencies in salient regions and VIS frequencies in background, using adaptive spatial masks (Zheng et al., 9 Jul 2025).
- Structure-preserving (high-frequency) losses:
- AdaFuse employs a hybrid content–structure loss, including gradient-based structure loss (based on structural tensors) and SSIM loss (Gu et al., 2023).
- FreqFusion explicitly improves boundary detail through feature similarity and margin metrics, reflecting its high/low-pass architecture (Chen et al., 23 Aug 2024).
- Reconstruction and similarity metrics:
- L1/L2 (MAE/MSE) reconstruction terms, together with similarity and information measures such as mutual information, PSNR, SSIM, entropy (EN), and visual information fidelity (VIFF), as suited to the fusion application at hand.
Losses are often weighted adaptively, using spatial/frequency attention masks or region-specific terms (as in RPFNet and SFDFusion) to account for modality-specific content or distinctive semantic regions. A sketch of such a masked frequency loss follows.
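The sketch below illustrates the general shape of region-masked frequency losses, in the spirit of RPFNet's contrastive constraint: match IR amplitude spectra in salient regions and visible spectra elsewhere. The masking scheme and plain L1 comparison are simplifying assumptions, not the published loss.

```python
import torch
import torch.nn.functional as F

def masked_frequency_loss(fused, ir, vis, ir_mask):
    """L1 loss on amplitude spectra, matched region by region (illustrative).

    fused, ir, vis: (B, C, H, W) images; ir_mask: (B, 1, H, W) in [0, 1],
    marking regions where the fused image should follow the IR source
    (and the complement where it should follow the visible source).
    """
    def amp(x):
        return torch.fft.rfft2(x, norm="ortho").abs()

    # Mask in the spatial domain first, then compare amplitude spectra.
    loss_ir = F.l1_loss(amp(fused * ir_mask), amp(ir * ir_mask))
    loss_vis = F.l1_loss(amp(fused * (1 - ir_mask)), amp(vis * (1 - ir_mask)))
    return loss_ir + loss_vis
```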
5. Empirical Impact and Benchmark Results
Frequency-aware fusion architectures consistently improve both objective metrics and qualitative outcomes across domains:
- Image fusion (IVIF, pan-sharpening, medical, underwater): SFDFusion, WIFE-Fusion, RPFNet, FAFNet, AdaFuse, and FUSION report significant advantages in entropy, mutual information, structural fidelity, and salient object/texture retention over spatial-only or naive fusion baselines (Hu et al., 30 Oct 2024, Zhang et al., 4 Jun 2025, Zheng et al., 9 Jul 2025, Xing et al., 2022, Gu et al., 2023, Walia et al., 1 Apr 2025).
- Dense prediction and segmentation: Frequency-aware fusion modules underpin improvements in intra-class similarity, boundary localization accuracy, and overall mIoU/AP on semantic, instance, and panoptic segmentation, as well as segmentation of finely structured cracks (Chen et al., 23 Aug 2024, Liu et al., 30 Jul 2025).
- Detection and recognition: Enhanced fused representations directly yield higher mean average precision (mAP) in detection tasks and improve BCI frequency recognition accuracy and information transfer rates (Hu et al., 30 Oct 2024, Zheng et al., 9 Jul 2025, Zhang et al., 2018, Ye et al., 2 Aug 2025), while outperforming attention-heavy spatial-only counterparts at lower complexity (Lv et al., 28 Aug 2025).
- Multimodal representation and classification: Frequency domain transforms (FFT/DFT) and spectrum compression enable highly discriminative, globally-consistent fusion for rumor detection and cross-modal classification; see FSRU's O(N log N) spectral fusion (Lao et al., 2023), FDCT's token alignment and robustness (Sami et al., 12 Mar 2025), and FA³-CLIP's unified digital/physical attack detection via multi-layer frequency compression (Li et al., 1 Apr 2025).
Ablation studies in multiple works uniformly demonstrate that omitting frequency-aware modules or loss terms results in notable degradation of both objective and subjective metrics—highlighting the necessity of frequency-domain modeling and its synergy with spatial fusion.
6. Representative Case Studies
Infrared–Visible Image Fusion (IVIF): SFDFusion (Hu et al., 30 Oct 2024)
SFDFusion implements parallel branches: DMRM for spatial-domain refinement (edge/gradient extraction plus attention) and FDFM for frequency-based fusion (FFT → amplitude/phase integration → IFFT). The frequency-related loss targets region-specific reconstruction, and empirical results on MSRS/M3FD show best-in-class entropy (EN: 6.670), spatial frequency (SF: 11.070), mutual information (MI: 3.914), and visual fidelity. Ablations demonstrate that both domains are necessary.
Multi-Modal Image Fusion: WIFE-Fusion (Zhang et al., 4 Jun 2025)
WIFE-Fusion uses DWT/IDWT to decompose features into canonical subbands. Intra-frequency self-attention (IFSA) aligns cross-modal bands at the same frequency, while inter-frequency interaction (IFI) permits cross-band, cross-modal exchange. Fusion in the frequency domain, followed by spatial domain reconstruction, achieves top scores in information preservation and detection performance across five fusion benchmarks.
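A minimal illustration of the wavelet sub-band decomposition WIFE-Fusion builds on, using the PyWavelets package. The naive per-band fusion rules here stand in for the paper's IFSA/IFI attention and are not the published method.

```python
import numpy as np
import pywt

# Decompose each modality into one low-frequency (LL) and three
# high-frequency (LH, HL, HH) sub-bands with a single-level Haar DWT.
ir = np.random.rand(256, 256).astype(np.float32)   # placeholder infrared image
vis = np.random.rand(256, 256).astype(np.float32)  # placeholder visible image

ll_ir, (lh_ir, hl_ir, hh_ir) = pywt.dwt2(ir, "haar")
ll_vis, (lh_vis, hl_vis, hh_vis) = pywt.dwt2(vis, "haar")

# Naive per-band rules as stand-ins for IFSA/IFI attention: average the
# low frequencies (global structure), keep the stronger high-frequency
# response per coefficient (detail preservation).
ll = 0.5 * (ll_ir + ll_vis)
bands = tuple(np.where(np.abs(a) > np.abs(b), a, b)
              for a, b in [(lh_ir, lh_vis), (hl_ir, hl_vis), (hh_ir, hh_vis)])

fused = pywt.idwt2((ll, bands), "haar")  # reconstruct in the spatial domain
```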
Deepfake Detection: SFMFNet (Lv et al., 28 Aug 2025)
The spatial-frequency hybrid aware (SFHA) module combines wavelet (frequency) and spatial attention signals via a learnable gate, directly targeting artifact detection. Token-selective cross-attention integrates features across levels, enabling lightweight, real-time detection while robustly handling subtle manipulations.
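A generic sketch of the learnable-gate pattern described for the SFHA module: a per-pixel gate softly selects between frequency and spatial branch features. The gate design is an illustrative assumption, not SFMFNet's exact architecture.

```python
import torch
import torch.nn as nn

class GatedSpatialFrequencyFusion(nn.Module):
    """Illustrative learnable gate between spatial and frequency branches."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, spatial_feat: torch.Tensor,
                freq_feat: torch.Tensor) -> torch.Tensor:
        # Predict a per-pixel, per-channel mixing weight from both branches.
        g = self.gate(torch.cat([spatial_feat, freq_feat], dim=1))
        return g * freq_feat + (1 - g) * spatial_feat  # soft branch selection
```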
7. Broader Implications and Directions
Frequency-aware fusion architectures have demonstrated fundamental advantages for multimodal tasks involving heterogeneous information, detail preservation, or cross-domain robustness. Central themes include:
- Synergy of local (spatial) and global (frequency) information: Essential in domains with spatially distributed artifacts, global degradations, or complementary cues.
- Physics-informed design: Sensor characteristics and task phenomenology frequently underpin decomposition and fusion heuristics (e.g., event camera fusion (Sun et al., 25 Mar 2025), underwater restoration (Walia et al., 1 Apr 2025), SSVEP BCI decoding (Zhang et al., 2018)).
- Efficiency: Spectral embeddings and frequency-domain attention often deliver O(N log N) complexity, superior interpretability, and lower memory/computation cost compared to spatial self-attention (see the sketch below).
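An FNet-style illustration of this complexity argument, assuming token features of shape (B, N, D): a single FFT mixes all N tokens globally in O(N log N), versus the O(N^2) pairwise interactions of self-attention. This is a generic sketch, not code from the cited works.

```python
import torch

def spectral_token_mixing(tokens: torch.Tensor) -> torch.Tensor:
    """Mix tokens globally with a 2D FFT over the (sequence, feature) axes.

    tokens: (B, N, D). One FFT costs O(N log N) along the sequence axis,
    whereas pairwise self-attention costs O(N^2); keeping the real part
    follows the FNet-style recipe.
    """
    return torch.fft.fft2(tokens, dim=(-2, -1)).real
```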
A plausible implication is that as multi-modal and information-rich pipelines become more common, frequency-aware fusion methods will remain pivotal, especially in domains where spatial-only representation is insufficient for robust, contextual, or degradation-robust reasoning.
References:
- (Hu et al., 30 Oct 2024) SFDFusion: An Efficient Spatial-Frequency Domain Fusion Network for Infrared and Visible Image Fusion
- (Zhang et al., 4 Jun 2025) WIFE-Fusion: Wavelet-aware Intra-inter Frequency Enhancement for Multi-modal Image Fusion
- (Zheng et al., 9 Jul 2025) Residual Prior-driven Frequency-aware Network for Image Fusion
- (Zhang et al., 12 Jun 2025) FSATFusion: Frequency-Spatial Attention Transformer for Infrared and Visible Image Fusion
- (Xing et al., 2022) Pansharpening via Frequency-Aware Fusion Network with Explicit Similarity Constraints
- (Walia et al., 1 Apr 2025) FUSION: Frequency-guided Underwater Spatial Image recOnstructioN
- (Liu et al., 16 Sep 2025) MFAF: An EVA02-Based Multi-scale Frequency Attention Fusion Method for Cross-View Geo-Localization
- (Chen et al., 23 Aug 2024) Frequency-aware Feature Fusion for Dense Image Prediction
- (Lao et al., 2023) Frequency Spectrum is More Effective for Multimodal Representation and Fusion
- (Li et al., 1 Apr 2025) FA³-CLIP: Frequency-Aware Cues Fusion and Attack-Agnostic Prompt Learning for Unified Face Attack Detection