
Cross-Modal Frequency Fusion Module

Updated 12 December 2025
  • Cross-Modal Frequency Fusion modules are architectural mechanisms that decompose sensor inputs into frequency bands (via Fourier or wavelet transforms) to extract and merge complementary features.
  • They use operations like cross-attention, pointwise convolutions, and adaptive filtering to align multi-source spectral components, ensuring effective integration despite spatial or semantic misalignment.
  • These modules enhance applications such as automatic target recognition (ATR), MRI reconstruction, and image fusion, providing measurable performance gains and robustness against noisy or partial sensor data.

A Cross-Modal Frequency Fusion Module refers to a family of architectural mechanisms that explicitly integrate frequency-domain feature representations across multiple sensor modalities. These modules exploit spectral decompositions—primarily through the Fourier or wavelet transforms—to align, merge, and enhance complementary frequency-specific information from heterogeneous inputs. Modern instantiations are prevalent in multi-sensor ATR, multimodal MRI, visible-infrared object detection, multi-modal image fusion, and cross-modal rumor or forgery detection. The key rationale is that many forms of domain invariance and discriminative feature synergy are more efficiently modeled in the frequency domain, enabling robust fusion even under semantic, spatial, or granularity misalignment.

1. Core Principles and Architectural Motifs

Cross-Modal Frequency Fusion modules typically operate by decomposing each modality’s signal or feature tensor into separate frequency bands, fusing these representations across modalities via domain-adapted mechanisms, and reassembling the fused representations for downstream tasks.

Central design tenets include band-wise decomposition of each modality's features, cross-modal alignment and mixing applied directly to spectral components, and joint spatial–frequency supervision of the fused output. Distinct instantiations leverage uniquely adapted mechanisms, such as frequency-domain cross-attention, prompt-guided spectral fusion, amplitude–phase disentangling, and learnable wavelet decompositions; these are detailed in Sections 2–4. The shared decompose–fuse–reassemble pattern is sketched below.
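
The following minimal PyTorch sketch illustrates that shared pattern: each modality is split into low- and high-frequency bands with an FFT mask, homologous bands are fused with 1×1 channel-mixing convolutions, and the bands are recombined. The class and parameter names (LowHighFreqFusion, cutoff_ratio) and the fixed radial mask are illustrative assumptions, not the implementation of any specific published module.

```python
import torch
import torch.nn as nn


class LowHighFreqFusion(nn.Module):
    """Split two modality feature maps into low/high FFT bands and fuse band-wise."""

    def __init__(self, channels: int, cutoff_ratio: float = 0.25):
        super().__init__()
        self.cutoff_ratio = cutoff_ratio  # normalized cutoff frequency (Nyquist = 0.5)
        # One 1x1 (channel-mixing) convolution per band fuses the concatenated modalities.
        self.fuse_low = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.fuse_high = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def _split_bands(self, x: torch.Tensor):
        h, w = x.shape[-2:]
        spec = torch.fft.rfft2(x, norm="ortho")
        # Normalized frequency radius for every FFT bin (fftfreq handles wrap-around on H).
        fy = torch.fft.fftfreq(h, device=x.device).abs()[:, None]   # (H, 1)
        fx = torch.fft.rfftfreq(w, device=x.device)[None, :]        # (1, W//2 + 1)
        low_mask = (torch.sqrt(fy ** 2 + fx ** 2) <= self.cutoff_ratio).float()
        low = torch.fft.irfft2(spec * low_mask, s=(h, w), norm="ortho")
        high = x - low                                              # residual high-frequency band
        return low, high

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        a_low, a_high = self._split_bands(feat_a)
        b_low, b_high = self._split_bands(feat_b)
        fused_low = self.fuse_low(torch.cat([a_low, b_low], dim=1))
        fused_high = self.fuse_high(torch.cat([a_high, b_high], dim=1))
        return fused_low + fused_high                               # reassembled fused features


# Example: fuse infrared and visible feature maps of shape (B, 64, H, W).
fusion = LowHighFreqFusion(channels=64)
fused = fusion(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```

Published designs replace the fixed mask with learned filters, wavelet transforms, or attention, but this skeleton of band splitting, band-wise cross-modal fusion, and recombination recurs throughout the methods surveyed below.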

2. Frequency Decomposition Techniques

Most Cross-Modal Frequency Fusion modules begin with a formal spectral decomposition. Common techniques include the 2D fast Fourier transform (FFT), which yields amplitude and phase spectra; the discrete wavelet transform (DWT), which yields directional subbands; learned low-/high-frequency splits; and learnable wavelet-like transforms such as AdaWAT (see the code sketch after the list below).

These decompositions yield multiple band-specific representations per modality, e.g.,

  • f_L, f_H for low- and high-frequency components (FDCT, RPFNet),
  • (LL, LH, HL, HH) subbands (WIFE-Fusion, AdaSFFuse),
  • amplitude and phase spectra (UniFS, MMR-Mamba) suitable for cross-modal fusion.
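
As a hedged illustration of the two most common decompositions, the snippet below computes an FFT amplitude/phase split (with recomposition via torch.polar) and a single-level Haar DWT into (LL, LH, HL, HH) subbands. The function names and fixed Haar filters are assumptions for exposition; published modules (e.g., AdaWAT) use learned or multi-level transforms.

```python
import torch
import torch.nn.functional as F


def fft_amp_phase(x: torch.Tensor):
    """Return amplitude and phase spectra of a (B, C, H, W) feature map."""
    spec = torch.fft.fft2(x, norm="ortho")
    return spec.abs(), spec.angle()


def recompose(amplitude: torch.Tensor, phase: torch.Tensor) -> torch.Tensor:
    """Rebuild a spatial map from (possibly fused) amplitude and phase spectra."""
    spec = torch.polar(amplitude, phase)        # amplitude * exp(i * phase)
    return torch.fft.ifft2(spec, norm="ortho").real


def haar_dwt(x: torch.Tensor):
    """Single-level Haar DWT returning LL, LH, HL, HH subbands at half resolution."""
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    kernels = torch.stack([ll, lh, hl, hh]).unsqueeze(1)   # (4, 1, 2, 2)
    b, c, h, w = x.shape
    bands = F.conv2d(x.reshape(b * c, 1, h, w), kernels.to(x), stride=2)
    return bands.reshape(b, c, 4, h // 2, w // 2).unbind(dim=2)  # LL, LH, HL, HH


# One simple recombination of FFT spectra: average the two amplitude spectra and
# keep modality A's phase (actual modules learn selective or gated combinations).
x_a, x_b = torch.randn(1, 8, 32, 32), torch.randn(1, 8, 32, 32)
amp_a, pha_a = fft_amp_phase(x_a)
amp_b, _ = fft_amp_phase(x_b)
fused = recompose(0.5 * (amp_a + amp_b), pha_a)
ll, lh, hl, hh = haar_dwt(x_a)
```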

3. Cross-Modal Frequency Domain Fusion Mechanisms

Fusion in the frequency domain employs a range of operators, with key strategies including the following (an illustrative code sketch appears after this list):

  • Frequency-domain convolutions: 1×1 (channel-mixing) convolutions applied at each frequency coordinate, allowing global feature integration with linear parameter complexity (Zheng et al., 9 Jul 2025). This design supports spatially unconstrained mixing at low computational cost.
  • Cross-attention in frequency space:
    • MFDA in FreDFT implements frequency-domain cross-attention by computing FFT-based query-key correlations, replacing spatial-domain softmax with element-wise frequency multiplication (Wu et al., 13 Nov 2025).
    • IFSA in WIFE-Fusion applies band-specific cross-modal self-attention to enable each frequency band in one modality to attend to the homologous band in the partner (Zhang et al., 4 Jun 2025).
  • Prompt-guided and sparsity-aware fusion:
    • UniFS integrates adaptive prompts encoding k-space sampling patterns into the frequency fusion operator, allowing dynamic adaptation to varying acquisition regimes (Li et al., 5 Dec 2025).
    • FDCT employs prototype, sparse cross-modal, and instance-wise alignment constraints in a learned token space following frequency-aware decomposition (Sami et al., 12 Mar 2025).
  • Amplitude–phase and inter-band disentangling: FFT-based modules such as UniFS and MMR-Mamba separate amplitude and phase spectra and fuse them with selective, gated operators (e.g., ASFF gating), while wavelet-based designs exchange information between homologous subbands before the inverse transform (Li et al., 5 Dec 2025, Zou et al., 27 Jun 2024).
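
The cross-attention strategy above can be sketched as follows, under the simplifying assumption that queries, keys, and values are produced by 1×1 convolutions, transformed with an FFT, and correlated by element-wise spectral multiplication in place of a spatial softmax. This is an illustrative pattern only; the exact MFDA and IFSA operators in the cited papers differ in their details.

```python
import torch
import torch.nn as nn


class FrequencyCrossAttention(nn.Module):
    """Cross-modal attention computed by element-wise multiplication in the FFT domain."""

    def __init__(self, channels: int):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)  # query from modality A
        self.to_k = nn.Conv2d(channels, channels, kernel_size=1)  # key from modality B
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)  # value from modality B
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        q = torch.fft.rfft2(self.to_q(feat_a), norm="ortho")
        k = torch.fft.rfft2(self.to_k(feat_b), norm="ortho")
        v = torch.fft.rfft2(self.to_v(feat_b), norm="ortho")
        # conj(k) * q is a per-frequency cross-modal correlation; multiplying by v lets
        # modality B's content modulate modality A at matching spectral components.
        fused = torch.fft.irfft2(torch.conj(k) * q * v, s=feat_a.shape[-2:], norm="ortho")
        return feat_a + self.proj(fused)  # residual connection back to modality A


attn = FrequencyCrossAttention(channels=32)
out = attn(torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64))
```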

4. End-to-End Training and Objectives

End-to-end optimization typically supervises both spatial-domain and frequency-domain outputs. Common composite losses include the following (a hedged loss sketch appears after the list):

  • Joint spatial–frequency loss: Penalizes per-pixel errors in the spatial domain and spectral discrepancies in reconstructed amplitude/phase (Li et al., 5 Dec 2025, Zou et al., 27 Jun 2024).
  • Frequency-contrastive and structural similarity losses: Enforce frequency-domain similarity between fused and unimodal (e.g., IR, VIS) outputs in frequency-specific regions, combined with SSIM for local detail preservation (Zheng et al., 9 Jul 2025, Wang et al., 21 Aug 2025).
  • Token- or prototype-alignment objectives: InfoNCE, sparsemax, and cross-entropy losses enforce alignment between frequency-aware fused representation and modality-specific tokens or prototypes (FDCT) (Sami et al., 12 Mar 2025).
  • Auxiliary objectives: Jensen–Shannon divergence, global attention gating, or saliency structure losses, providing further incentives for frequency-aware harmonization (Lao et al., 2023, Berjawi et al., 20 Oct 2025, Zheng et al., 9 Jul 2025).
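
A minimal sketch of the joint spatial–frequency loss family is given below: an L1 pixel term plus penalties on FFT amplitude and phase discrepancies. The weights lambda_amp and lambda_pha are illustrative hyperparameters, and the contrastive, SSIM, and auxiliary terms listed above are omitted.

```python
import torch
import torch.nn.functional as F


def spatial_frequency_loss(pred: torch.Tensor, target: torch.Tensor,
                           lambda_amp: float = 0.1, lambda_pha: float = 0.1) -> torch.Tensor:
    """L1 pixel loss plus L1 penalties on FFT amplitude and (wrap-safe) phase errors."""
    spatial = F.l1_loss(pred, target)
    pred_spec = torch.fft.rfft2(pred, norm="ortho")
    tgt_spec = torch.fft.rfft2(target, norm="ortho")
    amp = F.l1_loss(pred_spec.abs(), tgt_spec.abs())
    # Compare phases through cos/sin to avoid the 2*pi wrap-around discontinuity.
    pha = (F.l1_loss(torch.cos(pred_spec.angle()), torch.cos(tgt_spec.angle()))
           + F.l1_loss(torch.sin(pred_spec.angle()), torch.sin(tgt_spec.angle())))
    return spatial + lambda_amp * amp + lambda_pha * pha


loss = spatial_frequency_loss(torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64))
```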

All module parameters, including frequency filters, gating masks, attention weights, and downstream prediction heads, are trained jointly in a differentiable pipeline.

5. Empirical Performance and Impact

Cross-Modal Frequency Fusion modules deliver performance improvements across diverse domains:

  • ATR and multimodal detection/recognition: FDCT exceeds state-of-the-art single- and multi-modal baselines on four ATR datasets, with ablation gains of up to +1.86 pp for individual alignment losses (Sami et al., 12 Mar 2025). FMCAF achieves +13.9% mAP@50 over concatenation baselines on VEDAI vehicle detection (Berjawi et al., 20 Oct 2025).
  • Medical imaging (MRI reconstruction): UniFS’s explicit frequency-guided fusion yields +1.8 dB PSNR improvement on BraTS with strong out-of-domain robustness under unseen k-space patterns (Li et al., 5 Dec 2025). MMR-Mamba’s SFF/ASFF modules recover high-frequency structure and reinforce channel-wise domain complementarity for MR reconstruction (Zou et al., 27 Jun 2024).
  • Image fusion and synthesis: RPFNet’s frequency-domain fusion module efficiently models global dependencies and enforces complementary feature integration via frequency contrastive and SSIM losses (Zheng et al., 9 Jul 2025). AdaSFFuse’s learnable AdaWAT/Spatial-Frequency Mamba block improves SSIM over FFT and fixed wavelet baselines, with low parameter and compute cost (Wang et al., 21 Aug 2025). WIFE-Fusion’s intra- and inter-frequency cross-modal attention yields sharply preserved edges and detail in fused images (Zhang et al., 4 Jun 2025).
  • Multimodal rumor detection: The frequency-domain spectrum fusion module in FSRU outperforms spatial-only baselines, with dual contrastive objectives ensuring informative cross-modal frequency feature learning (Lao et al., 2023).
  • Deepfake detection: Hierarchical cross-modal fusion, combining frequency-aware attention at multiple hierarchy levels, maximizes spatial–frequency synergy and generalization on unseen forgeries (Qiao et al., 24 Apr 2025).

Summary ablation and comparison tables (see the FMCAF, RPFNet, and AdaSFFuse sections) consistently show that frequency-domain fusion modules contribute distinct, quantifiable improvements on most multimodal benchmarks.

6. Open Challenges and Research Directions

While Cross-Modal Frequency Fusion modules demonstrate substantial empirical gains, several open directions are under active investigation:

  • Robustness to misalignment and heterogeneity: Continued work addresses spatial, spectral, or semantic misalignment, developing alignment-aware attention and tokenization strategies (Sami et al., 12 Mar 2025, Zhang et al., 4 Jun 2025).
  • Adaptive frequency representation: Research explores learnable, task-adaptive decomposition (AdaWAT), prompting, and masking methods to match varying data distributions and undersampling conditions (Wang et al., 21 Aug 2025, Li et al., 5 Dec 2025).
  • Model efficiency: Designs such as channel-mixing in frequency space, shallow parameter-efficient state-space blocks (Mamba2), and invertible frequency modules are advancing compactness and real-time applicability (Zheng et al., 9 Jul 2025, Zou et al., 27 Jun 2024, Wang et al., 21 Aug 2025).
  • Interpretation and visualization: Understanding the interpretable role of learned frequency filters, attention weights, and alignment mappings remains a priority, especially for clinical and critical real-world deployments.

7. Representative Implementations and Summary Table

The following table summarizes distinctive implementations:

| Paper | Frequency Decomposition | Fusion Mechanism | Application |
|---|---|---|---|
| FDCT (Sami et al., 12 Mar 2025) | Learned low/high split | Token-space alignment & alignment losses | ATR / image classification |
| UniFS (Li et al., 5 Dec 2025) | FFT (amplitude/phase) | Prompt-guided fusion, 1×1 spectral conv | MRI reconstruction |
| FreDFT (Wu et al., 13 Nov 2025) | FFT per channel | FFT cross-attention (MFDA), multi-scale FFD | Object detection |
| RPFNet (Zheng et al., 9 Jul 2025) | FFT per channel | 1×1 frequency conv (global mixing) | Image fusion |
| WIFE-Fusion (Zhang et al., 4 Jun 2025) | DWT | Intra-/inter-frequency cross-attention | Image fusion |
| FMCAF (Berjawi et al., 20 Oct 2025) | FFT, learned mask | Filtered cross-attention (MCAF) | Object detection |
| AdaSFFuse (Wang et al., 21 Aug 2025) | AdaWAT (learnable) | Spatial-frequency Mamba block, 2D-SSD update | Image fusion |
| FSRU (Lao et al., 2023) | 1D DFT on sequence/patch | Compressed cross-modal spectrum masks | Rumor detection |
| MMR-Mamba (Zou et al., 27 Jun 2024) | FFT (amplitude/phase) | Selective frequency fusion, ASFF gating | MRI reconstruction |

This suggests that frequency-informed cross-modal fusion has become a general-purpose architectural motif: it underpins robust, efficient, and generalizable multimodal perception across diverse domains and remains a focus of methodological refinement and ablation-driven optimization.
