AdaFuse: Adaptive Medical Image Fusion

Updated 2 May 2026

The paper introduces a novel frequency-guided cross-attention fusion mechanism that adaptively combines spatial and frequency features from co-registered modalities.
Two fusion branches, spatial cross-attention and Fourier-guided fusion, preserve high-frequency detail and low-frequency contrast in medical images.
On the Harvard AANLIB benchmarks, AdaFuse scores higher than IFCNN, U2Fusion, and SwinFusion on EN, PSNR, MI, CC, and FMI, with the largest margins in MI and FMI and only marginal gains in PSNR and CC.

Multimodal image fusion refers to the integration of complementary data from heterogeneous image modalities (e.g., CT, MRI, PET) into a single composite image that retains diagnostically relevant details from each source. AdaFuse, introduced in "Adaptive Medical Image Fusion Based on Spatial-Frequential Cross Attention," merges spatial and frequency-domain features with a cross-attention mechanism. It targets a key limitation of earlier deep-learning fusion schemes: they do not distinguish and adaptively combine modality-specific low- and high-frequency information (Gu et al., 2023).

1. Motivations and Problem Definition

Most classic and early deep-learning fusion frameworks lack explicit mechanisms to differentiate and adaptively weigh the contributions of diverse modalities’ low- versus high-frequency components. This constraint hampers the preservation of structural details (high-frequency, e.g., bone edges in CT) and global contrast (low-frequency, e.g., soft tissue in MR). AdaFuse was proposed to overcome these weaknesses by introducing a frequency-guided cross-attention mechanism, enabling spatially and spectrally adaptive image fusion, especially for clinical applications requiring precise tissue delineation (Gu et al., 2023).

2. Architecture: Spatial-Frequential Fusion with Cross Attention

AdaFuse’s architecture is organized as a deep multiscale encoder-fusion-decoder pipeline. Two co-registered images $I_1, I_2$ are encoded through convolutional modules across four scales. At every scale $j$, spatial features are extracted and processed through:

Cross-Attention Fusion (CAF) blocks: Feature maps $\phi_j^1, \phi_j^2$ from each modality are fused in the spatial domain via a cross-attention transformer mechanism, producing spatially fused tokens $\psi_{j}^{f_s}$.
Fourier-Guided Fusion Branch (FGFB): Each feature map undergoes a 2D Fourier transform (FT) and log-magnitude encoding to yield frequency features $\widetilde\phi_j^i = \log(|FT(\phi_j^i)|+\varepsilon)$, where $\varepsilon$ is a small constant that keeps the logarithm finite. These are fused using CAF in the frequency domain and then mapped back to the spatial domain via the inverse FT.
Spatial-Frequential Cross-Attention: The outputs of the spatial and frequency fusion blocks are further fused using a final CAF block, coupling local details with broad semantic content.

The decoder reconstructs the fused image $I_f$ by upsampling and concatenating multi-scale fused features.

Key cross-attention computations in CAF operate on projected queries ($Q$), keys ($K$), and values ($V$), with bidirectional attention scores. Fusion is computed as a weighted sum of all possible value exchanges between modalities, effectively allowing adaptive mixing dependent on both spatial and frequential relationships.
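The bidirectional exchange described above can be illustrated with a minimal NumPy sketch. It is not the paper's CAF block: the random projection matrices, the single attention head, the helper names (`cross_attention`, `caf_fuse`, `init_proj`), and the plain average of the two attention directions are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens, w_q, w_k, w_v):
    """Tokens of one modality (queries) attend to tokens of the other (keys/values)."""
    q = q_tokens @ w_q
    k = kv_tokens @ w_k
    v = kv_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    return softmax(scores, axis=-1) @ v

def init_proj(rng, dim):
    # Hypothetical random projections standing in for learned weights.
    return [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3)]

def caf_fuse(tok1, tok2, proj_12, proj_21):
    # ASSUMPTION: the two attention directions are simply averaged.
    out_12 = cross_attention(tok1, tok2, *proj_12)  # modality 1 queries modality 2
    out_21 = cross_attention(tok2, tok1, *proj_21)  # modality 2 queries modality 1
    return 0.5 * (out_12 + out_21)

rng = np.random.default_rng(0)
n_tokens, dim = 16, 8
tok1 = rng.standard_normal((n_tokens, dim))
tok2 = rng.standard_normal((n_tokens, dim))
fused = caf_fuse(tok1, tok2, init_proj(rng, dim), init_proj(rng, dim))
print(fused.shape)
```

Because the attention weights depend on the token content of both inputs, the mixing ratio differs from token to token, which is the sense in which the fusion is adaptive.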

3. Frequency Decomposition and Feature Processing

AdaFuse leverages explicit frequency decomposition to target the preservation and adaptive treatment of high-frequency structural details. The procedure consists of:

2D Fourier transform on spatial features to obtain frequency information.
Log-magnitude mapping of the Fourier coefficients, which compresses the dynamic range of the spectrum so that weaker high-frequency coefficients (corresponding to fine details) are not dominated by the much larger low-frequency ones.
Frequency-domain CAF fusion: Frequency representations from both modalities are fused with cross-attention, enhancing transfer of salient details (e.g., sharp boundaries in CT, soft contrast in MRI).
Inverse FT returns the fused frequency-domain features to the spatial domain, where they serve as inputs for the final cross-attention fusion stage.

This dual-branch approach ensures both low-frequency content and high-frequency details are adaptively weighted and integrated.
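The frequency-branch steps can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation. The log-magnitude discards phase, so the sketch keeps the phase separately in order to invert the transform. The learned frequency-domain CAF is replaced by a simple stand-in that keeps, per frequency, the coefficient with the larger magnitude. The function names and `EPS` value are hypothetical.

```python
import numpy as np

EPS = 1e-8  # small constant keeping the logarithm finite (value assumed)

def log_magnitude(feat):
    """Return log(|FT(feat)| + EPS) and the phase of the 2D spectrum."""
    spec = np.fft.fft2(feat)
    return np.log(np.abs(spec) + EPS), np.angle(spec)

def from_log_magnitude(log_mag, phase):
    """Invert log_magnitude: rebuild the spectrum, then apply the inverse 2D FT."""
    mag = np.exp(log_mag) - EPS
    return np.real(np.fft.ifft2(mag * np.exp(1j * phase)))

rng = np.random.default_rng(0)
phi1 = rng.standard_normal((32, 32))  # stand-in feature map, modality 1
phi2 = rng.standard_normal((32, 32))  # stand-in feature map, modality 2

lm1, ph1 = log_magnitude(phi1)
lm2, ph2 = log_magnitude(phi2)

# STAND-IN for the frequency-domain CAF: per-frequency max-magnitude selection.
pick_first = lm1 >= lm2
fused_lm = np.where(pick_first, lm1, lm2)
fused_phase = np.where(pick_first, ph1, ph2)

fused_spatial = from_log_magnitude(fused_lm, fused_phase)
roundtrip = from_log_magnitude(lm1, ph1)  # should reproduce phi1
print(fused_spatial.shape, np.allclose(roundtrip, phi1, atol=1e-6))
```

The round trip shows that the log-magnitude encoding loses nothing as long as the phase is carried along; how AdaFuse handles phase is not stated in this summary.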

4. Learning Objective: Combined Content and Structure Loss

The AdaFuse training regime employs a composite loss:

Content Loss: A squared L2 distance between the fused output and the arithmetic mean of the source images,

$\mathcal{L}_{\mathrm{content}} = \sum_{x,y}\left\|I_f(x,y) - \frac{I_1(x,y)+I_2(x,y)}{2}\right\|_2^2.$
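As a small worked example, the content loss can be evaluated directly from its definition. The function name `content_loss` and the toy 2x2 images are illustrative only.

```python
import numpy as np

def content_loss(fused, src1, src2):
    """Sum over pixels of the squared difference to the mean of the two sources."""
    target = 0.5 * (src1 + src2)
    return float(np.sum((fused - target) ** 2))

src1 = np.zeros((2, 2))
src2 = np.full((2, 2), 2.0)  # the pixel-wise mean of the sources is 1 everywhere
loss_at_mean = content_loss(np.ones((2, 2)), src1, src2)
loss_at_zero = content_loss(np.zeros((2, 2)), src1, src2)
print(loss_at_mean, loss_at_zero)  # 0.0 4.0
```

The loss vanishes exactly at the pixel-wise average, which is why a content-only objective pulls the output toward a smooth blend of the sources.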

Structure Loss: A sum of:
- Gradient-tensor loss based on structure tensors, penalizing differences in local orientation/strength of image gradients between the fused and source images.
- SSIM loss to explicitly enforce structural resemblance between the fused and source images.

Total Loss: A weighted sum of the content loss and the two structure terms, with all weighting hyperparameters set to 0.5 for CT/MRI fusion in the original experiments.

This multi-term loss ensures that global contrast, fine structure, and perceptual similarity are all jointly optimized.
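A rough NumPy sketch of how such a composite objective could be assembled is given below. The exact formulas of the paper are not reproduced here, so several choices are assumptions: the structure-tensor target is the mean of the two source tensors, SSIM is computed over a single global window rather than a sliding one, and the weights `w_tensor` and `w_ssim` (both 0.5) are attached to the two structure terms only.

```python
import numpy as np

def structure_tensor(img):
    """Per-pixel structure tensor entries (gx*gx, gx*gy, gy*gy)."""
    gy, gx = np.gradient(img.astype(float))  # derivatives along rows, then columns
    return np.stack([gx * gx, gx * gy, gy * gy])

def tensor_loss(fused, src1, src2):
    # ASSUMPTION: the target tensor is the mean of the two source tensors.
    target = 0.5 * (structure_tensor(src1) + structure_tensor(src2))
    return float(np.sum((structure_tensor(fused) - target) ** 2))

def global_ssim(x, y, data_range=1.0):
    # Simplified single-window SSIM over the whole image.
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (x.var() + y.var() + c2)
    return num / den

def ssim_loss(fused, src1, src2):
    # ASSUMPTION: average dissimilarity to the two sources.
    return float(1.0 - 0.5 * (global_ssim(fused, src1) + global_ssim(fused, src2)))

def content_loss(fused, src1, src2):
    return float(np.sum((fused - 0.5 * (src1 + src2)) ** 2))

def total_loss(fused, src1, src2, w_tensor=0.5, w_ssim=0.5):
    # ASSUMPTION: which terms carry a weight is not specified in this summary.
    return (content_loss(fused, src1, src2)
            + w_tensor * tensor_loss(fused, src1, src2)
            + w_ssim * ssim_loss(fused, src1, src2))

rng = np.random.default_rng(0)
img = rng.random((16, 16))
loss_identical = total_loss(img, img, img)           # all three terms vanish
loss_blank = total_loss(np.zeros_like(img), img, img)
print(loss_identical, loss_blank)
```

When both sources and the fused image coincide, every term is zero up to rounding; a blank output is penalized by all three terms at once.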

5. Empirical Evaluation and Comparative Results

AdaFuse was benchmarked on multimodal medical datasets (Harvard AANLIB: CT-MRI, PET-MRI, SPECT-MRI), using entropy (EN), PSNR, mutual information (MI), correlation coefficient (CC), and DCT-based feature MI (FMI) as metrics.
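For orientation, four of these metrics can be computed with a few lines of NumPy. The definitions below are the textbook ones and are only a sketch: the bin counts, the 8-bit intensity range, and how each metric is aggregated over the two source images in the paper's evaluation protocol are assumptions. FMI is omitted.

```python
import numpy as np

def entropy(img, bins=256):
    """Shannon entropy (bits) of the intensity histogram, 8-bit range assumed."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist[hist > 0] / hist.sum()
    return float(-np.sum(p * np.log2(p)))

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    return float("inf") if mse == 0 else float(10 * np.log10(peak ** 2 / mse))

def mutual_information(a, b, bins=64):
    """Mutual information (bits) estimated from the joint intensity histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

def correlation(a, b):
    """Pearson correlation coefficient between two images."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

# Toy image: left half 0, right half 255 (two equally likely intensities).
a = np.zeros((4, 4))
a[:, 2:] = 255
print(entropy(a), psnr(a, a), mutual_information(a, a), correlation(a, a))
```

On the toy image the entropy and the self-mutual-information are both 1 bit, which is a quick sanity check that the histogram estimates behave as expected.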

Method	EN	PSNR	MI	CC	FMI
IFCNN	4.605±0.24	62.951±0.80	2.817±0.18	0.803±0.03	0.380±0.01
U2Fusion	4.792±0.24	63.976±0.80	2.679±0.20	0.824±0.03	0.230±0.01
SwinFusion	4.685±0.22	62.888±0.82	2.979±0.18	0.797±0.02	0.361±0.01
AdaFuse	5.059±0.23	64.001±0.77	3.357±0.19	0.831±0.02	0.427±0.01
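To put the table in perspective, the short script below computes AdaFuse's margin over the best baseline for each metric, using only the mean values listed above. It assumes that higher is better for all five metrics; the helper name `gains_over_best` is illustrative.

```python
baselines = {
    "IFCNN":      {"EN": 4.605, "PSNR": 62.951, "MI": 2.817, "CC": 0.803, "FMI": 0.380},
    "U2Fusion":   {"EN": 4.792, "PSNR": 63.976, "MI": 2.679, "CC": 0.824, "FMI": 0.230},
    "SwinFusion": {"EN": 4.685, "PSNR": 62.888, "MI": 2.979, "CC": 0.797, "FMI": 0.361},
}
adafuse = {"EN": 5.059, "PSNR": 64.001, "MI": 3.357, "CC": 0.831, "FMI": 0.427}

def gains_over_best(baselines, ours):
    """Map metric -> (best baseline, absolute gain, relative gain in percent)."""
    result = {}
    for metric, value in ours.items():
        best_name = max(baselines, key=lambda name: baselines[name][metric])
        best = baselines[best_name][metric]
        result[metric] = (best_name, value - best, 100.0 * (value - best) / best)
    return result

gains = gains_over_best(baselines, adafuse)
for metric, (name, abs_gain, rel_gain) in gains.items():
    print(f"{metric}: +{abs_gain:.3f} over {name} ({rel_gain:.2f}%)")
```

The margins are largest for MI (about 12.7% over SwinFusion) and FMI (about 12.4% over IFCNN), moderate for EN (about 5.6%), and small for CC (0.007) and PSNR (0.025 dB), the last two being smaller than the ± values reported in the table.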

AdaFuse preserves tissue and bone contrast, retains fine edges, and avoids color artifacts, both when fusing anatomical pairs (CT-MRI) and when fusing functional with anatomical images (PET-MRI, SPECT-MRI).

6. Ablation Analyses and Fusion Strategy Insights

Comprehensive ablation experiments reveal:

CAF block superiority: Cross-attention fusion outperforms hand-crafted rules (average, L1-norm, max) in MI, FMI, and CC, supporting the case for adaptive attention over fixed fusion rules.
Impact of Fourier-Guided Branch: Removal of FGFB increases image entropy but degrades PSNR and high-frequency detail, evidencing the role of frequency-domain attention for sharpness.
Loss composition: Isolating content or structure loss leads to either over-smooth (content-only) or over-sharp/contrast-deficient (structure-only) results. Only the unified loss yields both correct contrast and sharp edges.

This supports the architectural rationale for spatial-frequency dual-branch fusion and composite optimization objectives.
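For reference, the three hand-crafted baselines named in the ablation can be written in a few lines. These are common textbook forms, given as a sketch: the L1-norm rule in particular follows one widespread reading (channel-wise L1 activity maps turned into soft weights) and may differ from the variant used in the paper's ablation.

```python
import numpy as np

def fuse_average(f1, f2):
    """Element-wise mean of the two feature maps."""
    return 0.5 * (f1 + f2)

def fuse_max(f1, f2):
    """Element-wise maximum of the two feature maps."""
    return np.maximum(f1, f2)

def fuse_l1_norm(f1, f2, eps=1e-8):
    """Weight each map by its per-pixel L1 norm across channels (shape C x H x W)."""
    a1 = np.abs(f1).sum(axis=0, keepdims=True)  # activity map of modality 1
    a2 = np.abs(f2).sum(axis=0, keepdims=True)  # activity map of modality 2
    w1 = a1 / (a1 + a2 + eps)
    return w1 * f1 + (1.0 - w1) * f2

f1 = np.ones((2, 3, 3))   # strongly activated features
f2 = np.zeros((2, 3, 3))  # inactive features
avg, mx, l1 = fuse_average(f1, f2), fuse_max(f1, f2), fuse_l1_norm(f1, f2)
print(avg[0, 0, 0], mx[0, 0, 0], round(float(l1[0, 0, 0]), 6))
```

All three rules apply the same fixed formula at every location and scale, whereas an attention-based block learns its mixing weights from the data.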

AdaFuse typifies the trend toward adaptive, multi-domain attention architectures in fusion. Other lines of work include wavelet-based methods (e.g., adaptive PSO-optimized DT-CWT (Deepika et al., 2020)), generative methods (e.g., flow matching (Zhu et al., 17 Nov 2025)), and reliability- or causality-aware frameworks (e.g., W-DUALMINE (Islam, 13 Jan 2026), causal intervention-based fusion (Wang et al., 24 Mar 2026)). W-DUALMINE addresses global-local trade-offs and improves mutual information further by incorporating reliability-weighted dual-expert pathways, while causal-stable fusion methods tackle robustness to spurious associations by training invariance gates under explicit interventions.

A plausible implication is that AdaFuse’s spatial-frequential cross-attention framework is well suited to fusing two well-aligned medical modalities when both high-frequency structure and low-frequency contrast must be preserved. Its benefits may diminish under misregistration, unbalanced information quality, or broader cross-task requirements unless the method is extended (e.g., by reliability weighting or causal invariance mechanisms). The current evidence base is restricted to 2D slice-wise fusion, whereas volumetric (3D) approaches can exploit additional anatomical context (Liu et al., 2023).

References

"AdaFuse: Adaptive Medical Image Fusion Based on Spatial-Frequential Cross Attention" (Gu et al., 2023)
"A Novel adaptive optimization of Dual-Tree Complex Wavelet Transform for Medical Image Fusion" (Deepika et al., 2020)
"FusionFM: All-in-One Multi-Modal Image Fusion with Flow Matching" (Zhu et al., 17 Nov 2025)
"W-DUALMINE: Reliability-Weighted Dual-Expert Fusion With Residual Correlation Preservation for Medical Image Fusion" (Islam, 13 Jan 2026)
"Multi-Modal Image Fusion via Intervention-Stable Feature Learning" (Wang et al., 24 Mar 2026)
"Three-Dimensional Medical Image Fusion with Deformable Cross-Attention" (Liu et al., 2023)