Self-Supervised Fusion Loss

Updated 22 November 2025
  • Self-Supervised Fusion Loss refers to a class of objective functions that supervise modality fusion without labeled fusion targets, using statistical, edge-preservation, and contrastive criteria.
  • It employs techniques like SSIM, gradient alignment, and decomposition losses to retain complementary information from diverse sensor inputs.
  • Adaptive network architectures and meta-learned loss modules ensure robust performance for tasks such as segmentation, classification, and denoising.

Self-supervised fusion loss refers to a class of objective functions that supervise modality fusion—spatial, spectral, temporal, or multimodal—without requiring annotated ground-truth fusion outputs. These losses are foundational in developing deep learning methods for fusing heterogeneous sensor data, achieving both low-level information integration and, in advanced designs, optimizing fused representations for downstream tasks such as segmentation or classification. The following sections provide a rigorous overview of core loss formulations, mechanisms for enforcing complementary retention, representative architectures, and their impacts across diverse application domains.

1. Mathematical Foundations of Self-Supervised Fusion Losses

Self-supervised fusion losses operate without explicit fused ground-truth, leveraging statistics of the source inputs or priors on the intended properties of the fused output. Canonical approaches can be grouped as follows:

a. Structure of Similarity and Edge-Preservation Losses

In RGB-NIR image fusion, the self-supervised fusion loss is

L_\text{total} = L_\text{SoS} + L_\text{EP}

where

L_\text{SoS} = L_\text{SSIM}(F, I_n) + L_\text{SSIM}(F, I_v)

with

L_\text{SSIM}(F, I) = 1 - \frac{1}{|\Omega|} \sum_{x\in\Omega} \text{SSIM}_x(F, I)

and

L_\text{EP} = EP(F, I_n) + EP(F, I_v),\quad EP(A,B) = \frac{1}{|\Omega|} \sum_{x\in\Omega} \|\nabla A(x) - \nabla B(x)\|_2^2

This formulation encourages the fused output F to inherit both the global structure and the sharp edge patterns of both inputs (Ofir et al., 2023).
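The PyTorch sketch below instantiates L_total = L_SoS + L_EP for single-channel, registered inputs in [0, 1]. The box-window SSIM and forward-difference gradients are simplifications rather than the exact implementation of Ofir et al. (2023), and the window size and stability constants are illustrative.

```python
import torch
import torch.nn.functional as F

def ssim_map(a, b, win=7, c1=0.01 ** 2, c2=0.03 ** 2):
    # Local SSIM with a simple box window via average pooling (inputs assumed in [0, 1]).
    mu_a = F.avg_pool2d(a, win, stride=1, padding=win // 2)
    mu_b = F.avg_pool2d(b, win, stride=1, padding=win // 2)
    var_a = F.avg_pool2d(a * a, win, stride=1, padding=win // 2) - mu_a ** 2
    var_b = F.avg_pool2d(b * b, win, stride=1, padding=win // 2) - mu_b ** 2
    cov = F.avg_pool2d(a * b, win, stride=1, padding=win // 2) - mu_a * mu_b
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def l_ssim(fused, src):
    # L_SSIM(F, I) = 1 - mean_x SSIM_x(F, I)
    return 1.0 - ssim_map(fused, src).mean()

def l_ep(fused, src):
    # EP(A, B): mean squared difference of forward-difference image gradients.
    dxf, dyf = fused[..., :, 1:] - fused[..., :, :-1], fused[..., 1:, :] - fused[..., :-1, :]
    dxs, dys = src[..., :, 1:] - src[..., :, :-1], src[..., 1:, :] - src[..., :-1, :]
    return ((dxf - dxs) ** 2).mean() + ((dyf - dys) ** 2).mean()

def fusion_loss(fused, nir, vis):
    # L_total = L_SoS + L_EP, each summed over both source images.
    return (l_ssim(fused, nir) + l_ssim(fused, vis)) + (l_ep(fused, nir) + l_ep(fused, vis))
```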

b. Contrastive and Mutual Information-Based Losses

Multimodal fusion can be enforced with InfoNCE-based contrastive losses at different scales:

  • Image-level (Global) Contrastive Loss

\mathcal L_{\mathrm{img}} = -\frac{1}{N} \sum_{i=1}^N \log\frac{h(s_1^i,s_2^i)}{\sum_{j=1}^N h(s_1^i, s_2^j)}

  • Super-pixel (Pixel) Contrastive Loss

\mathcal L_{\mathrm{sp}} = -\frac{1}{K} \sum_{k=1}^K \log\frac{h(p_1^k,p_2^k)}{\sum_{\ell=1}^K h(p_1^k, p_2^\ell)}

Contrastive learning is used both intra-domain (e.g., different augmentations of SAR or optical data) and cross-domain, often with a pixel- or local-feature alignment emphasis for dense prediction tasks (Chen et al., 2021, Wei et al., 2024).
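Both levels reduce to the same InfoNCE template. The sketch below uses an exponentiated cosine similarity for h(·,·) via L2 normalization and a temperature; the temperature value, normalization, and one-directional form are assumptions, not the exact settings of Chen et al. (2021) or Wei et al. (2024).

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    # One-directional InfoNCE over N paired embeddings: row i of z1 is positive
    # with row i of z2 and negative with all other rows of z2.
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                  # (N, N) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)             # -log softmax of the diagonal

# Image-level loss: embeddings from globally pooled features of each modality.
# Super-pixel loss: embeddings pooled inside corresponding super-pixels.
loss_img = info_nce(torch.randn(32, 128), torch.randn(32, 128))
```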

c. Pretext Tasks and Decomposition Losses

Recent designs employ auxiliary synthesis tasks to drive universal fusion pretraining. For example, decomposition-based losses require the network to extract “common” and “unique” features from masked, noised input pairs:

L_\text{CUD}(x^1,x^2) = \|PH(f_c) - [M_1 \odot x] \cap [M_2 \odot x]\|_1 + \cdots

for both common (f_c) and unique (f_u^i) branches, plus masked token prediction or reconstruction constraints (Liang et al., 2024). This disentangles shared vs. modality-specific content.
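A minimal sketch of the common-branch term only: it assumes the intersection of the two masked views can be approximated by applying the elementwise product of the binary masks M_1 and M_2 to the shared input x, and that PH is a 1×1 convolutional projection head. The unique-branch and masked-reconstruction terms indicated by the ellipsis are omitted, and none of these choices are prescribed by Liang et al. (2024).

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    # Hypothetical projection head PH mapping common features back to image space.
    def __init__(self, feat_ch, out_ch=1):
        super().__init__()
        self.proj = nn.Conv2d(feat_ch, out_ch, kernel_size=1)

    def forward(self, f_common):
        return self.proj(f_common)

def common_decomposition_term(ph, f_common, x, m1, m2):
    # Target = content visible in *both* masked views, approximated as (M1 * M2) ⊙ x.
    target = (m1 * m2) * x
    return (ph(f_common) - target).abs().mean()   # L1 term of L_CUD (other terms omitted)
```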

d. Learnable or Adaptive Losses

Meta-learned losses, parameterized by neural subnetworks, are optimized to best supervise the underlying fusion network via a bi-level MAML scheme: the learned loss is shaped so that reconstruction fidelity of the sources from the fused output guides the fusion process (Bai et al., 2023).
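A bi-level sketch of this idea using `torch.func.functional_call`: the inner step updates a toy fusion network with the learned loss, and the outer step backpropagates source-reconstruction fidelity into the loss module. `TinyFusionNet`, `LearnableLoss`, the single-step inner update, and the L1 reconstruction objective are illustrative stand-ins, not the Restormer-based design of Bai et al. (2023).

```python
import torch
import torch.nn as nn
from torch.func import functional_call

class TinyFusionNet(nn.Module):
    # Toy fusion network: concatenate two single-channel inputs and map back to one channel.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, x1, x2):
        return self.net(torch.cat([x1, x2], dim=1))

class LearnableLoss(nn.Module):
    # Hypothetical learnable loss: predicts per-pixel weights blending intensity
    # losses against the two sources.
    def __init__(self):
        super().__init__()
        self.weight_net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                        nn.Conv2d(8, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, fused, x1, x2):
        w = self.weight_net(torch.cat([fused, x1, x2], dim=1))   # per-pixel weight in [0, 1]
        return (w * (fused - x1).abs() + (1 - w) * (fused - x2).abs()).mean()

def bilevel_step(fusion_net, loss_net, x1, x2, inner_lr=1e-3):
    # Inner step: one gradient update of the fusion net under the learned loss (graph kept).
    inner_loss = loss_net(fusion_net(x1, x2), x1, x2)
    names, params = zip(*fusion_net.named_parameters())
    grads = torch.autograd.grad(inner_loss, params, create_graph=True)
    fast_params = {n: p - inner_lr * g for n, p, g in zip(names, params, grads)}
    # Outer step: reconstruction fidelity of both sources from the updated fused output.
    fused_fast = functional_call(fusion_net, fast_params, (x1, x2))
    return (fused_fast - x1).abs().mean() + (fused_fast - x2).abs().mean()

# Usage sketch: the meta-optimizer updates only the learnable loss module.
fusion_net, loss_net = TinyFusionNet(), LearnableLoss()
meta_opt = torch.optim.Adam(loss_net.parameters(), lr=1e-4)
x1, x2 = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
meta_opt.zero_grad()
bilevel_step(fusion_net, loss_net, x1, x2).backward()
meta_opt.step()
```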

e. Advanced Task-Driven or Consistency Losses

For semantic fusion, high-level consistency losses enforce agreement between feature-level and pixel-level segmentation predictions without explicit manual annotations:

L_\text{CSC} = \frac{1}{2}[L_\text{hyb}(\hat p^A,\tilde p) + L_\text{hyb}(\hat p^B, \tilde p)]

where the hybrid loss L_\text{hyb} combines cross-entropy and Dice losses on self-generated pseudo-labels (Zhao et al., 26 Sep 2025).
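A sketch of L_CSC with an equally weighted cross-entropy + Dice hybrid. How the pseudo-label p̃ is produced (e.g., by a segmentation head applied to the fused image) and the relative weighting of the two terms are assumptions rather than the exact recipe of Zhao et al. (26 Sep 2025).

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    # Soft multi-class Dice against integer pseudo-labels of shape (B, H, W).
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes=probs.size(1)).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    denom = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    return 1.0 - ((2 * inter + eps) / (denom + eps)).mean()

def hybrid_loss(logits, pseudo):
    # L_hyb = cross-entropy + Dice on self-generated pseudo-labels (equal weights assumed).
    return F.cross_entropy(logits, pseudo) + dice_loss(logits, pseudo)

def csc_loss(logits_a, logits_b, pseudo):
    # L_CSC = 0.5 * (L_hyb(p_hat_A, p_tilde) + L_hyb(p_hat_B, p_tilde))
    return 0.5 * (hybrid_loss(logits_a, pseudo) + hybrid_loss(logits_b, pseudo))
```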

2. Mechanisms for Retention of Complementary Information

Self-supervised losses are constructed to guarantee that neither modality’s salient characteristics are lost:

  • SSIM/SoS terms: Preserve large-scale structure, contrast, and textural cues from all sources, as observed in infrared-visible fusion (Ofir et al., 2023).
  • Edge-Preservation/Gradient losses: Prevent blurring of discriminative boundary information, ensuring sharp contours from each channel are present in the fused result (Ofir et al., 2023).
  • Contrastive terms: Align corresponding local or global features across modalities, enforcing that fusion does not dilute discriminative signals required for segmentation or dense prediction (Chen et al., 2021, Wei et al., 2024).
  • Task consistency or multi-branch agreement: Elicit fusion representations that are maximally informative for both low-level perceptual quality and high-level semantic performance (Zhao et al., 26 Sep 2025).

3. Representative Architectures and Training Strategies

The integration of self-supervised fusion losses is tightly coupled with tailored network architectures:

  • Single-pair Compact CNNs: Four-layer compact CNNs with skip connections and parallel UNet-ResNet18 paths, designed for rapid single-image SSL on IR-visible fusion (Ofir et al., 2023).
  • Multi-branch and Intermediate Fusion Nets: Two-branch ResUNet encoders for SAR-optical data, supporting early, late, or, most effectively, intermediate fusion, with modality-specific pathways followed by feature-map concatenation and joint decoding (Chen et al., 2021); a minimal sketch follows this list.
  • Transformer and Vision-Transformer Backbones: Cross-attention modules for abstracting common/unique features in decomposition-type SSL, with projector heads for reconstruction or feature re-mixing (Liang et al., 2024).
  • Meta-learned Fusion Loss Proposals: Restormer-based learnable loss modules drive adaptive pixel-wise weighting of intensity and gradient losses in a meta-learning feedback loop (Bai et al., 2023).
  • Auxiliary Training Pipelines: Pretext destruction methods (patch-wise nonlinear, gamma, or blur transformations) and masked-modality reconstruction are used to incentivize generalizable fusion representations, especially in transformers (Qu et al., 2022, Koupai et al., 2022).
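A minimal sketch of intermediate fusion: two modality-specific encoders, feature-map concatenation, and a joint decoder. The plain convolutional blocks and layer widths are illustrative and do not reproduce the ResUNet design of Chen et al. (2021).

```python
import torch
import torch.nn as nn

class TwoBranchIntermediateFusion(nn.Module):
    # Each modality gets its own encoder; features are concatenated (intermediate
    # fusion) and decoded jointly into a single output map.
    def __init__(self, ch1=1, ch2=1, feat=32, out_ch=1):
        super().__init__()
        def encoder(in_ch):
            return nn.Sequential(nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        self.enc1, self.enc2 = encoder(ch1), encoder(ch2)
        self.decoder = nn.Sequential(nn.Conv2d(2 * feat, feat, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(feat, out_ch, 3, padding=1))

    def forward(self, x1, x2):
        fused_features = torch.cat([self.enc1(x1), self.enc2(x2)], dim=1)  # intermediate fusion
        return self.decoder(fused_features)
```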

Training can be executed on (a) a single input pair (small-scale, fast convergence), (b) large unlabeled datasets (massive pretraining), or (c) patches extracted for patch-based self-supervision, often with adaptive optimizer settings such as Adam with learning rates in [10^{-4}, 10^{-3}].
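A single-pair training sketch tying the pieces together. It reuses the `fusion_loss` helper from the Section 1a sketch and the `TwoBranchIntermediateFusion` module above, with Adam and a learning rate inside the quoted range; the step count and random placeholder tensors are arbitrary.

```python
import torch

# Reuses fusion_loss (Section 1a sketch) and TwoBranchIntermediateFusion (sketch above).
model = TwoBranchIntermediateFusion()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)            # within [1e-4, 1e-3]
nir, vis = torch.rand(1, 1, 256, 256), torch.rand(1, 1, 256, 256)    # placeholder registered pair

for step in range(500):                                              # arbitrary step count
    fused = model(nir, vis)
    loss = fusion_loss(fused, nir, vis)                              # L_total = L_SoS + L_EP
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```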

4. Application Domains and Empirical Impact

Self-supervised fusion losses enable adaptation across a wide range of data domains:

| Domain | Example Modalities | Task | Core Loss Components | Paper |
|---|---|---|---|---|
| Remote Sensing | SAR, Optical | Land-cover mapping | Pixel/global contrastive | (Chen et al., 2021) |
| Multispectral Imaging | NIR, Visible | Scene fusion | SSIM + edge-preservation | (Ofir et al., 2023) |
| Medical Imaging | MRI (T1, T2, FLAIR) | Denoising | Denoising + domain-transfer | (Wagner et al., 2022) |
| Semantic Segmentation | Visible, Infrared | Segmentation-ready fusion | Fusion + cross-segmentation | (Zhao et al., 26 Sep 2025) |
| Activity Recognition | CSI, PWR (RF signals) | Classification | Masked-modality reconstruction | (Koupai et al., 2022) |
| Multimodal Regression | Text, Audio, Video | Sentiment analysis | Fusion + MI contrastive (InfoNCE) | (Nguyen et al., 2023) |

Empirical validation consistently demonstrates that self-supervised fusion losses match or outperform conventional hand-crafted or supervised alternatives, especially in “label-efficient” regimes or with limited data accessibility. For instance, in semantic segmentation on fusion benchmarks, cross-segmentation consistency alone closes the gap with fully supervised, application-oriented fusion (Zhao et al., 26 Sep 2025). Likewise, meta-learned losses in ReFusion adaptively tune the fusion objective for modalities as diverse as medical and infrared (Bai et al., 2023).

5. Extensions and Open Problems

Recent trends focus on unifying low-level perceptual alignment with high-level semantic guidance, extending the applicability of self-supervised fusion loss to application-oriented designs (e.g., segmentation, detection) entirely without labeled data (Zhao et al., 26 Sep 2025). Other extensions include:

  • Meta-learned or adaptive loss functions that generalize across data types and fusion goals (Bai et al., 2023).
  • Decomposition-style pretext losses for universal fusion pretraining (Liang et al., 2024).
  • Domain- and modality-specific contrastive or distillation terms to exploit cross-domain neuroimaging or multi-modal sensor fusion (Wei et al., 2024).

A major open question is the construction of fully unsupervised, task-adaptive losses that guarantee both perceptual and semantic optimality, together with theoretical analyses quantifying disentanglement and information preservation in fused representations.

6. Comparison with Alternative and Classical Approaches

Self-supervised fusion differs fundamentally from classical “unsupervised” methods that employ fixed losses (e.g., only SSIM, L_1, or mutual information maximization), as self-supervised fusion loss can be (i) parameterized and meta-learned, (ii) multi-level (contrastive, perceptual, semantic), and (iii) task-adaptive. This confers significant gains in adaptability, generalization, and label-efficiency. In the context of tomographic denoising, for example, multi-contrast fusion loss both avoids the spatial-independence assumption of Noise2Void and outperforms it on SSIM/PSNR, as shown in the Noise2Contrast approach (Wagner et al., 2022).


References:

(Ofir et al., 2023, Chen et al., 2021, Liang et al., 2024, Bai et al., 2023, Wagner et al., 2022, Zhao et al., 26 Sep 2025, Koupai et al., 2022, Wei et al., 2024, Nguyen et al., 2023, Qu et al., 2022, Huang et al., 2023)
