Infrared and Visible Image Fusion
- Infrared and visible image fusion is a technique that integrates thermal cues and detailed textures to produce images with enhanced clarity and robust scene understanding.
- It employs a range of methodologies, from transform-based and deep learning approaches to Bayesian and hybrid models, for effective feature integration.
- This fusion improves performance in applications such as autonomous driving, surveillance, and human activity recognition by overcoming sensor limitations.
Infrared and visible image fusion is the process of integrating complementary information from co-registered infrared (IR) and visible (VIS) images to generate a single output that exhibits salient thermal targets (from IR) and rich texture or color features (from VIS). This modality synergy is pivotal for applications such as robust perception in autonomous driving, surveillance under challenging illumination, human activity recognition, and multispectral scene understanding.
1. Fundamentals and Motivations
Infrared sensors capture thermal radiation, effectively highlighting warm objects like people or vehicles with high contrast against backgrounds, especially in poor lighting or weather. However, IR images lack fine spatial detail and color. In contrast, visible images contain rich edge, texture, and color information but often fail to delineate thermal objects under occlusion or adverse lighting (Zhang et al., 2020, Yang et al., 2023, Yang et al., 5 May 2025). Fusing these modalities aims to obtain a unified image with prominent targets and natural scene detail, directly enabling downstream high-level tasks such as detection and segmentation (Hu et al., 30 Oct 2024, Jiang et al., 14 Jul 2024).
Fusion must address heterogeneous characteristics: spectral differences, noise properties, possible geometric misalignments, and the inherent absence of a ground-truth fused image for supervision or evaluation. Early works focused on handcrafted pixel- or transform-domain rules, while recent advances leverage learning-based or hybrid model-data-driven approaches for spatial, frequency, and semantic information integration (Yang et al., 2023, Chen et al., 27 Jun 2024).
2. Methodological Taxonomy
Infrared-visible image fusion encompasses a broad taxonomy of methodologies, including:
a. Transform and Filter-based Methods
Traditional approaches, such as multiscale pyramid (Gaussian/Laplacian), guided-filter (GFF), wavelet, and contourlet transforms, decompose each source image into frequency subbands. Low-frequency (base) layers are typically fused by weighted averaging, while high-frequency (detail) coefficients are combined by max-selection or saliency-guided rules (Zhang et al., 2020). Sparse- and low-rank-representation models (e.g., MST_SR, LatLRR) operate in similar transform domains, selecting coefficients by activity measures.
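As a concrete illustration of this pipeline, the sketch below fuses two pre-registered grayscale images with a Laplacian pyramid, averaging the base layers and max-selecting detail coefficients. The pyramid depth, the equal base weights, and the file names are illustrative assumptions, not settings from any cited method.

```python
import cv2
import numpy as np

def laplacian_pyramid(img, levels=4):
    """Build a Laplacian pyramid; the last element is the low-frequency base."""
    pyr, cur = [], img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(cur)
        up = cv2.pyrUp(down, dstsize=(cur.shape[1], cur.shape[0]))
        pyr.append(cur - up)          # high-frequency detail at this scale
        cur = down
    pyr.append(cur)                   # low-frequency base layer
    return pyr

def fuse_pyramids(pyr_ir, pyr_vis):
    fused = []
    for d_ir, d_vis in zip(pyr_ir[:-1], pyr_vis[:-1]):
        # max-selection on detail coefficients by absolute activity
        fused.append(np.where(np.abs(d_ir) >= np.abs(d_vis), d_ir, d_vis))
    # weighted averaging of the base layers (0.5/0.5 is an illustrative choice)
    fused.append(0.5 * pyr_ir[-1] + 0.5 * pyr_vis[-1])
    return fused

def reconstruct(pyr):
    cur = pyr[-1]
    for detail in reversed(pyr[:-1]):
        cur = cv2.pyrUp(cur, dstsize=(detail.shape[1], detail.shape[0])) + detail
    return np.clip(cur, 0, 255).astype(np.uint8)

ir = cv2.imread("ir.png", cv2.IMREAD_GRAYSCALE)      # hypothetical co-registered pair
vis = cv2.imread("vis.png", cv2.IMREAD_GRAYSCALE)
fused = reconstruct(fuse_pyramids(laplacian_pyramid(ir), laplacian_pyramid(vis)))
cv2.imwrite("fused.png", fused)
```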
b. Deep Learning and Autoencoder-based Methods
Autoencoders (AEs) have been extensively adopted to learn shared and complementary feature representations (Li et al., 2018, Zhao et al., 2020, Zhang et al., 2022). Dual-stream (DIDFuse (Zhao et al., 2020)) or triple-branch (ICAFusion (Wang et al., 2022)) AEs decompose inputs into modality-specific branches and fuse features via spatial or channel attention mechanisms. Modern networks integrate dense blocks for shallow texture, Res2Net for multi-scale receptive fields (Song et al., 2021), or separate detail and semantic branches to simultaneously capture edge/textural and contextual features (Fu et al., 2021).
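The sketch below shows the general dual-stream pattern (modality-specific encoders, an activity-weighted feature fusion, and a shared decoder). It is a minimal PyTorch mock-up in the spirit of DenseFuse/DIDFuse rather than a reproduction of either architecture; the channel widths and the L1-norm fusion rule are assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Modality-specific encoder: a small conv stack producing feature maps."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Shared decoder mapping fused features back to a single-channel image."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid(),
        )
    def forward(self, f):
        return self.net(f)

class DualStreamFusion(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.enc_ir, self.enc_vis, self.dec = Encoder(ch), Encoder(ch), Decoder(ch)
    def forward(self, ir, vis):
        f_ir, f_vis = self.enc_ir(ir), self.enc_vis(vis)
        # L1-norm activity weighting (a DenseFuse-style rule, assumed here)
        a_ir = f_ir.abs().mean(dim=1, keepdim=True)
        a_vis = f_vis.abs().mean(dim=1, keepdim=True)
        w_ir = a_ir / (a_ir + a_vis + 1e-8)
        return self.dec(w_ir * f_ir + (1 - w_ir) * f_vis)

model = DualStreamFusion()
fused = model(torch.rand(1, 1, 256, 256), torch.rand(1, 1, 256, 256))
```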
c. Decomposition-Fusion Paradigms
Retinex theory-inspired decomposition separates each input into illumination and reflectance components, which are then fused independently and recombined, as in SimpleFusion (Chen et al., 27 Jun 2024) and IAIFNet (Yang et al., 2023). Algorithm unrolling (AUIF (Zhao et al., 2020)) implements classical optimization-based two-scale decompositions as interpretable stacked convolutional layers, combining the strengths of priors and end-to-end training.
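A rough, non-learned sketch of the Retinex-style decompose-fuse-recombine idea follows; the Gaussian-blur illumination estimate and the max/abs fusion rules are simplifying assumptions, not the learned modules of SimpleFusion or IAIFNet.

```python
import cv2
import numpy as np

def retinex_decompose(img, sigma=15):
    """Rough Retinex split: smooth illumination estimate and residual reflectance."""
    img = img.astype(np.float32) + 1.0            # avoid division by zero
    illumination = cv2.GaussianBlur(img, (0, 0), sigma)
    reflectance = img / illumination
    return illumination, reflectance

ir = cv2.imread("ir.png", cv2.IMREAD_GRAYSCALE)   # hypothetical co-registered pair
vis = cv2.imread("vis.png", cv2.IMREAD_GRAYSCALE)

l_ir, r_ir = retinex_decompose(ir)
l_vis, r_vis = retinex_decompose(vis)

# Fuse the components independently, then recombine (both rules are illustrative):
l_fused = np.maximum(l_ir, l_vis)                 # keep the brighter illumination
r_fused = np.where(np.abs(r_ir - 1) > np.abs(r_vis - 1), r_ir, r_vis)  # stronger reflectance detail
fused = np.clip(l_fused * r_fused - 1.0, 0, 255).astype(np.uint8)
cv2.imwrite("fused_retinex.png", fused)
```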
d. Frequency- and Hybrid Domain Techniques
Methods such as SFDFusion (Hu et al., 30 Oct 2024) explicitly learn and fuse features from the spatial and frequency (FFT) domains in parallel. Frequency-domain fusion enhances global structure and edge content, complementing spatial gradient- or saliency-based cues.
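The following NumPy sketch illustrates fixed-rule frequency-domain fusion, combining amplitude and phase separately via the FFT. SFDFusion learns these combinations with small convolutional blocks; the averaging rules here are purely illustrative.

```python
import cv2
import numpy as np

ir = cv2.imread("ir.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)   # hypothetical pair
vis = cv2.imread("vis.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

F_ir, F_vis = np.fft.fft2(ir), np.fft.fft2(vis)

# Fuse amplitude and phase separately (fixed averages stand in for learned blocks)
amp = 0.5 * (np.abs(F_ir) + np.abs(F_vis))
phase = np.angle(F_ir + F_vis)        # crude phase combination, for illustration only

fused_freq = np.real(np.fft.ifft2(amp * np.exp(1j * phase)))
fused = np.clip(fused_freq, 0, 255).astype(np.uint8)
cv2.imwrite("fused_freq.png", fused)
```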
e. Bayesian and Statistical Models
BayesianFusion (Zhao et al., 2020) and QIVIF (Yang et al., 5 May 2025) formulate fusion as a hierarchical Bayesian regression with total-variation or quaternion priors. They adaptively infer local uncertainties for robust thermal–texture balancing, and, in the quaternion case, handle full color channel correlations.
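In a similar spirit, a simple variational surrogate can be written as an intensity-fidelity term to the IR image plus a gradient-fidelity term to the visible image. The sketch below solves it with plain gradient descent in PyTorch; the cited works instead perform hierarchical Bayesian inference with EM/ADMM, so this is only a loose analogue with assumed weights.

```python
import torch

def grad(img):
    """Forward-difference spatial gradients (horizontal and vertical)."""
    gx = img[..., :, 1:] - img[..., :, :-1]
    gy = img[..., 1:, :] - img[..., :-1, :]
    return gx, gy

def fuse_variational(ir, vis, lam=5.0, steps=300, lr=0.1):
    """Minimise |f - ir|_1 + lam * |grad f - grad vis|_1 by gradient descent."""
    f = vis.clone().requires_grad_(True)
    opt = torch.optim.Adam([f], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        gx_f, gy_f = grad(f)
        gx_v, gy_v = grad(vis)
        loss = (f - ir).abs().mean() + lam * (
            (gx_f - gx_v).abs().mean() + (gy_f - gy_v).abs().mean())
        loss.backward()
        opt.step()
    return f.detach().clamp(0, 1)

ir = torch.rand(1, 1, 256, 256)    # placeholders for normalised IR / VIS images
vis = torch.rand(1, 1, 256, 256)
fused = fuse_variational(ir, vis)
```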
f. Implicit Neural Representations (INR)
INRFuse (Sun et al., 20 Jun 2025) parameterizes the fused image as a continuous function over spatial coordinates, using MLPs with periodic activations to achieve resolution-independent, per-pixel fusion and super-resolution capabilities, trained unsupervised for each image pair.
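A minimal SIREN-style coordinate network is sketched below: it is fitted per image pair with a simple two-term fidelity loss and can then be queried on an arbitrary coordinate grid. The loss, network width, and training schedule are illustrative assumptions and do not reproduce INRFuse's actual objective.

```python
import torch
import torch.nn as nn

class Siren(nn.Module):
    """Coordinate MLP with periodic (sine) activations, in the spirit of INR fusion."""
    def __init__(self, hidden=256, layers=3, omega=30.0):
        super().__init__()
        dims = [2] + [hidden] * layers + [1]
        self.omega = omega
        self.lins = nn.ModuleList(nn.Linear(a, b) for a, b in zip(dims[:-1], dims[1:]))
    def forward(self, xy):
        h = xy
        for lin in self.lins[:-1]:
            h = torch.sin(self.omega * lin(h))
        return torch.sigmoid(self.lins[-1](h))

H = W = 128
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)

ir = torch.rand(H * W, 1)    # placeholders for flattened, normalised IR / VIS images
vis = torch.rand(H * W, 1)

net = Siren()
opt = torch.optim.Adam(net.parameters(), lr=1e-4)
for _ in range(500):                       # per-image-pair, unsupervised fitting
    opt.zero_grad()
    f = net(coords)
    # illustrative fidelity loss pulling the fused output toward both sources
    loss = (f - ir).abs().mean() + (f - vis).abs().mean()
    loss.backward()
    opt.step()

fused = net(coords).detach().reshape(H, W)  # query other grids for other resolutions
```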
g. Semantic- and Task-driven Fusion
Recent methods leverage semantic priors or high-level tasks. HSFusion (Jiang et al., 14 Jul 2024) aligns semantic and geometric content via dual CycleGAN domain transforms, guided by segmentation masks. "From Text to Pixels" (Li et al., 2023) integrates CLIP-based textual prompts as context-aware semantic priors and couples fusion with object detection via bilevel optimization.
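As a toy example of semantic guidance, the sketch below modulates per-pixel fusion weights with a binary segmentation mask (e.g., a "person" class favoring the IR branch). The mask source and the foreground/background weights are assumptions for illustration only, not the mechanism of the cited methods.

```python
import torch

def semantic_guided_fusion(ir, vis, person_mask, w_fg=0.8, w_bg=0.3):
    """Blend IR and VIS with per-region weights driven by a segmentation mask.
    person_mask: binary tensor, 1 where a thermally salient class is present.
    The weights w_fg / w_bg are illustrative, not values from the cited papers."""
    w_ir = torch.where(person_mask.bool(), torch.full_like(ir, w_fg),
                       torch.full_like(ir, w_bg))
    return w_ir * ir + (1.0 - w_ir) * vis

ir = torch.rand(1, 1, 256, 256)
vis = torch.rand(1, 1, 256, 256)
mask = (torch.rand(1, 1, 256, 256) > 0.9).float()   # placeholder segmentation output
fused = semantic_guided_fusion(ir, vis, mask)
```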
3. Fusion Mechanisms and Attention Strategies
A common structure in modern frameworks is the encoder–fusion–decoder paradigm, with modality-specific or shared-weight feature extractors. Fusion strategies have evolved from fixed rules to adaptive attention modules (a minimal sketch combining spatial and channel attention follows the list):
- Saliency/Spatial Attention: Pixel-wise weights are derived by saliency computation—e.g., histogram-based pixel weight (SAM in DIDFuse (Zhao et al., 2020)), or L1-norm softmax (DenseFuse (Li et al., 2018), Res2NetFuse (Song et al., 2021)).
- Channel Attention: Channel activations are normalized between modalities for channel-wise re-weighting (CAM) (Zhao et al., 2020, Wang et al., 2022).
- Dual-Attention and Multi-scale: MDA (Yang et al., 2023) computes both spatial and channel attention at multiple scales; output fusion adapts per region and per feature map.
- Semantic-Region Masks: ISDM in HSFusion (Jiang et al., 14 Jul 2024) produces segmentation-driven binary region masks that modulate fusion weights, enhancing thermal or visible guidance per semantic class.
- Frequency-Attention: SFDFusion (Hu et al., 30 Oct 2024) separately processes amplitude and phase in the frequency domain, recombining them via small learned convolutional blocks.
- Text or Context Priors: CLIP embeddings inject high-level semantic relationships to guide fusion and enhance robustness in cross-modal, task-driven settings (Li et al., 2023).
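The sketch below combines the spatial (saliency) and channel attention rules described above; the softmax weighting and the equal blend of the two attention outputs are illustrative choices, not the exact modules of the cited networks.

```python
import torch

def spatial_attention(f_ir, f_vis):
    """Pixel-wise weights from L1-norm activity maps (softmax over the two modalities)."""
    a_ir = f_ir.abs().sum(dim=1, keepdim=True)
    a_vis = f_vis.abs().sum(dim=1, keepdim=True)
    w = torch.softmax(torch.cat([a_ir, a_vis], dim=1), dim=1)
    return w[:, :1] * f_ir + w[:, 1:] * f_vis

def channel_attention(f_ir, f_vis):
    """Channel-wise re-weighting from global average pooling of each modality."""
    g_ir = f_ir.mean(dim=(2, 3), keepdim=True)
    g_vis = f_vis.mean(dim=(2, 3), keepdim=True)
    w = torch.softmax(torch.stack([g_ir, g_vis], dim=0), dim=0)
    return w[0] * f_ir + w[1] * f_vis

f_ir, f_vis = torch.rand(1, 64, 64, 64), torch.rand(1, 64, 64, 64)
# equal blend of the two attention outputs (an assumed, illustrative choice)
fused_features = 0.5 * spatial_attention(f_ir, f_vis) + 0.5 * channel_attention(f_ir, f_vis)
```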
4. Objective Functions and Training
Loss design in IVIF aims to balance preservation of complementary and redundant information, structural or perceptual fidelity, and, when relevant, downstream performance. Typical terms include (a minimal composite-loss sketch follows the list):
- Pixel and Structural (SSIM/MS-SSIM): Penalize L2 distance and enforce perceptual similarity to input images (Zhao et al., 2020, Li et al., 2018, Hu et al., 30 Oct 2024).
- Content Consistency and Gradient: Encourage retention of high-frequency details, often through gradient or Laplacian terms (Fu et al., 2021, Yang et al., 2023, Wang et al., 2022).
- Saliency/Adaptive-Weighted Losses: Use saliency maps or adaptive weights for region- or modality-importance, often computed from VGG or U2Net feature activations (Yang et al., 2023, Hu et al., 30 Oct 2024).
- Frequency/Correlation Losses: Explicitly match statistics or correlations in the frequency domain between fused image and sources (Hu et al., 30 Oct 2024).
- Codebook Quantization and Commitment: In VQ-VAE and similar designs, codebook vector quantization is regularized to encourage discrete representation learning for detector compatibility (Li et al., 2023).
- Task-driven and Bilevel Losses: Downstream detection or segmentation (YOLOv5/ViT-Adapter) losses are jointly or sequentially minimized, in combination with image fusion objectives (Li et al., 2023, Jiang et al., 14 Jul 2024).
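A minimal sketch of a composite unsupervised fusion loss combining an intensity term and a Sobel-gradient term is shown below; the per-pixel max targets and term weights are common heuristics rather than the exact objectives of the cited methods, and an SSIM term (e.g., from the pytorch-msssim package) is often added.

```python
import torch
import torch.nn.functional as F

def sobel_grad(img):
    """Gradient magnitude via fixed Sobel kernels."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

def fusion_loss(fused, ir, vis, w_int=1.0, w_grad=10.0):
    """Intensity term pulls toward the per-pixel max of the sources;
    gradient term pulls toward the stronger source gradient.
    Weights are illustrative assumptions, not values from the cited papers."""
    loss_int = F.l1_loss(fused, torch.maximum(ir, vis))
    loss_grad = F.l1_loss(sobel_grad(fused), torch.maximum(sobel_grad(ir), sobel_grad(vis)))
    return w_int * loss_int + w_grad * loss_grad

fused = torch.rand(1, 1, 256, 256, requires_grad=True)
loss = fusion_loss(fused, torch.rand(1, 1, 256, 256), torch.rand(1, 1, 256, 256))
loss.backward()
```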
Optimization is typically performed with Adam or similar stochastic gradient methods, using mini-batch updates and learning-rate scheduling. Classical and hybrid models (BayesianFusion, QIVIF) employ EM and ADMM solvers, sometimes in the quaternion domain (Zhao et al., 2020, Yang et al., 5 May 2025).
5. Evaluation Metrics and Benchmarks
Objective quality measures for IR–VIS fusion, as standardized in VIFB (Zhang et al., 2020), can be grouped:
| Metric Type | Metric Name & Description | Key References |
|---|---|---|
| Information-theoretic | Entropy (EN), Mutual Information (MI), Cross-Entropy (CE) | (Zhang et al., 2020, Yang et al., 5 May 2025) |
| Edge/Structural | Average Gradient (AG), Q^{AB/F}, SSIM, SCD, SF | (Zhang et al., 2020, Hu et al., 30 Oct 2024) |
| Perceptual | MS-SSIM, VIF, PaQ-2-PiQ | (Ataman et al., 11 Dec 2024, Hu et al., 30 Oct 2024) |
| Task-driven | mAP@0.5, mIoU (YOLO / ViT-Adapter detection and segmentation) | (Jiang et al., 14 Jul 2024, Li et al., 2023) |
No single method dominates all metrics universally. Multi-scale and attention-based deep learning approaches, e.g., DenseFuse (Li et al., 2018), DIDFuse (Zhao et al., 2020), and SFDFusion (Hu et al., 30 Oct 2024), as well as Bayesian and quaternion models (Zhao et al., 2020, Yang et al., 5 May 2025), typically score highest across diverse fusion and perceptual metrics. Task-specific and semantically guided methods achieve the best downstream detection and segmentation performance on multispectral datasets (Li et al., 2023, Jiang et al., 14 Jul 2024).
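For reference, two of the simplest no-reference measures, entropy (EN) and average gradient (AG), can be computed as follows. The definitions follow the common formulations, and the random array stands in for a real fused image.

```python
import numpy as np

def entropy(img):
    """Shannon entropy (EN) of an 8-bit image's intensity histogram, in bits."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def average_gradient(img):
    """Average gradient (AG): mean magnitude of horizontal/vertical finite differences."""
    img = img.astype(np.float64)
    gx = np.diff(img, axis=1)[:-1, :]
    gy = np.diff(img, axis=0)[:, :-1]
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))

fused = np.random.randint(0, 256, (256, 256), dtype=np.uint8)  # placeholder fused image
print(f"EN = {entropy(fused):.3f} bits, AG = {average_gradient(fused):.3f}")
```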
6. Application Domains and Datasets
Infrared–visible fusion is foundational in scenarios with variable illumination, including night-time or bad-weather driving, surveillance, search and rescue, and military reconnaissance. Public benchmarks include:
- VIFB (21 pairs, RGB/IR, varied illumination) (Zhang et al., 2020)
- FLIR, TNO, KAIST, MSRS, RoadScene, M³FD (extensive IR–VIS pairs with application diversity) (Hu et al., 30 Oct 2024, Yang et al., 2023, Ataman et al., 11 Dec 2024)
- Text Prompt Dataset (paired IR/VIS plus scene/target text annotations) (Li et al., 2023)
- LLVIP (Low-light, 290 pairs) (Yang et al., 2023)
Datasets typically provide aligned grayscale or color IR/VIS images, with or without object annotations for detection/segmentation evaluation.
7. Current Trends and Open Challenges
Recent IVIF research trends include:
- Semantic and High-level Task Integration: Fusion is optimized for subsequent object detection or segmentation via explicit supervision, semantic priors, or end-to-end coupled learning (Li et al., 2023, Jiang et al., 14 Jul 2024).
- Modality-aware and Dynamic Attention: Fine-grained spatial, channel, and even codebook-based attention mechanisms adapt to local scene properties (Zhao et al., 2020, Wang et al., 2022, Yang et al., 2023).
- Physics- and Domain-driven Losses: Retinex illumination modeling, frequency/phase consistency, and quaternion or Bayesian priors improve robustness in degraded scenes (Chen et al., 27 Jun 2024, Hu et al., 30 Oct 2024, Yang et al., 5 May 2025).
- Resource- and Deployment-Efficient Designs: Lightweight, plain-CNN architectures with fast inference (sub-10 ms per 256×256 image) now enable real-time fusion on embedded edge hardware (Ataman et al., 11 Dec 2024, Hu et al., 30 Oct 2024).
- Resolution-independence and Unsupervised Fusion: Implicit neural representations and image-specific learning decouple from dataset scale and support super-resolution fusion (Sun et al., 20 Jun 2025).
Persistent open challenges include handling severe modality misalignment, fully addressing color and illumination bias, developing consistently interpretable feature attribution, and benchmarking under extreme real-world conditions (e.g., haze, rain, multi-sensor setups).
References
Key contributions foundational to the above survey include (Zhang et al., 2020, Zhao et al., 2020, Li et al., 2018, Song et al., 2021, Hu et al., 30 Oct 2024, Yang et al., 2023, Zhang et al., 2022, Yang et al., 2023, Chen et al., 27 Jun 2024, Ataman et al., 11 Dec 2024, Li et al., 2023, Jiang et al., 14 Jul 2024, Sun et al., 20 Jun 2025, Yang et al., 5 May 2025, Zhao et al., 2020).