
Infrared and Visible Image Fusion

Updated 24 December 2025
  • Infrared and visible image fusion is a technique that integrates thermal cues and detailed textures to produce images with enhanced clarity and robust scene understanding.
  • It employs a range of methodologies from transform-based and deep learning approaches to Bayesian and hybrid models for effective feature integration.
  • This fusion improves performance in applications such as autonomous driving, surveillance, and human activity recognition by overcoming sensor limitations.

Infrared and visible image fusion is the process of integrating complementary information from co-registered infrared (IR) and visible (VIS) images to generate a single output that exhibits salient thermal targets (from IR) and rich texture or color features (from VIS). This modality synergy is pivotal for applications such as robust perception in autonomous driving, surveillance under challenging illumination, human activity recognition, and multispectral scene understanding.

1. Fundamentals and Motivations

Infrared sensors capture thermal radiation, effectively highlighting warm objects like people or vehicles with high contrast against backgrounds, especially in poor lighting or weather. However, IR images lack fine spatial detail and color. In contrast, visible images contain rich edge, texture, and color information but often fail to delineate thermal objects under occlusion or adverse lighting (Zhang et al., 2020, Yang et al., 2023, Yang et al., 5 May 2025). Fusing these modalities aims to obtain a unified image with prominent targets and natural scene detail, directly enabling downstream high-level tasks such as detection and segmentation (Hu et al., 30 Oct 2024, Jiang et al., 14 Jul 2024).

Fusion must address heterogeneous characteristics: spectral differences, noise properties, possible geometric misalignments, and the inherent absence of a ground-truth fused image for supervision or evaluation. Early works focused on handcrafted pixel- or transform-domain rules, while recent advances leverage learning-based or hybrid model-data-driven approaches for spatial, frequency, and semantic information integration (Yang et al., 2023, Chen et al., 27 Jun 2024).

2. Methodological Taxonomy

Infrared-visible image fusion encompasses a broad taxonomy of methodologies, including:

a. Transform and Filter-based Methods

Traditional approaches decompose each source image into frequency subbands or base/detail layers using multiscale transforms (Gaussian/Laplacian pyramids, wavelets, contourlets) or edge-preserving filters (e.g., guided filtering in GFF). Low-frequency/base layers are typically fused by weighted averaging, while high-frequency/detail coefficients use max-selection or saliency-guided rules (Zhang et al., 2020). Sparse and low-rank subspace models (e.g., LatLRR, MST_SR) perform fusion in learned representation domains, selecting coefficients by activity measures. A minimal sketch of the base/detail rules follows.
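The sketch below illustrates the generic averaging/max-selection rules on a simple two-scale split; the `two_scale_fusion` name, the Gaussian-smoothing decomposition, and the fixed base weight are assumptions for illustration, not a specific published method.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def two_scale_fusion(ir, vis, sigma=5.0, w_ir=0.5):
    """Minimal two-scale fusion sketch: weighted-average base layers,
    max-absolute selection for detail layers. Inputs are float arrays in [0, 1]."""
    # Base (low-frequency) layers via Gaussian smoothing; details are the residuals.
    base_ir, base_vis = gaussian_filter(ir, sigma), gaussian_filter(vis, sigma)
    det_ir, det_vis = ir - base_ir, vis - base_vis

    # Rule 1: weighted averaging of base layers.
    base_f = w_ir * base_ir + (1.0 - w_ir) * base_vis
    # Rule 2: max-absolute (activity-based) selection of detail coefficients.
    det_f = np.where(np.abs(det_ir) >= np.abs(det_vis), det_ir, det_vis)

    return np.clip(base_f + det_f, 0.0, 1.0)
```

A pyramid- or wavelet-based method applies the same two rules per decomposition level before reconstructing the fused image.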

b. Deep Learning and Autoencoder-based Methods

Autoencoders (AEs) have been extensively adopted to learn shared and complementary feature representations (Li et al., 2018, Zhao et al., 2020, Zhang et al., 2022). Dual-stream (DIDFuse (Zhao et al., 2020)) or triple-branch (ICAFusion (Wang et al., 2022)) AEs decompose inputs into modality-specific branches and fuse features via spatial or channel attention mechanisms. Modern networks integrate dense blocks for shallow texture, Res2Net for multi-scale receptive fields (Song et al., 2021), or separate detail and semantic branches to simultaneously capture edge/textural and contextual features (Fu et al., 2021).
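The following sketch shows the dual-stream encoder, attention-based fusion, and shared decoder pattern, assuming single-channel inputs and a DenseFuse-style L1-norm/softmax spatial weighting; the layer sizes and the `DualStreamFusion` class are illustrative and do not reproduce DIDFuse or ICAFusion.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True))

class DualStreamFusion(nn.Module):
    """Illustrative dual-stream encoder -> attention fusion -> shared decoder."""
    def __init__(self, ch=32):
        super().__init__()
        self.enc_ir = nn.Sequential(conv_block(1, ch), conv_block(ch, ch))
        self.enc_vis = nn.Sequential(conv_block(1, ch), conv_block(ch, ch))
        self.dec = nn.Sequential(conv_block(ch, ch),
                                 nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, ir, vis):
        f_ir, f_vis = self.enc_ir(ir), self.enc_vis(vis)
        # L1-norm spatial activity maps -> softmax weights (DenseFuse-style rule).
        a_ir = f_ir.abs().mean(dim=1, keepdim=True)
        a_vis = f_vis.abs().mean(dim=1, keepdim=True)
        w = torch.softmax(torch.cat([a_ir, a_vis], dim=1), dim=1)
        fused = w[:, :1] * f_ir + w[:, 1:] * f_vis
        return self.dec(fused)
```

In practice such networks are pre-trained as autoencoders (reconstruction only) and the fusion rule is applied at inference, or the whole pipeline is trained end-to-end with an unsupervised fusion loss.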

c. Decomposition-Fusion Paradigms

Retinex theory-inspired decomposition separates each input into illumination and reflectance components, which are then fused independently and recombined, as in SimpleFusion (Chen et al., 27 Jun 2024) and IAIFNet (Yang et al., 2023). Algorithm unrolling (AUIF (Zhao et al., 2020)) implements classical optimization-based two-scale decompositions as interpretable stacked convolutional layers, combining the strengths of priors and end-to-end training.
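A rough Retinex-style sketch is given below: illumination is estimated by heavy smoothing, reflectance is the ratio, and the two components are fused with simple rules before recombination. The smoothing-based estimate and the max/average rules are assumptions for illustration, not the pipelines of SimpleFusion or IAIFNet.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def retinex_fusion(ir, vis, sigma=15.0, eps=1e-6):
    """Decompose each input into illumination * reflectance, fuse components, recombine.
    Inputs are float arrays in [0, 1]."""
    def decompose(img):
        illum = np.clip(gaussian_filter(img, sigma), eps, None)  # illumination estimate
        refl = img / illum                                       # reflectance component
        return illum, refl

    l_ir, r_ir = decompose(ir)
    l_vis, r_vis = decompose(vis)

    l_f = np.maximum(l_ir, l_vis)   # keep the brighter illumination (thermal salience)
    r_f = 0.5 * (r_ir + r_vis)      # average reflectance detail
    return np.clip(l_f * r_f, 0.0, 1.0)
```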

d. Frequency- and Hybrid Domain Techniques

Methods such as SFDFusion (Hu et al., 30 Oct 2024) explicitly learn and fuse features from both spatial and frequency (FFT) domains in parallel. Frequency fusion enhances global structure and edge contents, complementing spatial gradient- or saliency-based cues.
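The sketch below conveys the idea of fusing in the FFT domain by combining amplitude and phase separately; the max-amplitude and phase-averaging rules are illustrative placeholders for SFDFusion's learned frequency-domain modules.

```python
import numpy as np

def frequency_fusion(ir, vis):
    """Fuse FFT amplitudes by max-selection and average the phases, then invert.
    Inputs are float arrays in [0, 1]."""
    F_ir, F_vis = np.fft.fft2(ir), np.fft.fft2(vis)

    amp = np.maximum(np.abs(F_ir), np.abs(F_vis))   # fused amplitude spectrum
    phase = np.angle(F_ir + F_vis)                  # phase of the complex mean

    fused = np.fft.ifft2(amp * np.exp(1j * phase)).real
    return np.clip(fused, 0.0, 1.0)
```

In hybrid methods the frequency branch runs in parallel with a spatial branch, and the two fused results are merged by a further learned module.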

e. Bayesian and Statistical Models

BayesianFusion (Zhao et al., 2020) and QIVIF (Yang et al., 5 May 2025) formulate fusion as a hierarchical Bayesian regression with total-variation or quaternion priors. They adaptively infer local uncertainties for robust thermal–texture balancing, and, in the quaternion case, handle full color channel correlations.
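As a rough intuition for this family, a generic variational form combines per-source fidelity terms with a total-variation prior; this is a simplified surrogate, not the exact hierarchical model or quaternion formulation of the cited works:

```latex
\hat{u} \;=\; \arg\min_{u}\;
  \lambda_{1}\,\lVert u - I_{\mathrm{ir}} \rVert_{1}
+ \lambda_{2}\,\lVert u - I_{\mathrm{vis}} \rVert_{1}
+ \mu\,\mathrm{TV}(u)
```

The Bayesian treatments place hierarchical priors on the fidelity weights and infer them per pixel (via EM or ADMM), which is what yields the adaptive thermal–texture balancing described above.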

f. Implicit Neural Representations (INR)

INRFuse (Sun et al., 20 Jun 2025) parameterizes the fused image as a continuous function over spatial coordinates, using MLPs with periodic activations to achieve resolution-independent, per-pixel fusion and super-resolution capabilities, trained unsupervised for each image pair.
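A minimal coordinate-MLP sketch with periodic (sine) activations is shown below; the network sizes, the omitted SIREN-specific initialization, and the max-intensity fitting loss are assumptions for illustration, not INRFuse itself.

```python
import torch
import torch.nn as nn

class SirenLayer(nn.Module):
    """Linear layer followed by a sine activation (periodic activation)."""
    def __init__(self, c_in, c_out, omega=30.0):
        super().__init__()
        self.linear, self.omega = nn.Linear(c_in, c_out), omega
    def forward(self, x):
        return torch.sin(self.omega * self.linear(x))

class FusionINR(nn.Module):
    """Coordinate MLP mapping (x, y) in [-1, 1]^2 to a fused intensity."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(SirenLayer(2, hidden), SirenLayer(hidden, hidden),
                                 nn.Linear(hidden, 1), nn.Sigmoid())
    def forward(self, coords):          # coords: (N, 2)
        return self.net(coords)

def fit_loss(pred, ir, vis):
    """Unsupervised per-pair fitting: pull the prediction toward the brighter source pixel.
    ir/vis: (N, 1) intensities sampled at the same coordinates."""
    return torch.mean((pred - torch.maximum(ir, vis)) ** 2)
```

Because the representation is continuous in the coordinates, the fitted MLP can be queried on a denser grid than the inputs, which is the basis of the resolution-independent and super-resolution behavior.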

g. Semantic- and Task-driven Fusion

Recent methods leverage semantic priors or high-level tasks. HSFusion (Jiang et al., 14 Jul 2024) aligns semantic and geometric content via dual CycleGAN domain transforms, guided by segmentation masks. "From Text to Pixels" (Li et al., 2023) integrates CLIP-based textual prompts as context-aware semantic priors, and ties fusion with object detection via bilevel optimization.
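The sketch below shows one way a text prior could modulate fused features: frozen CLIP text embeddings (via the Hugging Face transformers API) gate feature maps in a FiLM-like manner. The `TextGatedFusion` module and the gating scheme are assumptions for illustration, not the architecture of Li et al. or HSFusion.

```python
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPModel  # assumed available

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()

@torch.no_grad()
def text_prior(prompts):
    """Encode prompts such as 'a pedestrian at night' into CLIP text embeddings."""
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    return clip_model.get_text_features(**tokens)    # (B, 512) projected embeddings

class TextGatedFusion(nn.Module):
    """FiLM-style gating of fused feature maps by a text embedding (illustrative only)."""
    def __init__(self, feat_ch=64, text_dim=512):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, feat_ch)
        self.to_beta = nn.Linear(text_dim, feat_ch)
    def forward(self, fused_feat, text_emb):          # fused_feat: (B, C, H, W)
        gamma = self.to_gamma(text_emb)[:, :, None, None]
        beta = self.to_beta(text_emb)[:, :, None, None]
        return fused_feat * (1 + gamma) + beta
```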

3. Fusion Mechanisms and Attention Strategies

A common structure in modern frameworks is the encoder–fusion–decoder paradigm, with modality-specific or shared-weight feature extractors. Fusion strategies have evolved from fixed rules to adaptive attention modules:

  • Saliency/Spatial Attention: Pixel-wise weights are derived by saliency computation—e.g., histogram-based pixel weight (SAM in DIDFuse (Zhao et al., 2020)), or L1-norm softmax (DenseFuse (Li et al., 2018), Res2NetFuse (Song et al., 2021)).
  • Channel Attention: Channel activations are pooled and normalized between modalities for channel-wise re-weighting (CAM) (Zhao et al., 2020, Wang et al., 2022); a minimal sketch follows this list.
  • Dual-Attention and Multi-scale: MDA (Yang et al., 2023) computes both spatial and channel attention at multiple scales; output fusion adapts per region and per feature map.
  • Semantic-Region Masks: ISDM in HSFusion (Jiang et al., 14 Jul 2024) produces segmentation-driven binary region masks that modulate fusion weights, enhancing thermal or visible guidance per semantic class.
  • Frequency-Attention: SFDFusion (Hu et al., 30 Oct 2024) separately processes amplitude and phase in the frequency domain, combining them via small learned convolutional blocks.
  • Text or Context Priors: CLIP embeddings inject high-level semantic relationships to guide fusion and enhance robustness in cross-modal, task-driven settings (Li et al., 2023).
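
The sketch below (referenced from the Channel Attention item) illustrates a generic channel-attention fusion rule: global-average channel activations are softmax-normalized between the two modalities and used as channel-wise weights. The exact pooling and normalization in DIDFuse/ICAFusion may differ.

```python
import torch

def channel_attention_fusion(f_ir, f_vis):
    """Fuse two feature maps of shape (B, C, H, W) with channel-wise weights."""
    # Global average pooling -> per-channel activation (B, C, 1, 1).
    g_ir = f_ir.mean(dim=(2, 3), keepdim=True)
    g_vis = f_vis.mean(dim=(2, 3), keepdim=True)
    # Softmax across the two modalities, channel by channel.
    w = torch.softmax(torch.stack([g_ir, g_vis], dim=0), dim=0)
    return w[0] * f_ir + w[1] * f_vis
```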

4. Objective Functions and Training

Loss design in IVIF aims to balance preservation of complementary and redundant information, structural or perceptual fidelity, and, when relevant, downstream performance. Typical terms include pixel/intensity fidelity to one or both sources, gradient or texture preservation, structural similarity (SSIM), perceptual or feature-level losses, and, in task-driven settings, detection or segmentation losses.

Optimization is typically performed with Adam or a similar stochastic gradient method, using mini-batch updates and careful learning-rate scheduling. Classical and hybrid models (BayesianFusion, QIVIF) instead rely on EM and ADMM solvers, in the latter case operating in the quaternion domain (Zhao et al., 2020, Yang et al., 5 May 2025).
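A hedged sketch of a typical unsupervised IVIF loss is shown below: the intensity term pulls the fused image toward the brighter source pixel and the gradient term preserves the stronger source edge. The Sobel approximation, the max-based targets, and the weights are illustrative choices, not a specific paper's loss.

```python
import torch
import torch.nn.functional as F

def sobel_gradient(x):
    """Approximate gradient magnitude of a (B, 1, H, W) image with Sobel kernels."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(x, kx.to(x), padding=1)
    gy = F.conv2d(x, ky.to(x), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def fusion_loss(fused, ir, vis, w_int=1.0, w_grad=1.0):
    """Intensity fidelity to the brighter source + preservation of the stronger gradient."""
    l_int = F.l1_loss(fused, torch.maximum(ir, vis))
    l_grad = F.l1_loss(sobel_gradient(fused),
                       torch.maximum(sobel_gradient(ir), sobel_gradient(vis)))
    return w_int * l_int + w_grad * l_grad
```

SSIM, perceptual, adversarial, or task losses are added on top of such terms with scalar weights tuned per method.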

5. Evaluation Metrics and Benchmarks

Objective quality measures for IR–VIS fusion, as standardized in VIFB (Zhang et al., 2020), can be grouped as follows (a small metric-computation sketch follows the table):

| Metric Type | Metric Names | Key References |
| --- | --- | --- |
| Information-theoretic | Entropy (EN), Mutual Information (MI), Cross-Entropy (CE) | (Zhang et al., 2020, Yang et al., 5 May 2025) |
| Edge/Structural | Average Gradient (AG), Q^{AB/F}, SSIM, SCD, Spatial Frequency (SF) | (Zhang et al., 2020, Hu et al., 30 Oct 2024) |
| Perceptual | MS-SSIM, VIF, PaQ-2-PiQ | (Ataman et al., 11 Dec 2024, Hu et al., 30 Oct 2024) |
| Task-driven | mAP@0.5, mIoU (YOLO detection, ViT-Adapter segmentation) | (Jiang et al., 14 Jul 2024, Li et al., 2023) |
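Two of the simpler reference-free metrics can be computed directly, as sketched below for 8-bit-range grayscale images; normalization and edge handling vary slightly across implementations, so treat these as illustrative definitions.

```python
import numpy as np

def entropy(img, bins=256):
    """Shannon entropy (EN) of a grayscale image with values in [0, 255]."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def average_gradient(img):
    """Average gradient (AG): mean magnitude of horizontal/vertical differences."""
    img = img.astype(np.float64)
    gx = np.diff(img, axis=1)[:-1, :]   # crop both to a common (M-1, N-1) shape
    gy = np.diff(img, axis=0)[:, :-1]
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))
```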

No single method dominates across all metrics. Multi-scale and attention-based deep learning approaches (DenseFuse (Li et al., 2018), DIDFuse (Zhao et al., 2020), SFDFusion (Hu et al., 30 Oct 2024)) and Bayesian or quaternion models (Zhao et al., 2020, Yang et al., 5 May 2025) typically perform strongly across diverse fusion and perceptual metrics. Task-specific and semantically guided methods demonstrate the best downstream detection and segmentation performance on multispectral datasets (Li et al., 2023, Jiang et al., 14 Jul 2024).

6. Application Domains and Datasets

Infrared–visible fusion is foundational in scenarios with variable illumination, including night-time or bad-weather driving, surveillance, search and rescue, and military reconnaissance. Widely used public benchmarks include TNO, RoadScene, MSRS, M3FD, LLVIP, and the VIFB benchmark suite (Zhang et al., 2020).

Datasets typically provide aligned grayscale or color IR/VIS images, with or without object annotations for detection/segmentation evaluation.

Recent IVIF research trends include semantics- and task-driven fusion, joint spatial and frequency-domain learning, implicit neural representations, text- or CLIP-guided priors, and interpretable model-data hybrids such as algorithm unrolling and Bayesian or quaternion formulations.

Persistent open questions include handling severe modality misalignment, fully addressing color and illumination bias, developing consistently interpretable feature attribution, and benchmarking under extreme real-world conditions (e.g., haze, rain, multi-sensor setups).

References

Key contributions foundational to the above survey include (Zhang et al., 2020, Zhao et al., 2020, Li et al., 2018, Song et al., 2021, Hu et al., 30 Oct 2024, Yang et al., 2023, Zhang et al., 2022, Yang et al., 2023, Chen et al., 27 Jun 2024, Ataman et al., 11 Dec 2024, Li et al., 2023, Jiang et al., 14 Jul 2024, Sun et al., 20 Jun 2025, Yang et al., 5 May 2025, Zhao et al., 2020).
