Pixel-Level Fusion Methods Overview
- Pixel-level fusion methods combine spatially aligned images at the level of the individual pixel to enhance spatial and spectral detail across modalities, with applications in domains such as remote sensing and medical imaging.
- These methods employ classical algorithms like arithmetic combinations and wavelet transforms along with advanced neural networks to minimize spectral distortion while preserving critical spatial features.
- Recent advancements leverage CNNs, transformers, and attention mechanisms to adaptively weight pixel contributions, improving fusion outcomes for diverse applications.
Pixel-level fusion methods refer to techniques that integrate multimodal or multiscale image data at the finest granularity—the individual pixel. These methods combine spatially aligned input images or feature maps, typically from different sensors (e.g., RGB and NIR, LiDAR and camera, PET and MRI), to synthesize a new image or feature field with enhanced information content. The principal aim is to maximize complementary detail, contrast, or semantic cues from each modality, while minimizing redundancy and spectral distortion. Pixel-level fusion is foundational in domains such as remote sensing, medical imaging, vision-based detection, and multi-modal learning. The landscape of pixel-level fusion encompasses both classical parametric algorithms (arithmetic, frequency domain, statistical, fuzzy logic) and advanced, data-driven or task-specific neural architectures.
1. Classical Pixel-Level Fusion Algorithms: Foundations and Formulations
Pixel-level fusion originated in remote sensing and medical imaging, driven by the need to combine high-spatial-resolution (e.g., panchromatic) and high-spectral-resolution (multispectral) signals. Classical formulations include:
- Arithmetic Combination Methods:
- Brovey Transform (BT):
- $F_k = \dfrac{MS_k}{\sum_{j} MS_j} \times PAN$, where $MS_k$ is the $k$-th multispectral band and $PAN$ the panchromatic image.
- Injects PAN spatial detail into each MS band by proportional scaling (Al-Wassai et al., 2011).
- Multiplicative and Color Normalization:
- Multiplicative: $F_k = MS_k \times PAN$, typically followed by rescaling;
- Color normalization variant adds bias and normalization to reduce spectral distortion.
- Statistical Matching Methods:
- Local Mean Matching (LMM):
- $F_k(i,j) = \dfrac{PAN(i,j)\,\overline{MS_k}(i,j)}{\overline{PAN}(i,j)}$, where the bar denotes local means in a sliding window. This preserves mean radiometry but not local contrast (Al-Wassai et al., 2011).
- Local Mean and Variance Matching (LMVM):
- $F_k(i,j) = \dfrac{\left(PAN(i,j) - \overline{PAN}(i,j)\right)\sigma_{MS_k}(i,j)}{\sigma_{PAN}(i,j)} + \overline{MS_k}(i,j)$, which matches both the local mean and the local variance.
- Frequency Domain and Multi-scale Approaches:
- Wavelet Transform-Based Fusion (WT):
- Apply a discrete wavelet transform to separably decompose high- and low-frequency content for PAN and each MS band. Replace MS detail coefficients with those from PAN and invert (Gharbia et al., 2014, Al-Wassai et al., 2011).
- Fuzzy Logic-Based Fusion:
Apply fuzzy membership functions and a small rule base to assign each pixel a fused value considering uncertainty and smooth transitions (Dammavalam et al., 2013). The output is defuzzified using the centroid of the aggregated output membership.
- Statistical Regression Methods:
- Regression Variable Substitution (RVS):
- In local windows, fit the linear regression of MS on PAN, substitute the PAN pixel, and reconstruct (Al-Wassai et al., 2011).
Each method operates under the assumption of spatial alignment and generally requires preprocessing by resampling and radiometric normalization; minimal sketches of representative formulations follow.
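The classical formulations above reduce to simple per-pixel arithmetic over co-registered arrays. The following NumPy/SciPy sketch illustrates the Brovey transform and LMVM as defined above; array shapes, the window size, and the epsilon guard are illustrative assumptions rather than prescriptions from the cited papers.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def brovey(ms, pan, eps=1e-6):
    """Brovey transform: F_k = MS_k / sum_j(MS_j) * PAN.

    ms  : (H, W, B) co-registered, resampled multispectral bands
    pan : (H, W)    panchromatic image
    """
    ratio = pan / (ms.sum(axis=-1) + eps)          # per-pixel scaling factor
    return ms * ratio[..., None]                   # inject PAN detail into each band

def lmvm(ms_band, pan, window=7, eps=1e-6):
    """Local Mean and Variance Matching for a single MS band."""
    pan_mean = uniform_filter(pan, window)
    pan_var = uniform_filter(pan**2, window) - pan_mean**2
    ms_mean = uniform_filter(ms_band, window)
    ms_var = uniform_filter(ms_band**2, window) - ms_mean**2
    # Match PAN's local deviations to the MS band's local statistics
    return (pan - pan_mean) * np.sqrt(np.clip(ms_var, 0, None)) / \
           (np.sqrt(np.clip(pan_var, 0, None)) + eps) + ms_mean

# Example on synthetic data; real use assumes co-registered,
# radiometrically normalized inputs as noted above.
ms = np.random.rand(128, 128, 4)
pan = np.random.rand(128, 128)
fused_bt = brovey(ms, pan)
fused_lmvm = np.stack([lmvm(ms[..., b], pan) for b in range(ms.shape[-1])], axis=-1)
```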
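The wavelet-based variant keeps the low-frequency (approximation) content of each MS band and substitutes the PAN detail subbands before inversion. A single-level sketch using PyWavelets (an assumed dependency; multi-level decompositions are common in practice) is shown below.

```python
import numpy as np
import pywt

def wavelet_fusion(ms_band, pan, wavelet="db2"):
    """Single-level DWT fusion: keep MS approximation, take PAN detail subbands."""
    ms_approx, _ = pywt.dwt2(ms_band, wavelet)            # low-frequency MS content
    _, pan_detail = pywt.dwt2(pan, wavelet)               # (cH, cV, cD) from PAN
    return pywt.idwt2((ms_approx, pan_detail), wavelet)   # invert with swapped details

pan = np.random.rand(256, 256)
ms_band = np.random.rand(256, 256)
fused = wavelet_fusion(ms_band, pan)
```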
2. Data-Driven and Content-Adaptive Pixel-Level Fusion
The emergence of data-driven techniques, particularly unsupervised and supervised CNNs, has led to advanced pixel-level fusion strategies that adaptively weight source contributions based on learned or measured content:
- Spatial Masking with CNNs:
- MaskNet: Generates a pixel-wise mask $M$ via a CNN from the stacked inputs; the fused image is $F = M \odot I_1 + (1 - M) \odot I_2$, where $\odot$ denotes element-wise multiplication (Kumar et al., 2020). A simplified sketch of this mask-blend pattern appears after this list.
- Weighted Averaging (Guided): Direct computation of relative influence per pixel, e.g., $F(x) = w(x)\,I_1(x) + \bigl(1 - w(x)\bigr) I_2(x)$, with the weight $w(x)$ derived from a guidance or activity measure at each pixel.
- Saliency or Structure-Guided Fusion:
- Superpixel-Based Fusion: Segment the inputs into superpixels (SLIC), compute local saliency (standard deviation) within each region, and design a smooth mask via a sigmoid of saliency difference. For each pixel, blend and reconstruct color by chrominance injection (Ofir et al., 2021).
- Granular-Ball Few-shot Fusion (GBFF): Model local pixel pairs as granular balls in brightness space, extract salient/non-salient pairs, compute a pixel-wise mask $M$, and form a pseudo-supervised image $F^{*}$ by mask-weighted blending of the sources. The network is trained to fit $F^{*}$ with SSIM, Sobel, and Laplacian losses. Boundary and positive regions (as determined by global mask statistics) modulate the regime of supervision (Deng et al., 11 Apr 2025).
- Pixel-Region Hard Selection:
- FillIn Modality: Use superpixel segmentation as an unsupervised prior; regions that vanish upon down-sampling are “locked” to low-level features, forming a binary mask that strictly selects between low- and high-level representations. This hard, region-wise gating is parameter-free and exclusively routes small object regions to detail-rich sources (Liu et al., 2019).
- Pixel-Aligned Multi-modal Feature Fusion:
- VPFusion/VPFNet: For 3D vision tasks, aligns features from camera (2D) and LiDAR (3D voxels) via projection, then fuses per-voxel by extracting the RoI-aligned pixel feature (from the associated camera feature map) and passing both through a parameter-driven fusion network. Special parameters (density, occlusion, contrast) guide per-voxel pixel-weighting (Wang et al., 2021, Mahmud et al., 2022).
- GeminiFusion: Pixel-wise transformer fusion in vision transformers, with layer-adaptive noise and a relation discriminator efficiently blending intra-modal self-attention and inter-modal cross-attention at each token (patch/location) (Jia et al., 3 Jun 2024).
- Task-Driven or Attention-augmented CNN Fusion:
- YOLOMG Bimodal Fusion: Fuses RGB and motion-difference maps for small-object detection by adaptive channel weighting (per-channel MLP softmax), CBAM (channel and spatial attention), and injects the fused feature map directly into the YOLOv5 backbone (Guo et al., 10 Mar 2025).
- HyHDRNet: For dynamic HDR, applies pixel-level ghost attention (spatial softmax on per-pixel query-key dot product), gating (combining pixel/patch cues), and Transformer deformable sampling to resolve alignment and exposure artifacts (Yan et al., 2023).
- Task-Driven Pixel-level Fusion (TPF): Fuses RGB and TIR for tracking via a dedicated Pixel-level Fusion Adapter (PFA), using linear-complexity Mamba-based state-space models, progressive expert distillation, decoupled task representation, and dynamic template updating to ensure robust, discriminative fusion (Lu et al., 14 Mar 2025).
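Most of the data-driven schemes above reduce, at inference time, to a per-pixel soft mask that blends the sources. The sketch below is a simplified, training-free stand-in: it derives the mask from a sigmoid of the local-contrast (standard-deviation) difference, loosely mirroring the saliency-guided designs cited above rather than reproducing any specific method; the window size and temperature are assumed values.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_std(img, window=9):
    """Local standard deviation as a simple per-pixel saliency measure."""
    mean = uniform_filter(img, window)
    var = uniform_filter(img**2, window) - mean**2
    return np.sqrt(np.clip(var, 0, None))

def mask_fusion(img_a, img_b, window=9, temperature=0.05):
    """Blend two aligned grayscale images with a soft, saliency-derived mask."""
    saliency_gap = local_std(img_a, window) - local_std(img_b, window)
    logits = np.clip(saliency_gap / temperature, -50, 50)
    mask = 1.0 / (1.0 + np.exp(-logits))               # sigmoid, values in [0, 1]
    return mask * img_a + (1.0 - mask) * img_b, mask

# A learned approach (e.g., a MaskNet-style CNN) would replace the handcrafted
# saliency with a network that predicts `mask` from the stacked inputs, but the
# blending equation F = M*I1 + (1-M)*I2 is the same.
img_a = np.random.rand(240, 320)
img_b = np.random.rand(240, 320)
fused, mask = mask_fusion(img_a, img_b)
```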
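For the attention-augmented CNN fusions above, the per-channel weighting step can be sketched as a squeeze-and-excitation-style module with a softmax over modalities for each channel. This is a generic PyTorch illustration under assumed layer sizes, not the exact YOLOMG or CBAM design.

```python
import torch
import torch.nn as nn

class ChannelWeightedFusion(nn.Module):
    """Fuse two aligned feature maps with adaptive per-channel softmax weights."""

    def __init__(self, channels, hidden=16):
        super().__init__()
        # Small MLP producing one logit per channel and per modality
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2 * channels),
        )

    def forward(self, feat_a, feat_b):
        n, c, _, _ = feat_a.shape
        # Global average pooling ("squeeze") over both modalities
        pooled = torch.cat([feat_a.mean(dim=(2, 3)), feat_b.mean(dim=(2, 3))], dim=1)
        logits = self.mlp(pooled).view(n, 2, c)
        weights = torch.softmax(logits, dim=1)        # competition between modalities
        w_a = weights[:, 0].view(n, c, 1, 1)
        w_b = weights[:, 1].view(n, c, 1, 1)
        return w_a * feat_a + w_b * feat_b

fusion = ChannelWeightedFusion(channels=64)
out = fusion(torch.rand(2, 64, 32, 32), torch.rand(2, 64, 32, 32))
```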
3. Practical Implementations: Computational and Statistical Performance
Implementation and assessment of pixel-level fusion methods require careful system design:
- Computational Complexity:
- Classical arithmetic and frequency-based methods (IHS, Brovey, HPFA, HFA, HFM) run in $O(N)$ or $O(N \log N)$ time in the number of pixels $N$, making them amenable to real-time applications.
- Superpixel-based algorithms also achieve linear complexity in the number of pixels via efficient region statistics and SLIC segmentation (Ofir et al., 2021).
- Modern pixel-wise transformer modules or state-space models (Mamba-based) are designed for linear complexity, permitting real-time or high-throughput deployment (Jia et al., 3 Jun 2024, Lu et al., 14 Mar 2025).
- Performance Metrics:
- Spectral metrics: Correlation Coefficient (CC), Normalized RMSE (NRMSE), Deviation Index (DI), Mutual Information (MIM), and the Image Quality Index (IQI).
- Spatial metrics: Edge Preservation, Entropy, Standard Deviation (SD), Average Gradient (AG), Visual Information Fidelity (VIF), Structural Similarity Index (SSIM) (Gharbia et al., 2014, Dammavalam et al., 2013, Ofir et al., 2021).
- Task-specific metrics: For detection or segmentation scenarios, mean Average Precision (mAP), mIoU, or task-driven aggregate scores are preferred (Wang et al., 2021, Guo et al., 10 Mar 2025).
- Quantitative Findings:
- Statistical methods like RVS and LCM typically offer the best trade-off between spectral fidelity and spatial sharpening among the classical family (CC 0.93–0.94) (Al-Wassai et al., 2011).
- Superpixel-based, granular-ball, and region-gated masks, as well as attention-based deep fusion, outperform conventional global (PCA, α-blending) and transform (wavelet, Laplacian pyramid) fusion in detail preservation and perceptual metrics (Ofir et al., 2021, Deng et al., 11 Apr 2025).
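Several of the no-reference spatial metrics listed above (entropy, standard deviation, average gradient) and the reference-based correlation coefficient are straightforward to compute. The NumPy sketch below shows one common set of definitions; exact formulations vary across the cited papers.

```python
import numpy as np

def entropy(img, bins=256):
    """Shannon entropy of the intensity histogram (bits)."""
    hist, _ = np.histogram(img, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def average_gradient(img):
    """Mean magnitude of local intensity gradients (spatial-sharpness proxy)."""
    gy, gx = np.gradient(img.astype(np.float64))
    return np.mean(np.sqrt((gx**2 + gy**2) / 2.0))

def correlation_coefficient(fused, reference):
    """Pearson correlation between fused and reference images (spectral fidelity)."""
    f = fused.ravel() - fused.mean()
    r = reference.ravel() - reference.mean()
    return float(f @ r / (np.linalg.norm(f) * np.linalg.norm(r) + 1e-12))

fused = np.random.rand(128, 128)
reference = np.random.rand(128, 128)
print(entropy(fused), average_gradient(fused), correlation_coefficient(fused, reference))
```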
4. Advances in Neural Pixel-Level Fusion: Supervision and Task Adaptation
Neural and transformer-based methods have extended pixel-level fusion into task-adaptive scenarios:
- Pixel-level Supervision for Universal Fusion:
- GIFNet: Employs pixel-level MSE and SSIM losses in digital photography fusion (multi-focus, multi-exposure) to enforce a dense, low-level feature space shared by multimodal branches. Cross-fusion gating modules enable dynamic feature blending, yielding task-agnostic fusion capabilities (Cheng et al., 27 Feb 2025).
- Interpretable Pixel-level Fusion:
- FuseVis: Enables real-time per-pixel saliency analysis of CNN fusion architectures, visualizing spatial gradients and guidance maps to assess which input pixels influence each fused output. This highlights the need for architectures like MaskNet that maintain the expected clinical/semantic pixel-wise correspondence (Kumar et al., 2020).
- Few-shot Fusion with Prior-Driven Masking:
- GBFF: Constructs a pseudo-supervised mask from precomputed granular-ball statistics (histogram-local) in intensity space, requiring only a handful of training pairs. Adaptation of loss functions to local positive/boundary region statistics enables robust generalization across fusion tasks from very limited data (Deng et al., 11 Apr 2025).
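The pixel-level supervision described above typically combines an intensity term with structure-sensitive terms. Below is a minimal PyTorch sketch of such a composite loss against a (pseudo-)supervised target; it uses MSE plus a Sobel-gradient term as a stand-in for the SSIM/Sobel/Laplacian combinations cited above, and the loss weight is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

# Fixed Sobel kernels for horizontal/vertical gradients (shape: out, in, kH, kW)
_SOBEL = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]],
                       [[[-1., -2., -1.], [0., 0., 0.], [1., 2., 1.]]]])

def sobel_grad(x):
    """Per-pixel gradient responses of a single-channel image batch (N, 1, H, W)."""
    return F.conv2d(x, _SOBEL.to(x.device, x.dtype), padding=1)

def pixel_fusion_loss(fused, target, grad_weight=0.5):
    """MSE on intensities plus L1 on Sobel gradients against a (pseudo-)target."""
    intensity = F.mse_loss(fused, target)
    gradient = F.l1_loss(sobel_grad(fused), sobel_grad(target))
    return intensity + grad_weight * gradient

# Usage with a pseudo-supervised target built from a pixel-wise mask M:
# target = M * img_a + (1 - M) * img_b
fused = torch.rand(2, 1, 64, 64, requires_grad=True)
target = torch.rand(2, 1, 64, 64)
loss = pixel_fusion_loss(fused, target)
loss.backward()
```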
5. Application-Specific and Domain-Aware Pixel-Level Fusion
Design choices in pixel-level fusion are strongly influenced by the characteristics and requirements of the application domain:
- Remote Sensing:
Demands high spatial resolution and accurate preservation of multispectral signatures. Pixel-level frequency methods (wavelet, high-frequency addition) and adaptive statistical algorithms (RVS, LMVM) are preferred depending on whether spatial or spectral fidelity is prioritized (Gharbia et al., 2014, Al-Wassai et al., 2011).
- Medical Imaging:
Fuzzy logic fusion has demonstrated higher image quality index (IQI), mutual information, and lower RMSE compared to wavelet and GA-weighted DWT fusion (Dammavalam et al., 2013). CNN-based MaskNet architectures, enhanced by per-pixel saliency visualization, provide transparency in clinical contexts (Kumar et al., 2020).
- 3D Vision (LiDAR+Camera):
Geometric alignment and fusion at the voxel–pixel level, as in VPFNet, enable substantial gains in small-object detection (e.g., pedestrians on KITTI), exploiting semantic and spatial synergy (Wang et al., 2021, Mahmud et al., 2022).
- HDR Imaging and Dynamic Scene Deghosting:
Pixel-level ghost attention and gating modules work with local and global aggregation, providing fine motion/saturation suppression in HDR fusion for dynamic scenes (Yan et al., 2023).
- Surveillance/Drone Detection:
Lightweight, per-channel attention fusion of motion and appearance enables robust detection of tiny targets in challenging video environments (Guo et al., 10 Mar 2025).
6. Limitations, Practical Constraints, and Future Directions
While pixel-level fusion remains foundational and highly versatile, several limitations persist:
- Spectral–Spatial Trade-off:
Methods often enhance spatial detail at the expense of spectral distortion. Maintaining both, especially in multi-modal settings, requires content-adaptive or data-driven weighting (Al-Wassai et al., 2011, Gharbia et al., 2014, Cheng et al., 27 Feb 2025).
- Reliance on Accurate Registration:
All pixel-level approaches presuppose strict spatial alignment between sources. Errors in registration propagate to artifacts in the fused output.
- Rule Explosion and Parameter Selection:
In fuzzy and statistical frameworks, extending beyond two inputs or domains causes rule-base or parameter explosion, limiting direct scalability (Dammavalam et al., 2013).
- Lack of Ground-Truth for Training:
Absence of true fused images for supervision in multimodal tasks is addressed in GBFF by constructing pseudo-supervision via granular-ball priors (Deng et al., 11 Apr 2025).
- Computational Bottlenecks:
Attention-based transformers and multi-level fusion modules may become computationally prohibitive at very high pixel counts unless restricted to linear-complexity designs (e.g., Mamba, GeminiFusion) (Jia et al., 3 Jun 2024, Lu et al., 14 Mar 2025).
Recent advances focus on incorporating diverse priors (granular balls, superpixels, explicit attention, saliency analysis), few-shot generalization, and expanding pixel-level supervision to ensure robustness and transparency across application domains. The interplay of adaptive weighting, efficient attention, and interpretable architectures constitutes an active frontier in pixel-level fusion research.