Intrinsic Image Fusion
- Intrinsic Image Fusion is a computational technique that disentangles scene properties like albedo, roughness, and shading from images using single-view priors and multi-view fusion.
- It integrates diffusion-based generative models, low-dimensional parametric representations, and robust optimization to produce globally consistent material estimates.
- The approach achieves state-of-the-art performance in 3D material reconstruction, delivering higher PSNR and SSIM and lower error metrics than prior methods.
Intrinsic Image Fusion is an advanced computational strategy for recovering physically meaningful scene properties—such as albedo, roughness, metallicity, and shading—from image observations. It addresses the fundamental under-constrained nature of intrinsic image decomposition, especially in multi-view 3D material reconstruction, by leveraging single-view priors, diffusion-based generative models, low-dimensional parametric representations, robust fusion optimization, and physically grounded inverse rendering methods. The approach generalizes both classical single-image intrinsic decomposition and recent deep-learning-driven material estimation to reconstruct globally consistent, high-quality material representations suitable for downstream applications such as relighting and photorealistic rendering (Kocsis et al., 15 Dec 2025, Liu et al., 2018).
1. Overview and Motivation for Intrinsic Image Fusion
Intrinsic image modeling aims to disentangle and recover scene-intrinsic physical factors (notably reflectance/albedo and shading) from image intensities. The classical Bi-illumination Dichromatic Reflection (BIDR) model posits that, for purely diffuse surfaces under combined direct and ambient lighting, the observed color at pixel $p$ is given by
$$\mathbf{I}(p) = \mathbf{R}(p)\,\big(\gamma(p)\,\mathbf{L}_d + \mathbf{L}_a\big),$$
where $\mathbf{R}(p)$ is the diffuse albedo, $\mathbf{L}_d$ and $\mathbf{L}_a$ are the RGB vectors of direct and ambient light, and $\gamma(p)\in[0,1]$ is the direct shading visibility. Decomposing $\mathbf{I}$ into reflectance and shading thus requires reasoning about multiple unknown, coupled quantities (Liu et al., 2018).
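To make the BIDR image-formation model concrete, the following NumPy sketch composes an image from an albedo map, a direct-shading visibility map, and direct/ambient light colors. The function name, array shapes, and example values are illustrative assumptions rather than part of the cited formulation.

```python
import numpy as np

def bidr_forward(albedo, gamma, L_direct, L_ambient):
    """Compose an image under the BIDR model: I(p) = R(p) * (gamma(p) * L_d + L_a).

    albedo    : (H, W, 3) diffuse reflectance R(p) in [0, 1]
    gamma     : (H, W)    direct-shading visibility in [0, 1]
    L_direct  : (3,)      RGB color of the direct illuminant
    L_ambient : (3,)      RGB color of the ambient illuminant
    """
    shading = gamma[..., None] * L_direct + L_ambient   # (H, W, 3) per-pixel illumination
    return albedo * shading                              # element-wise per channel

# Tiny example: a flat gray patch whose bottom row is fully in shadow (gamma = 0).
albedo = np.full((2, 2, 3), 0.5)
gamma = np.array([[1.0, 1.0], [0.0, 0.0]])
image = bidr_forward(albedo, gamma,
                     L_direct=np.array([1.0, 0.9, 0.8]),
                     L_ambient=np.array([0.2, 0.2, 0.3]))
print(image.shape)  # (2, 2, 3)
```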
Single-view decomposition is highly under-constrained, prompting recent methods to arbitrarily fix illumination components or rely on strong priors. In multi-view 3D scenarios, ambiguity is further aggravated by viewpoint-dependent material appearance and geometric uncertainty. Intrinsic Image Fusion (IIF) addresses this by integrating diffusion-based multi-hypothesis single-view estimation, robust parametric fusion, and inverse rendering refinement, producing globally coherent intrinsic maps that outperform prior baselines in material disentanglement tasks (Kocsis et al., 15 Dec 2025).
2. Single-View Estimation with Diffusion Models
The first stage of IIF employs a diffusion-based generative model, RGBX, trained on paired low-dynamic-range (LDR) RGB images and physically based rendering (PBR) ground truth. For each input image, RGBX produces $S$ independent candidate decompositions
$$\big\{\,(\mathbf{a}_s(p),\; r_s(p),\; m_s(p))\,\big\}_{s=1}^{S},$$
where $\mathbf{a}_s(p)$ is the predicted albedo, $r_s(p)$ the roughness, and $m_s(p)$ the metallicity for the object at pixel $p$ in sample $s$. Each sample is generated by an independent denoising diffusion process, reflecting the aleatoric uncertainty inherent in single-view inverse rendering. No additional loss functions are introduced at this stage; consistency between samples is instead enforced in the subsequent fusion steps (Kocsis et al., 15 Dec 2025).
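A minimal sketch of the multi-hypothesis sampling stage is given below. The `sample_decomposition` wrapper is a hypothetical stand-in for one image-conditioned denoising pass (the actual RGBX interface is not reproduced here); only the pattern of drawing several independent candidates per view follows the text.

```python
import numpy as np

def sample_decomposition(image, rng):
    """Placeholder for one denoising-diffusion pass of an RGB-to-PBR model.

    In the actual pipeline this would be a conditional diffusion sampler;
    here random maps of the right shape are returned purely for illustration.
    """
    h, w, _ = image.shape
    albedo = rng.random((h, w, 3))
    roughness = rng.random((h, w))
    metallic = rng.random((h, w))
    return albedo, roughness, metallic

def sample_candidates(image, num_samples=8, seed=0):
    """Draw independent candidate decompositions, reflecting aleatoric uncertainty."""
    rng = np.random.default_rng(seed)
    return [sample_decomposition(image, rng) for _ in range(num_samples)]

candidates = sample_candidates(np.zeros((64, 64, 3)), num_samples=4)
print(len(candidates))  # 4 independent (albedo, roughness, metallic) hypotheses
```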
3. Low-Dimensional Parametric Representation and Calibration
To enable 3D scene-wide material consistency, the candidate 2D PBR predictions are encoded in a compact, continuous neural-parametric field $F_\theta$, implemented with the hash-grid encoding design of InstantNGP. At each scene point $\mathbf{x}$, the network predicts parameterized Laplace distributions,
$$F_\theta(\mathbf{x}) = \big(\mu_a(\mathbf{x}), b_a(\mathbf{x}),\; \mu_r(\mathbf{x}), b_r(\mathbf{x}),\; \mu_m(\mathbf{x}), b_m(\mathbf{x})\big),$$
where $\mu_{\{a,r,m\}}$ and $b_{\{a,r,m\}}$ are the location and scale parameters for albedo, roughness, and metallicity at $\mathbf{x}$.
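The sketch below illustrates a field of this kind, assuming a plain MLP in place of the InstantNGP hash-grid encoder and illustrative layer sizes; it maps a 3D point to Laplace location and scale parameters for albedo, roughness, and metallicity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaplaceMaterialField(nn.Module):
    """Maps a 3D point to location/scale parameters of per-channel Laplace
    distributions over albedo (3), roughness (1), and metallic (1).

    A plain MLP stands in for the hash-grid encoder used in the paper;
    the layer sizes here are illustrative assumptions.
    """
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 10),  # 5 locations + 5 log-scales
        )

    def forward(self, x):
        out = self.net(x)
        loc = torch.sigmoid(out[..., :5])           # albedo RGB, roughness, metallic in [0, 1]
        scale = F.softplus(out[..., 5:]) + 1e-4     # strictly positive Laplace scales
        return loc, scale

field = LaplaceMaterialField()
loc, scale = field(torch.rand(1024, 3))   # query 1024 scene points
print(loc.shape, scale.shape)             # torch.Size([1024, 5]) torch.Size([1024, 5])
```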
A parametric BRDF affine calibration is then applied to each single-view candidate,
$$\tilde{\mathbf{a}}_s(p) = \mathbf{A}_{o,s}\,\mathbf{a}_s(p) + \mathbf{b}_{o,s}$$
(and analogously for roughness and metallicity), where $(\mathbf{A}_{o,s}, \mathbf{b}_{o,s})$ are learnable affine transformations accounting for global ambiguity within each object-sample pair $(o, s)$ (Kocsis et al., 15 Dec 2025).
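A minimal sketch of such a calibration follows, assuming per-(object, sample) channel-wise scale and shift parameters initialized at identity; the class name and tensor layout are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AffineCalibration(nn.Module):
    """Per-(object, sample) affine correction a' = A * a + b applied channel-wise.

    Initialization at identity (scale 1, shift 0) reflects the affine-identity
    prior mentioned in the fusion objective below.
    """
    def __init__(self, num_objects, num_samples, channels=5):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_objects, num_samples, channels))
        self.shift = nn.Parameter(torch.zeros(num_objects, num_samples, channels))

    def forward(self, pred, obj_idx, sample_idx):
        # pred: (N, channels) candidate PBR values for points on object obj_idx,
        # drawn from single-view sample sample_idx.
        return self.scale[obj_idx, sample_idx] * pred + self.shift[obj_idx, sample_idx]

calib = AffineCalibration(num_objects=3, num_samples=8)
corrected = calib(torch.rand(100, 5), obj_idx=1, sample_idx=4)
print(corrected.shape)  # torch.Size([100, 5])
```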
4. Robust Multi-View Fusion and Optimization
Fusion is formulated as a robust distribution-matching problem, aligning the field $F_\theta$ to the aggregated evidence from all single-view samples. For each object $o$, a per-point reference Laplace distribution is formed by softmax-weighting the calibrated per-sample predictions, with the softmax weights taken over samples and a temperature that is annealed during optimization.
The principal loss is the KL divergence between the predicted and reference Laplace distributions. Regularizers include a label loss based on soft pseudo-label fitting and an affine-identity prior on the calibration transforms $(\mathbf{A}_{o,s}, \mathbf{b}_{o,s})$.
The total objective combines these terms,
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{KL}} + \lambda_{\mathrm{label}}\,\mathcal{L}_{\mathrm{label}} + \lambda_{\mathrm{affine}}\,\mathcal{L}_{\mathrm{affine}},$$
with fixed default loss weights. Optimization is performed with Adam at a fixed batch size, with the learning rate decayed every two epochs (Kocsis et al., 15 Dec 2025).
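The following sketch shows how such a fusion objective could be assembled, using PyTorch's closed-form KL divergence between Laplace distributions together with placeholder regularization terms and weights; the KL direction, the reduction over points, and the default weights are assumptions of this sketch, not values from the paper.

```python
import torch
from torch.distributions import Laplace, kl_divergence

def fusion_kl_loss(pred_loc, pred_scale, ref_loc, ref_scale):
    """KL divergence between the reference Laplace distributions and the field's
    predicted Laplace distributions, averaged over points and channels."""
    pred = Laplace(pred_loc, pred_scale)
    ref = Laplace(ref_loc, ref_scale)
    return kl_divergence(ref, pred).mean()

def total_loss(pred_loc, pred_scale, ref_loc, ref_scale,
               label_term, affine_term, w_label=1.0, w_affine=1.0):
    """Illustrative total objective: KL term plus weighted placeholder regularizers."""
    return fusion_kl_loss(pred_loc, pred_scale, ref_loc, ref_scale) \
        + w_label * label_term + w_affine * affine_term

loc = torch.rand(10, 5)
print(total_loss(loc, torch.full_like(loc, 0.1),
                 loc + 0.05, torch.full_like(loc, 0.1),
                 label_term=torch.tensor(0.0), affine_term=torch.tensor(0.0)))
```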
5. Inverse Path-Tracing Refinement
After robust fusion, the parametric field yields consistent but potentially globally misaligned PBR parameters due to remaining affine ambiguities and unknown scene illumination. These discrepancies are resolved by an alternating analysis-by-synthesis step using inverse path tracing.
- The rendering equation (Kajiya formulation) with a Cook-Torrance microfacet BRDF is solved to match the predicted scene appearance to the observed images (see the BRDF sketch after this list).
- Unknowns include the per-object affine transforms, scene lighting parameters, and the camera response function (CRF).
- Lighting parameters are optimized first (holding the other unknowns fixed), followed by cached rendering of shading and a final BRDF parameter fit.
- Regularization terms bias roughness/metallic parameters towards diffuse limits. All optimization is performed with large-batch SGD, leveraging multiple samples per point for stable convergence.
- The physically based pipeline enables recovery of illumination-invariant, high-fidelity textures, sharp shadow boundaries, and accurate specular highlights (Kocsis et al., 15 Dec 2025).
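As referenced in the list above, a generic Cook-Torrance microfacet BRDF can be evaluated as in the sketch below, using a GGX normal distribution, Smith-Schlick masking, and Schlick Fresnel. This is a textbook formulation under stated assumptions (e.g. the choice of masking approximation and a 0.04 dielectric base reflectance), not the exact renderer used in the paper.

```python
import numpy as np

def ggx_ndf(n_dot_h, alpha):
    """GGX/Trowbridge-Reitz normal distribution function."""
    a2 = alpha * alpha
    denom = n_dot_h * n_dot_h * (a2 - 1.0) + 1.0
    return a2 / (np.pi * denom * denom)

def smith_g1(n_dot_x, alpha):
    """Single-direction Smith masking term, Schlick approximation (assumption)."""
    k = alpha / 2.0
    return n_dot_x / (n_dot_x * (1.0 - k) + k)

def fresnel_schlick(h_dot_v, f0):
    """Schlick approximation to the Fresnel reflectance."""
    return f0 + (1.0 - f0) * (1.0 - h_dot_v) ** 5

def cook_torrance_brdf(n, v, l, albedo, roughness, metallic):
    """Evaluate a Cook-Torrance BRDF for unit vectors n (normal), v (view), l (light)."""
    h = (v + l) / np.linalg.norm(v + l)
    n_dot_v = max(np.dot(n, v), 1e-4)
    n_dot_l = max(np.dot(n, l), 1e-4)
    n_dot_h = max(np.dot(n, h), 0.0)
    h_dot_v = max(np.dot(h, v), 0.0)

    alpha = roughness * roughness
    f0 = 0.04 * (1.0 - metallic) + albedo * metallic   # base reflectance
    d = ggx_ndf(n_dot_h, alpha)
    g = smith_g1(n_dot_v, alpha) * smith_g1(n_dot_l, alpha)
    f = fresnel_schlick(h_dot_v, f0)

    specular = d * g * f / (4.0 * n_dot_v * n_dot_l)
    diffuse = (1.0 - f) * (1.0 - metallic) * albedo / np.pi
    return diffuse + specular

n = np.array([0.0, 0.0, 1.0])
v = np.array([0.0, 0.3, 1.0]); v /= np.linalg.norm(v)
l = np.array([0.3, 0.0, 1.0]); l /= np.linalg.norm(l)
print(cook_torrance_brdf(n, v, l, albedo=np.array([0.8, 0.2, 0.2]),
                         roughness=0.4, metallic=0.0))
```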
6. Comparative Results and Validation
Intrinsic Image Fusion achieves state-of-the-art performance in multi-view intrinsic material estimation benchmarks, as summarized:
| Method | Albedo PSNR↑ | SSIM↑ | LPIPS↓ | Roughness L2↓ | Metallic L2↓ | Emission L2↓ |
|---|---|---|---|---|---|---|
| NeILF++ | 13.18 | 0.733 | 0.375 | 0.103 | 0.047 | – |
| FIPT | 10.63 | 0.661 | 0.403 | 0.110 | 0.006 | 2.208 |
| IRIS | 15.86 | 0.735 | 0.307 | 0.056 | 0.040 | 2.046 |
| IIF (Ours) | 20.72 | 0.846 | 0.201 | 0.028 | 0.007 | 0.384 |
A larger number of per-view samples, joint parametric aggregation, and the distribution-matching strategy all contribute to improvements across the reported metrics. IIF outperforms mean-based and per-object aggregation approaches, demonstrating the benefit of probabilistic fusion and parametric embedding (Kocsis et al., 15 Dec 2025).
7. Extensions, Limitations, and Future Directions
Intrinsic Image Fusion demonstrates robustness across both synthetic and real-world datasets, generalizing well to room-scale indoor environments. Nevertheless, certain limitations persist:
- The low-dimensional neural field assumes piecewise-smooth material distributions, which may be challenged by extremely high-frequency texture or non-Lambertian effects.
- While fusion mitigates per-view ambiguity, it remains sensitive to systematic errors in single-view priors when all candidate samples share the same bias.
- Inverse path-tracing refinement is computationally demanding, with full pipelines requiring approximately 60 minutes on contemporary high-end GPUs.
Potential future research directions include expansion to handle multiple, colored direct illuminants, explicit modeling of specular and interreflection terms within the intrinsic decomposition, pre-processing to extract texture layers for improved clustering, and unrolled learning-based analogs of fusion and refinement for accelerated inference (Liu et al., 2018, Kocsis et al., 15 Dec 2025).
Intrinsic Image Fusion provides a scalable, extensible, and high-accuracy framework for the joint estimation of scene-intrinsic properties across views, integrating advances in diffusion modeling, low-dimensional neural representations, robust multi-view optimization, and physically grounded rendering.