
Intrinsic Image Fusion

Updated 18 December 2025
  • Intrinsic Image Fusion is a computational technique that disentangles scene properties like albedo, roughness, and shading from images using single-view priors and multi-view fusion.
  • It integrates diffusion-based generative models, low-dimensional parametric representations, and robust optimization to produce globally consistent material estimates.
  • The approach achieves state-of-the-art performance in 3D material reconstruction, delivering improved PSNR, SSIM, and reduced error metrics compared to traditional methods.

Intrinsic Image Fusion is an advanced computational strategy for recovering physically meaningful scene properties—such as albedo, roughness, metallicity, and shading—from image observations. It addresses the fundamental under-constrained nature of intrinsic image decomposition, especially in multi-view 3D material reconstruction, by leveraging single-view priors, diffusion-based generative models, low-dimensional parametric representations, robust fusion optimization, and physically grounded inverse rendering methods. The approach generalizes both classical single-image intrinsic decomposition and recent deep-learning-driven material estimation to reconstruct globally consistent, high-quality material representations suitable for downstream applications such as relighting and photorealistic rendering (Kocsis et al., 15 Dec 2025, Liu et al., 2018).

1. Overview and Motivation for Intrinsic Image Fusion

Intrinsic image modeling aims to disentangle and recover scene-intrinsic physical factors (notably reflectance/albedo and shading) from image intensities. The classical Bi-illumination Dichromatic Reflection (BIDR) model posits that, for purely diffuse surfaces under combined direct and ambient lighting, the observed color at pixel $p$ is given by:

$I^i(p) = b^i(p)\,\big[\gamma(p)\, d^i + a^i\big]$

where $b(p)$ is the diffuse albedo, $d$ and $a$ are the RGB vectors of direct and ambient light, and $\gamma(p)$ is the direct-shading visibility. Decomposing $I$ into reflectance $R$ and shading $S$ thus requires reasoning about multiple unknown, coupled quantities (Liu et al., 2018).
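
To make the BIDR forward model concrete, the following is a minimal NumPy sketch that composes an image from albedo, direct-shading visibility, and direct/ambient RGB illuminants; the array shapes and toy values are illustrative assumptions, not taken from the cited work.

```python
import numpy as np

def bidr_forward(albedo, gamma, d, a):
    """Compose an image under the BIDR model: I(p) = b(p) * (gamma(p) * d + a).

    albedo: (H, W, 3) diffuse albedo b(p)
    gamma:  (H, W)    direct-shading visibility in [0, 1]
    d, a:   (3,)      RGB direct and ambient illuminant vectors
    """
    shading = gamma[..., None] * d + a   # (H, W, 3) combined shading S
    return albedo * shading              # (H, W, 3) observed image I

# Toy usage: a flat gray surface with its right half directly lit.
H, W = 4, 4
albedo = np.full((H, W, 3), 0.5)
gamma = np.zeros((H, W)); gamma[:, W // 2:] = 1.0
image = bidr_forward(albedo, gamma,
                     d=np.array([0.9, 0.85, 0.8]),
                     a=np.array([0.1, 0.12, 0.15]))
```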

Single-view decomposition is highly under-constrained, prompting recent methods to arbitrarily fix illumination components or rely on strong priors. In multi-view 3D scenarios, the ambiguity is further aggravated by viewpoint-dependent material appearance and geometric uncertainty. Intrinsic Image Fusion (IIF) addresses this by integrating diffusion-based multi-hypothesis single-view estimation, robust parametric fusion, and inverse-rendering refinement, producing globally coherent intrinsic maps that outperform prior baselines in material disentanglement tasks (Kocsis et al., 15 Dec 2025).

2. Single-View Estimation with Diffusion Models

The first stage of IIF employs a diffusion-based generative model, RGBX, trained on paired low-dynamic-range (LDR) RGB images and physically based rendering (PBR) ground truth. For each input image, RGBX produces $K$ independent candidate decompositions:

$\{(a_{i,k}(u,v),\ r_{i,k}(u,v),\ m_{i,k}(u,v)) \mid k = 1, \dots, K\}$

where $a_{i,k}$ is the predicted albedo, $r_{i,k}$ the roughness, and $m_{i,k}$ the metallicity for object $i$ at pixel $(u,v)$. Each sample is generated by an independent denoising diffusion process, reflecting the aleatoric uncertainty inherent in single-view inverse rendering. No additional loss functions are introduced at this stage; consistency across samples is instead enforced in the subsequent fusion steps (Kocsis et al., 15 Dec 2025).
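
As an illustration of the multi-hypothesis sampling stage, the sketch below draws $K$ independent candidate decompositions by reseeding a diffusion sampler; `rgbx_sampler` and its output dictionary are hypothetical placeholders, since the RGBX interface is not specified in the source.

```python
import torch

def sample_candidates(rgbx_sampler, image, K=8, device="cuda"):
    """Draw K independent intrinsic decompositions for one LDR RGB image.

    `rgbx_sampler` is a hypothetical callable wrapping the RGBX diffusion
    model; it is assumed to map a (3, H, W) image to a dict with
    'albedo' (3, H, W), 'roughness' (1, H, W), and 'metallic' (1, H, W).
    """
    image = image.to(device)
    candidates = []
    for k in range(K):
        torch.manual_seed(k)              # independent denoising trajectories
        with torch.no_grad():
            candidates.append(rgbx_sampler(image))
    return candidates                     # list of K per-pixel PBR hypotheses
```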

3. Low-Dimensional Parametric Representation and Calibration

To enable 3D scene-wide material consistency, the candidate 2D PBR predictions are encoded in a compact, continuous neural-parametric field $f_\theta(x): \mathbb{R}^3 \rightarrow \mathbb{R}^{(3+3)\times 2}$, implemented using the hash-grid encoding design of InstantNGP. The network predicts distributions (specifically, parameterized Laplacians) for each scene point:

$f_\theta(x_n) = \big((\mu^a_n, b^a_n),\ (\mu^r_n, b^r_n),\ (\mu^m_n, b^m_n)\big)$

where $(\mu^{*}_n, b^{*}_n)$ are the location and scale parameters for albedo, roughness, and metallicity at $x_n$.
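
The following PyTorch sketch shows one plausible field head that outputs Laplacian location/scale pairs per material channel. The plain MLP stand-in for the InstantNGP hash-grid encoder and the 3+1+1 channel split are assumptions; the paper's field output is $(3+3)\times 2$-dimensional.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaplacianMaterialField(nn.Module):
    """Map an encoded 3D position to Laplacian (location, scale) parameters
    for albedo (3 ch), roughness (1 ch), and metallic (1 ch).

    A real implementation would consume hash-grid features (e.g. InstantNGP)
    rather than raw coordinates; this stand-in MLP is an assumption.
    """
    def __init__(self, in_dim=3, hidden=64, n_channels=5):  # 3 + 1 + 1
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * n_channels),   # location + scale per channel
        )

    def forward(self, x):
        mu, raw_b = self.mlp(x).chunk(2, dim=-1)
        mu = torch.sigmoid(mu)                   # material values live in [0, 1]
        b = F.softplus(raw_b) + 1e-4             # strictly positive Laplacian scale
        return mu, b                             # each of shape (..., n_channels)
```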

A parametric BRDF affine calibration is then applied to each single-view candidate:

$\bar{a}_{i,k}(u,v) = T^a_{i,k}\,\big[a_{i,k}(u,v);\ 1\big], \quad \bar{r}_{i,k} = T^r_{i,k}\,[r_{i,k};\ 1], \quad \bar{m}_{i,k} = T^m_{i,k}\,[m_{i,k};\ 1]$

where the $T^{*}_{i,k}$ are learnable affine transformations accounting for the global ambiguity within each object-sample pair (Kocsis et al., 15 Dec 2025).
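
A minimal sketch of such a homogeneous affine calibration is shown below; the $[A \mid t]$ parameterization and the identity initialization (matching the affine-identity prior used later) are assumptions about the exact form used in the paper.

```python
import torch

def affine_calibrate(x, T):
    """Apply a learnable affine calibration x_bar = A @ x + t in homogeneous form.

    x: (..., C)   per-pixel material values (C=3 for albedo, C=1 for roughness/metallic)
    T: (C, C+1)   affine matrix [A | t]; one such T per object-sample pair
    """
    A, t = T[:, :-1], T[:, -1]
    return x @ A.transpose(-1, -2) + t

def identity_affine(C):
    """Identity initialization, consistent with the affine-identity prior."""
    return torch.cat([torch.eye(C), torch.zeros(C, 1)], dim=1)
```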

4. Robust Multi-View Fusion and Optimization

Fusion is formulated as a robust distribution-matching problem, aligning the field $f_\theta$ to the aggregated evidence from all single-view samples. For each object $i$, softmax-weighted per-sample reference Laplacian distributions are computed:

$\mu^{\mathrm{ref}}_i = \Big[\textstyle\sum_k \alpha^a_{i,k}\,\bar{a}_{i,k},\ \ldots\Big], \quad b^{\mathrm{ref}}_i = \mathrm{median}_k\,\big|\,[\bar{a}_{i,k},\ \bar{r}_{i,k},\ \bar{m}_{i,k}] - \mu^{\mathrm{ref}}_i\,\big|$

with $\alpha^{*}_{i,k}$ the softmax weights over the $K$ samples (with temperature annealing).
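
The aggregation step can be sketched as a softmax-weighted location estimate with a median-absolute-deviation scale, as below; the per-sample agreement `scores` and temperature `tau` are placeholders for the paper's annealed weighting, which is not fully specified here.

```python
import torch

def reference_laplacian(samples, scores, tau=0.1):
    """Aggregate K calibrated candidates into a reference Laplacian per point.

    samples: (K, N, C) calibrated material values for one object
    scores:  (K, N)    per-sample agreement scores (assumed; exact scoring not reproduced)
    """
    alpha = torch.softmax(scores / tau, dim=0)              # (K, N) softmax weights
    mu_ref = (alpha.unsqueeze(-1) * samples).sum(dim=0)     # (N, C) weighted location
    b_ref = (samples - mu_ref).abs().median(dim=0).values   # (N, C) robust scale
    return mu_ref, b_ref
```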

The principal loss consists of the KL divergence between predicted and reference Laplacians:

$L_{\mathrm{data}} = \frac{1}{N} \sum_{n=1}^{N} D_{\mathrm{KL}}\!\left[\,p^{\mathrm{ref}}_{i_n} \,\big\|\, p^{\mathrm{pred}}_n\,\right]$

Regularizers include a label loss based on soft pseudo-label fitting and an affine-identity prior on $T$.
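
Because the KL divergence between two univariate Laplace distributions has a closed form, this data term can be evaluated analytically. A minimal sketch follows; applying it independently per material channel and averaging over points is an assumption about the reduction used.

```python
import torch

def kl_laplace(mu_ref, b_ref, mu_pred, b_pred):
    """Closed-form KL( Laplace(mu_ref, b_ref) || Laplace(mu_pred, b_pred) ).

    KL = log(b_pred / b_ref) + (b_ref * exp(-|dmu| / b_ref) + |dmu|) / b_pred - 1,
    applied elementwise, summed over channels, and averaged over points.
    """
    dmu = (mu_ref - mu_pred).abs()
    kl = (torch.log(b_pred / b_ref)
          + (b_ref * torch.exp(-dmu / b_ref) + dmu) / b_pred
          - 1.0)
    return kl.sum(dim=-1).mean()
```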

The total objective is

$L_{\mathrm{total}}(\theta, z, T) = w_{\mathrm{data}}\, L_{\mathrm{data}} + w_{\mathrm{label}}\, L_{\mathrm{label}} + w_{\mathrm{reg}}\, L_{\mathrm{reg}}$

with $w_{\mathrm{data}} = 1$, $w_{\mathrm{label}} = 1$, and $w_{\mathrm{reg}} = 100$ in the default settings. Optimization is performed with Adam, a batch size of $N = 65536$, and a learning rate decayed every two epochs (Kocsis et al., 15 Dec 2025).
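
A schematic optimization loop under these default weights might look as follows; `field`, `affines`, `batch_sampler`, and the three loss callables are placeholders for the components described above, and the learning-rate decay factor is an assumption (only the two-epoch schedule is stated in the source).

```python
import torch

def fit_field(field, affines, batch_sampler, data_loss, label_loss, affine_reg,
              epochs=10, lr=1e-2, w_data=1.0, w_label=1.0, w_reg=100.0):
    """Sketch of the fusion optimization: Adam over field and affine parameters."""
    params = list(field.parameters()) + list(affines.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    # Decay the learning rate every two epochs (decay factor assumed).
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=2, gamma=0.5)
    for _ in range(epochs):
        for batch in batch_sampler():            # e.g. 65536 sampled points per batch
            loss = (w_data * data_loss(field, affines, batch)
                    + w_label * label_loss(field, batch)
                    + w_reg * affine_reg(affines))
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
    return field, affines
```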

5. Inverse Path-Tracing Refinement

After robust fusion, the parametric field $f_\theta$ yields consistent but potentially globally misaligned PBR parameters, owing to remaining affine ambiguities and unknown scene illumination. These discrepancies are resolved by an alternating analysis-by-synthesis step using inverse path tracing.

  • The rendering equation (Kajiya formulation) with a Cook-Torrance microfacet BRDF is solved to match the predicted scene appearance to the observed images (a minimal BRDF-evaluation sketch follows this list).
  • Unknowns include the per-object affine transforms $(T^a_o, T^r_o, T^m_o)$, scene lighting parameters, and the camera response function (CRF).
  • Lighting parameters are optimized first (holding $f_\theta$ and $T$ fixed), followed by cached rendering of shading and a final BRDF parameter fit.
  • Regularization terms bias roughness/metallic parameters towards diffuse limits. All optimization is performed with large-batch SGD, leveraging multiple samples per point for stable convergence.
  • The physically based pipeline enables recovery of illumination-invariant, high-fidelity textures, sharp shadow boundaries, and accurate specular highlights (Kocsis et al., 15 Dec 2025).
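
As referenced in the list above, a minimal sketch of a Cook-Torrance microfacet BRDF (GGX normal distribution, Schlick Fresnel, Smith masking) plus a Lambertian diffuse lobe is given below; these specific microfacet terms are a standard choice and an assumption about the paper's exact formulation.

```python
import math
import torch
import torch.nn.functional as F

def cook_torrance_brdf(albedo, roughness, metallic, n, v, l):
    """Evaluate a Cook-Torrance microfacet BRDF with a Lambertian diffuse term.

    albedo: (..., 3); roughness, metallic: (..., 1); n, v, l: (..., 3) unit
    normal, view, and light directions. Returns the BRDF value (..., 3).
    """
    h = F.normalize(v + l, dim=-1)
    nl = (n * l).sum(-1, keepdim=True).clamp(min=1e-4)
    nv = (n * v).sum(-1, keepdim=True).clamp(min=1e-4)
    nh = (n * h).sum(-1, keepdim=True).clamp(min=1e-4)
    vh = (v * h).sum(-1, keepdim=True).clamp(min=1e-4)

    a2 = (roughness ** 2) ** 2                                   # GGX alpha^2 (Disney convention)
    D = a2 / (math.pi * (nh ** 2 * (a2 - 1.0) + 1.0) ** 2)       # normal distribution
    F0 = 0.04 * (1.0 - metallic) + albedo * metallic             # base reflectance
    Fr = F0 + (1.0 - F0) * (1.0 - vh) ** 5                       # Schlick Fresnel
    k = (roughness + 1.0) ** 2 / 8.0
    G = (nl / (nl * (1 - k) + k)) * (nv / (nv * (1 - k) + k))    # Smith masking-shadowing

    specular = D * Fr * G / (4.0 * nl * nv)
    diffuse = (1.0 - metallic) * albedo / math.pi
    return diffuse + specular
```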

6. Comparative Results and Validation

Intrinsic Image Fusion achieves state-of-the-art performance in multi-view intrinsic material estimation benchmarks, as summarized:

Method      | Albedo PSNR↑ | SSIM↑ | LPIPS↓ | Roughness L2↓ | Metallic L2↓ | Emission L2↓
NeILF++     | 13.18        | 0.733 | 0.375  | 0.103         | 0.047        | –
FIPT        | 10.63        | 0.661 | 0.403  | 0.110         | 0.006        | 2.208
IRIS        | 15.86        | 0.735 | 0.307  | 0.056         | 0.040        | 2.046
IIF (Ours)  | 20.72        | 0.846 | 0.201  | 0.028         | 0.007        | 0.384

Larger sample size ($K$), joint parametric aggregation, and distribution-matching strategies all contribute to improvements across the reported metrics. IIF outperforms mean-based and per-object aggregation approaches, demonstrating the benefit of probabilistic fusion and parametric embedding (Kocsis et al., 15 Dec 2025).

7. Extensions, Limitations, and Future Directions

Intrinsic Image Fusion demonstrates robustness across both synthetic and real-world datasets, generalizing well to room-scale indoor environments. Nevertheless, certain limitations persist:

  • The low-dimensional neural field assumes piecewise-smooth material distributions, which may be challenged by extremely high-frequency texture or non-Lambertian effects.
  • While fusion mitigates per-view ambiguity, it remains sensitive to systematic errors in single-view priors when all candidate samples share the same bias.
  • Inverse path-tracing refinement is computationally demanding, with full pipelines requiring approximately 60 minutes on contemporary high-end GPUs.

Potential future research directions include expansion to handle multiple, colored direct illuminants, explicit modeling of specular and interreflection terms within the intrinsic decomposition, pre-processing to extract texture layers for improved clustering, and unrolled learning-based analogs of fusion and refinement for accelerated inference (Liu et al., 2018, Kocsis et al., 15 Dec 2025).

Intrinsic Image Fusion provides a scalable, extensible, and high-accuracy framework for the joint estimation of scene-intrinsic properties across views, integrating advances in diffusion modeling, low-dimensional neural representations, robust multi-view optimization, and physically grounded rendering.
