
Augmented Latent Intrinsics (ALI)

Updated 8 February 2026
  • Augmented Latent Intrinsics (ALI) is a framework for image relighting that disentangles scene-intrinsic properties from illumination using dense, pixel-aligned priors and self-supervised objectives.
  • It employs a two-stream encoder and a staged self-supervised training protocol to balance semantic abstraction with photometric fidelity on complex materials.
  • Empirical results demonstrate that ALI outperforms previous methods, achieving improved RMSE, SSIM, and PSNR metrics, particularly on specular and non-diffuse surfaces.

Augmented Latent Intrinsics (ALI) constitutes a framework for image-to-image relighting, aiming to disentangle scene-intrinsic properties from illumination using dense pixel-aligned priors and self-supervised objectives. Unlike classical inverse-graphics pipelines that seek explicit recovery of scene albedo, normals, and shading, or purely latent-intrinsic methods that operate over entangled feature spaces, ALI achieves photometrically faithful image manipulations by fusing hierarchically structured visual priors with learned latent representations. Empirical studies of ALI reveal a fundamental trade-off between semantic abstraction and photometric fidelity: leveraging high-level semantic encoders, a common strategy in vision-language and contrastive representation learning, can degrade performance in physically grounded tasks such as relighting, as critical fine-grained photometric cues are lost. ALI overcomes this by integrating dense, pixel-level feature backbones and applying a staged, self-supervised refinement protocol, achieving robust improvements on complex, view-dependent materials (Xing et al., 1 Feb 2026).

1. Problem Formulation and Motivation

Image relighting, in the context of ALI, seeks to generate a new image $\hat I$ of a scene $s$ under a target illumination $\ell_2$, given a source image $I_s^{\ell_1}$ captured under illumination $\ell_1$. The pipeline is formalized as:

$$\hat I_s^{\ell_1\to\ell_2} = \mathcal{G}\big(\mathcal{E}_{\rm intr}(I_s^{\ell_1}),\; \mathcal{E}_{\rm light}(I_s^{\ell_2})\big)$$

where $\mathcal{E}_{\rm intr}$ produces lighting-invariant "intrinsic" features, $\mathcal{E}_{\rm light}$ extracts a global lighting code from a reference image under the target illumination, and $\mathcal{G}$ denotes the learned decoder.
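This factorization reduces, at inference time, to a forward pass through two encoders and a decoder. The following PyTorch sketch makes the data flow concrete; the module and argument names are illustrative placeholders, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ALIRelighter(nn.Module):
    """Minimal sketch of the ALI relighting factorization.

    Module names (intrinsic_encoder, lighting_encoder, decoder) are
    hypothetical stand-ins for E_intr, E_light, and G."""

    def __init__(self, intrinsic_encoder: nn.Module,
                 lighting_encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.intrinsic_encoder = intrinsic_encoder  # E_intr: lighting-invariant features
        self.lighting_encoder = lighting_encoder    # E_light: global lighting code
        self.decoder = decoder                      # G: learned (diffusion) decoder

    def forward(self, src_img: torch.Tensor, ref_img: torch.Tensor) -> torch.Tensor:
        """src_img: scene s under illumination l1; ref_img: reference image
        under the target illumination l2. Returns the relit estimate."""
        z_intr = self.intrinsic_encoder(src_img)  # {z_{s,i}^{l1}}: intrinsic features
        z_light = self.lighting_encoder(ref_img)  # z_s^{l2}: global lighting vector
        return self.decoder(z_intr, z_light)      # \hat I_s^{l1 -> l2}
```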

In contrast to traditional inverse graphics, which reconstructs explicit maps (albedo $A(x)$, normal $N(x)$, shading $S(x;\ell)$), latent-intrinsic methods use hierarchical latent features $\{z_{s,i}^{\ell}\}_{i=1}^N$ (intrinsics) and a vector $z_s^\ell$ (lighting), such that

$$\hat I_s^{\ell_1\to\ell_2} = \phi\big(\{z_{s,i}^{\ell_1}\},\, z_s^{\ell_2}\big).$$

Failure modes of purely latent-intrinsic approaches, especially with limited real multi-illumination data, are severe on scenes with strong view-dependent reflectance (e.g., metals, glass): specularities are often misattributed or blurred. The naive hypothesis that stronger semantic encoders (e.g., DINO, CLIP) would resolve these ambiguities is not supported by evidence—in fact, these features induce a loss of photometric granularity crucial for relighting (Xing et al., 1 Feb 2026).

2. Mathematical Structure and Training Objectives

ALI is structured around two encoder streams and a diffusion decoder. Given an input $I_s^{\ell}\in\mathbb{R}^{H\times W\times 3}$, the feature decomposition is:

  • Intrinsic features $\{z_{s,i}^{\ell}\}_{i=1}^N$ (albedo/geometry-like)
  • Lighting code $z_s^{\ell}$

The relighting function is:

$$\hat I_s^{\ell_1\to\ell_2} = \phi\big(\{z_{s,i}^{\ell_1}\},\, z_s^{\ell_2}\big)$$

Training involves minimizing:

  • Reconstruction fidelity: $\mathcal{L}_{\rm relight} = \mathbb{E}_{s,\ell_1,\ell_2}\,\big\|I_s^{\ell_2} - \hat I_s^{\ell_1\to\ell_2}\big\|_2^2$
  • Lighting invariance: $\mathcal{L}_{\rm inv} = \sum_{s}\sum_{i}\sum_{m=1}^M \big\|z_{s,i}^{\ell_m} - \tfrac{1}{M}\sum_{m'=1}^M z_{s,i}^{\ell_{m'}}\big\|_2$, pulling each per-illumination intrinsic feature toward its mean over the $M$ illuminations of scene $s$
  • Hyperspherical regularization: enforces uniform coverage of the feature space

These constraints are orchestrated to anchor the intrinsic representation while ensuring relighting generalizes across lighting conditions.
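A minimal PyTorch sketch of these objectives follows. The reconstruction and invariance terms transcribe the formulas above; the hyperspherical regularizer is written in the log-sum-exp uniformity form of Wang and Isola (2020), which is an assumption, since the text states only that it enforces uniform feature coverage:

```python
import torch
import torch.nn.functional as F

def relight_loss(target: torch.Tensor, pred: torch.Tensor) -> torch.Tensor:
    """L_relight: mean squared error between the ground-truth image under
    the target illumination and the relit prediction."""
    return torch.mean((target - pred) ** 2)

def invariance_loss(z_views: torch.Tensor) -> torch.Tensor:
    """L_inv for one scene and one feature index: z_views has shape (M, C),
    one intrinsic feature per illumination; each view is pulled toward the
    mean over the M illuminations."""
    mean = z_views.mean(dim=0, keepdim=True)
    return (z_views - mean).norm(dim=1).sum()

def uniformity_loss(z: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Hyperspherical regularizer, sketched as the uniformity loss of
    Wang & Isola (2020) -- an assumed concrete form. z: (B, C) features,
    normalized onto the unit hypersphere before pairwise distances."""
    z = F.normalize(z, dim=1)
    sq_dists = torch.pdist(z).pow(2)  # pairwise squared Euclidean distances
    return sq_dists.mul(-t).exp().mean().log()
```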

3. Architecture and Pixel-Aligned Feature Fusion

ALI maintains a two-stream encoder architecture. The "semantic" stream $\mathcal{E}_{\rm sem}$ is a frozen, pixel-aligned visual backbone (RADIOv2.5H or MAE), from which a hierarchy of feature maps $\{F_{s,i}^\ell\}_{i=1}^N$ is extracted. Each feature map is upsampled to the input resolution and concatenated into a per-pixel hypercolumn:

Hsâ„“(x,y)=Concati[Up(Fs,iâ„“)(x,y)]H_s^\ell(x,y) = \mathrm{Concat}_i[\mathrm{Up}(F_{s,i}^\ell)(x,y)]

A lightweight projection module performs additive fusion into the original intrinsic features:

$$\tilde z_{s,i}^\ell(x,y) = z_{s,i}^\ell(x,y) + \mathrm{Proj}_{\theta'}\big(H_s^\ell(x,y)\big)$$

This mechanism injects dense semantic and photometric information directly at the pixel level, balancing the contextual coverage of the backbone against preservation of high-frequency image structure. Experiments show that contrastive/semantic encoders (CLIP, DINO) yield at best marginal gains, whereas dense reconstructive priors (RADIO, MAE) deliver significantly superior relighting accuracy on photometrically challenging surfaces (Xing et al., 1 Feb 2026).
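The fusion step amounts to multi-scale upsampling, per-pixel concatenation, and a learned projection added residually to the intrinsic features. A minimal PyTorch sketch, with a 1×1 convolution standing in for $\mathrm{Proj}_{\theta'}$ (an assumption; the text specifies only a lightweight projection module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypercolumnFusion(nn.Module):
    """Sketch of pixel-aligned feature fusion: upsample each backbone feature
    map to the input resolution, concatenate per pixel, project, and add the
    result to the intrinsic features. Channel sizes are illustrative."""

    def __init__(self, backbone_channels: list[int], intrinsic_channels: int):
        super().__init__()
        # A 1x1 convolution plays the role of Proj_theta' over the hypercolumn.
        self.proj = nn.Conv2d(sum(backbone_channels), intrinsic_channels,
                              kernel_size=1)

    def forward(self, feature_maps: list[torch.Tensor],
                z_intr: torch.Tensor) -> torch.Tensor:
        """feature_maps: per-level (B, C_i, h_i, w_i) maps from the frozen
        backbone E_sem; z_intr: (B, C, H, W) intrinsic features to augment."""
        H, W = z_intr.shape[-2:]
        # Up(F_i): bilinear upsampling of each level to the input resolution.
        ups = [F.interpolate(f, size=(H, W), mode="bilinear",
                             align_corners=False) for f in feature_maps]
        hypercolumn = torch.cat(ups, dim=1)     # H_s^l = Concat_i[Up(F_i)]
        return z_intr + self.proj(hypercolumn)  # z~ = z + Proj(H)
```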

4. Self-Supervised Refinement and Staged Training

ALI employs a three-stage training protocol, mitigating the scarcity of paired real-world relighting data:

  • Stage I: Train encoder fusion (freeze $\mathcal{E}_{\rm sem}$ and the decoder; learn the intrinsic encoder and projection)
  • Stage II: Decoder alignment (freeze encoders; fine-tune diffusion decoder)
  • Stage III: Self-supervised fine-tuning using a "Lighting Zoo"—synthetic pseudo-pairs sampled from batches where the model's own relighting serves as pseudo-ground truth. The denoising score-matching objective is used:

$$\mathcal{L}_{\rm denoise} = \mathbb{E}_{t,\epsilon}\big\|\phi(\alpha_t z_b + \beta_t \epsilon,\, \{\tilde z_{a,i}\},\, z_b) - \epsilon\big\|_2^2$$

Occasional identity relighting steps are mixed to preserve scene content.
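A sketch of this Stage III objective, assuming an epsilon-prediction diffusion decoder; the function signature and the handling of the noise schedule $(\alpha_t, \beta_t)$ are illustrative:

```python
import torch

def denoise_loss(decoder, z_b: torch.Tensor, z_intr_a: list,
                 alpha_t: float, beta_t: float) -> torch.Tensor:
    """Denoising score-matching sketch: corrupt the pseudo-target latent z_b
    with Gaussian noise and train the decoder phi to predict that noise,
    conditioned on the fused intrinsics {z~_a,i} and the lighting code z_b.
    The decoder call signature is a hypothetical stand-in."""
    eps = torch.randn_like(z_b)              # epsilon ~ N(0, I)
    noisy = alpha_t * z_b + beta_t * eps     # forward diffusion at step t
    eps_hat = decoder(noisy, z_intr_a, z_b)  # phi(alpha_t z_b + beta_t eps, ...)
    return torch.mean((eps_hat - eps) ** 2)  # ||eps_hat - eps||_2^2
```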

Key datasets include MIT MIIW (985 scenes × 25 illuminations) and BigTime (460 scenes × 20–50 illuminations).

5. Empirical Results and Quantitative Analysis

ALI achieves state-of-the-art results on unsupervised relighting benchmarks, especially on scenes with non-diffuse, specular, or metallic materials, the categories where semantic context and dense priors matter most. On the MIIW cross-scene benchmark:

  • RMSE: 0.294 (improved over LumiNet's 0.310)
  • SSIM: 0.464 (vs. LumiNet's 0.440)

For in-scene relighting:

  • PSNR: 18.87
  • RMSE: 0.119
  • LPIPS: 0.213
  • SSIM: 0.671

Material-wise breakdown indicates an approximate 6% improvement in SSIM for non-diffuse categories (metal/glass). Qualitative assessments confirm sharper specular highlights, improved caustics, and more physically consistent shadow placement than prior art (SA-AE, Latent-Intrinsics, RGB↔X, LumiNet). Ablations demonstrate:

  • Minor or negative impact from high-level semantic encoders (CLIP, DINO)
  • Significant performance gain from dense reconstructive priors (RADIOv2.5H, MAE), with RADIOv2.5H giving the best scores (e.g., PSNR↑18.34, SSIM↑0.596, RMSE↓0.126)
  • Stage-wise training improves geometric fidelity, specular-dynamic quality, and in-the-wild artifact removal sequentially

6. Analysis of the Semantic–Photometric Trade-Off

Experimental results reveal a counter-intuitive phenomenon: increasing the strength of semantic encoder priors degrades relighting performance. Semantic encoders, optimized for invariance and abstraction (e.g., CLIP, DINO), tend to remove the very pixel-level photometric structures necessary for physically plausible relighting. Dense reconstructive backbones such as RADIO and MAE, in contrast, preserve pixel-aligned cues vital for reconstructing directional shadows, specularities, and subtle caustic phenomena. This trade-off, established through both quantitative and ablation analyses, argues against reflexive application of large semantic vision encoders for generative inverse problems involving fine-grained physics (Xing et al., 1 Feb 2026).

7. Limitations and Prospective Directions

Current ALI models are limited by reliance on learned priors rather than explicit 3D geometry. Subtle global effects—caustics, interreflections, or fine-scale albedo variation—may be blurred or misattributed under challenging conditions. Further, ALI can confuse minor albedo differences with illumination, especially with highly atypical materials. Future research avenues include:

  • Integrating single-view geometry estimation into the intrinsic inference stream
  • Leveraging multi-view, view-consistent data to improve physically plausible disentanglement
  • Extending probing methods to other inverse graphics tasks (e.g., explicit reflectance editing, HDR relighting)
  • Systematically clarifying which visual priors optimally support downstream generative tasks

ALI establishes that maximal semantic abstraction is not always compatible with photometric fidelity. Its hybrid approach—merging pixel-aligned visual priors with hierarchical latent intrinsics under self-supervised, multi-stage optimization—offers a robust template for physically grounded generative modeling, particularly in regimes characterized by view-dependent, specular, or complex materials (Xing et al., 1 Feb 2026).
