Augmented Latent Intrinsics (ALI)
- Augmented Latent Intrinsics (ALI) is a framework for image relighting that disentangles scene-intrinsic properties from illumination using dense, pixel-aligned priors and self-supervised objectives.
- It employs a two-stream encoder and a staged self-supervised training protocol to balance semantic abstraction with photometric fidelity on complex materials.
- Empirical results demonstrate that ALI outperforms previous methods, achieving improved RMSE, SSIM, and PSNR metrics, particularly on specular and non-diffuse surfaces.
Augmented Latent Intrinsics (ALI) constitutes a framework for image-to-image relighting, aiming to disentangle scene-intrinsic properties from illumination using dense pixel-aligned priors and self-supervised objectives. Unlike classical inverse-graphics pipelines that seek explicit recovery of scene albedo, normals, and shading, or purely latent-intrinsic methods that operate over entangled feature spaces, ALI achieves photometrically faithful image manipulations by fusing hierarchically structured visual priors with learned latent representations. Empirical studies of ALI reveal a fundamental trade-off between semantic abstraction and photometric fidelity: leveraging high-level semantic encoders, a common strategy in vision-language and contrastive representation learning, can degrade performance in physically grounded tasks such as relighting, as critical fine-grained photometric cues are lost. ALI overcomes this by integrating dense, pixel-level feature backbones and applying a staged, self-supervised refinement protocol, achieving robust improvements on complex, view-dependent materials (Xing et al., 1 Feb 2026).
1. Problem Formulation and Motivation
Image relighting, in the context of ALI, seeks to generate a new image $\hat{I}_t$ of a scene under a target illumination $L_t$, given a source image $I_s$ captured under illumination $L_s$. The pipeline is formalized as

$$\hat{I}_t = D\big(E_{\text{int}}(I_s),\ \ell_t\big), \qquad \ell_t = E_{\text{light}}(I_{\text{ref}}),$$

where $E_{\text{int}}$ produces lighting-invariant "intrinsic" features, $E_{\text{light}}$ outputs a global lighting code (here extracted from a reference image $I_{\text{ref}}$ exhibiting the target illumination), and $D$ denotes the learned decoder.
In contrast to traditional inverse graphics, which reconstructs explicit maps (albedo $A$, normal $N$, shading $S$), latent-intrinsic methods instead use hierarchical latent features $Z$ (intrinsics) and a vector $\ell$ (lighting), such that

$$I \approx D(Z, \ell).$$
Failure modes of purely latent-intrinsic approaches, especially with limited real multi-illumination data, are severe on scenes with strong view-dependent reflectance (e.g., metals, glass): specularities are often misattributed or blurred. The naive hypothesis that stronger semantic encoders (e.g., DINO, CLIP) would resolve these ambiguities is not supported by evidence—in fact, these features induce a loss of photometric granularity crucial for relighting (Xing et al., 1 Feb 2026).
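To make the contrast concrete, the two pipelines can be schematized as follows; the function names and signatures are illustrative, not the paper's API:

```python
# Explicit inverse graphics: recover physical maps, then re-render.
def relight_explicit(decompose, render, img_src, light_tgt):
    albedo, normals, shading = decompose(img_src)  # explicit A, N, S maps
    return render(albedo, normals, light_tgt)      # physically-based re-render

# Latent-intrinsic (ALI-style): swap a learned lighting code instead.
def relight_latent(E_int, E_light, D, img_src, img_ref):
    z_src = E_int(img_src)    # lighting-invariant intrinsics Z
    l_tgt = E_light(img_ref)  # global lighting code from a reference image
    return D(z_src, l_tgt)    # decode under the target illumination
```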
2. Mathematical Structure and Training Objectives
ALI is structured around two encoder streams and a diffusion decoder. Given an input image $I$, the feature decomposition is:
- Intrinsic features $Z = E_{\text{int}}(I)$ (albedo/geometry-like)
- Lighting code $\ell = E_{\text{light}}(I)$

The relighting function is

$$\hat{I}_{s \to t} = D\big(E_{\text{int}}(I_s),\ E_{\text{light}}(I_t)\big) = D(Z_s, \ell_t).$$
Training involves minimizing:
- Reconstruction fidelity: $\mathcal{L}_{\text{rec}} = \lVert D(Z, \ell) - I \rVert$, so that decoding an image's own intrinsics and lighting code recovers the image
- Lighting invariance: $\mathcal{L}_{\text{inv}} = \lVert Z_s - Z_t \rVert$ for two images of the same scene under different illuminations
- Hyperspherical regularization: a uniformity term $\mathcal{L}_{\text{sphere}}$ on the normalized latent codes, to enforce uniform feature coverage of the unit hypersphere
These constraints are orchestrated to anchor the intrinsic representation while ensuring relighting generalizes across lighting conditions.
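A minimal PyTorch sketch of how these terms could be combined for a training pair (two images of the same scene under different illuminations) follows; the module interfaces, the omitted loss weights, and the choice to apply the uniformity term to lighting codes are assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def ali_objectives(E_int, E_light, D, img_a, img_b, t_unif=2.0):
    """Self-supervised losses for two images of the same scene under
    different illuminations. E_int, E_light, D stand in for the intrinsic
    encoder, lighting encoder, and decoder."""
    z_a, z_b = E_int(img_a), E_int(img_b)      # intrinsic features Z
    l_a, l_b = E_light(img_a), E_light(img_b)  # global lighting codes l

    # Reconstruction fidelity: decoding an image's own (Z, l) recovers it.
    loss_rec = F.l1_loss(D(z_a, l_a), img_a) + F.l1_loss(D(z_b, l_b), img_b)

    # Lighting invariance: intrinsics should agree across illuminations.
    loss_inv = F.mse_loss(z_a, z_b)

    # Hyperspherical uniformity (Wang & Isola-style pairwise potential),
    # applied here to the normalized lighting codes (illustrative choice).
    l_all = F.normalize(torch.cat([l_a, l_b], dim=0), dim=-1)
    loss_unif = torch.pdist(l_all).pow(2).mul(-t_unif).exp().mean().log()

    return loss_rec + loss_inv + loss_unif
```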
3. Architecture and Pixel-Aligned Feature Fusion
ALI maintains a two-stream encoder architecture. The "semantic" stream is a frozen, pixel-aligned visual backbone (RADIOv2.5H or MAE), from which a hierarchy of feature maps $\{F_k\}_{k=1}^{K}$ is extracted. Each feature map is upsampled to input resolution and concatenated into a per-pixel hypercolumn

$$H(p) = \big[\,\mathrm{up}(F_1)(p);\ \dots;\ \mathrm{up}(F_K)(p)\,\big].$$

A lightweight projection module $\phi$ performs additive fusion into the original intrinsic features:

$$Z' = Z + \phi(H).$$
This mechanism injects dense semantic and photometric information directly at the pixel level, balancing the contextual coverage of the backbone against preservation of high-frequency image structure. Experiments show that while contrastive/semantic encoders (CLIP, DINO) offer at most marginal benefit to downstream performance, dense reconstructive priors (RADIO, MAE) yield significantly superior relighting accuracy on photometrically challenging surfaces (Xing et al., 1 Feb 2026).
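The fusion step admits a compact sketch; the feature dimensions, the 1×1-convolution projection head, and the module names below are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAlignedFusion(nn.Module):
    """Upsample a hierarchy of frozen-backbone feature maps to input
    resolution, concatenate them into per-pixel hypercolumns H, and
    additively fuse a projection into the intrinsics: Z' = Z + phi(H)."""

    def __init__(self, backbone_dims, intrinsic_dim):
        super().__init__()
        # Lightweight projection phi: hypercolumn -> intrinsic feature space.
        self.proj = nn.Conv2d(sum(backbone_dims), intrinsic_dim, kernel_size=1)

    def forward(self, feats, z, out_hw):
        # feats: list of (B, C_k, H_k, W_k) maps from the frozen backbone
        # z:     (B, C_z, H, W) intrinsic features from the ALI encoder
        ups = [F.interpolate(f, size=out_hw, mode="bilinear",
                             align_corners=False) for f in feats]
        hypercolumn = torch.cat(ups, dim=1)  # stack per-pixel features
        return z + self.proj(hypercolumn)    # additive fusion

# Hypothetical usage: four backbone levels fused into 128-dim intrinsics.
# fusion = PixelAlignedFusion([256, 512, 1024, 1024], intrinsic_dim=128)
# z_fused = fusion(feats, z, out_hw=(256, 256))
```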
4. Self-Supervised Refinement and Staged Training
ALI employs a three-stage training protocol, mitigating the scarcity of paired real-world relighting data:
- Stage I: Train encoder fusion (freeze the visual backbone and the decoder; learn the intrinsic encoder and the projection $\phi$)
- Stage II: Decoder alignment (freeze encoders; fine-tune diffusion decoder)
- Stage III: Self-supervised fine-tuning using a "Lighting Zoo" of synthetic pseudo-pairs sampled from batches, in which the model's own relighting serves as pseudo-ground truth. The denoising score-matching objective is used:

$$\mathcal{L}_{\text{DSM}} = \mathbb{E}_{t,\,\epsilon}\Big[\big\lVert \epsilon_\theta(x_t, t \mid Z, \ell) - \epsilon \big\rVert^2\Big],$$

where $x_t$ is the noised pseudo-target and $\epsilon_\theta$ the decoder's noise predictor. Occasional identity-relighting steps are mixed in to preserve scene content.
Key datasets include MIT MIIW (985 scenes × 25 illuminations) and BigTime (460 scenes × 20–50 illuminations).
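A schematic of one Stage III update, under explicit assumptions (an epsilon-prediction diffusion decoder with a linear DDPM schedule, a hypothetical conditioning signature `D_eps(x_t, t, z, l)`, and an assumed identity-mixing probability):

```python
import torch
import torch.nn.functional as F

T = 1000
BETAS = torch.linspace(1e-4, 0.02, T)  # illustrative linear schedule
ABAR = torch.cumprod(1.0 - BETAS, dim=0)

def add_noise(x0, eps, t):
    # Forward diffusion q(x_t | x_0) under the schedule above.
    abar_t = ABAR.to(x0.device)[t].view(-1, 1, 1, 1)
    return abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * eps

@torch.no_grad()
def sample_relit(D_eps, z, l, shape):
    # Minimal ancestral DDPM sampler conditioned on intrinsics z and
    # lighting l; produces the pseudo-ground-truth relighting.
    x = torch.randn(shape, device=z.device)
    for i in reversed(range(T)):
        t = torch.full((shape[0],), i, dtype=torch.long, device=z.device)
        eps_hat = D_eps(x, t, z, l)
        x = (x - BETAS[i] / (1 - ABAR[i]).sqrt() * eps_hat) / (1 - BETAS[i]).sqrt()
        if i > 0:
            x = x + BETAS[i].sqrt() * torch.randn_like(x)
    return x

def stage3_step(E_int, E_light, D_eps, imgs, p_identity=0.1):
    """One 'Lighting Zoo' step: swap lighting codes within the batch,
    take the model's own relighting as pseudo-ground truth, and train
    the noise predictor with denoising score matching."""
    B = imgs.size(0)
    with torch.no_grad():
        z, l = E_int(imgs), E_light(imgs)
        perm = torch.randperm(B)
        keep = torch.rand(B) < p_identity  # occasional identity relighting
        perm[keep] = torch.arange(B)[keep]
        pseudo_gt = sample_relit(D_eps, z, l[perm], imgs.shape)

    t = torch.randint(0, T, (B,), device=imgs.device)
    eps = torch.randn_like(pseudo_gt)
    x_t = add_noise(pseudo_gt, eps, t)
    # L_DSM = E_{t,eps} || eps_theta(x_t, t | Z, l) - eps ||^2
    return F.mse_loss(D_eps(x_t, t, z, l[perm]), eps)
```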
5. Empirical Results and Quantitative Analysis
ALI achieves state-of-the-art results in unsupervised relighting benchmarks, especially on scenes with non-diffuse, specular, or metallic materials—categories where semantic context and dense priors are critical. On the MIIW cross-scene benchmark:
- RMSE: 0.294 (improved over LumiNet's 0.310)
- SSIM: 0.464 (vs. LumiNet's 0.440)
In in-scene relighting:
- PSNR: 18.87
- RMSE: 0.119
- LPIPS: 0.213
- SSIM: 0.671
Material-wise breakdown indicates an approximate 6% improvement in SSIM for non-diffuse categories (metal/glass). Qualitative assessments confirm sharper specular highlights, improved caustics, and more physically consistent shadow placement than prior art (SA-AE, Latent-Intrinsics, RGB↔X, LumiNet). Ablations demonstrate:
- Minor or negative impact from high-level semantic encoders (CLIP, DINO)
- Significant performance gain from dense reconstructive priors (RADIOv2.5H, MAE), with RADIOv2.5H giving the best scores (e.g., PSNR↑18.34, SSIM↑0.596, RMSE↓0.126)
- Stage-wise training sequentially improves geometric fidelity, the quality of view-dependent specular effects, and artifact removal on in-the-wild inputs
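The reported metrics follow their standard full-reference definitions; for concreteness, a sketch using scikit-image (the paper's exact evaluation protocol, e.g., color space, resolution, or scaling, is not specified here):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def relighting_metrics(pred, gt):
    """Full-reference metrics for float images in [0, 1], shape (H, W, 3).
    LPIPS additionally requires a learned perceptual model (e.g., the
    `lpips` package) and is omitted here."""
    rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    return {"RMSE": rmse, "PSNR": psnr, "SSIM": ssim}
```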
6. Analysis of the Semantic–Photometric Trade-Off
Experimental results reveal a counter-intuitive phenomenon: increasing the strength of semantic encoder priors degrades relighting performance. Semantic encoders, optimized for invariance and abstraction (e.g., CLIP, DINO), tend to remove the very pixel-level photometric structures necessary for physically plausible relighting. Dense reconstructive backbones such as RADIO and MAE, in contrast, preserve pixel-aligned cues vital for reconstructing directional shadows, specularities, and subtle caustic phenomena. This trade-off, established through both quantitative and ablation analyses, argues against reflexive application of large semantic vision encoders for generative inverse problems involving fine-grained physics (Xing et al., 1 Feb 2026).
7. Limitations and Prospective Directions
Current ALI models are limited by reliance on learned priors rather than explicit 3D geometry. Subtle global effects—caustics, interreflections, or fine-scale albedo variation—may be blurred or misattributed under challenging conditions. Further, ALI can confuse minor albedo differences with illumination, especially with highly atypical materials. Future research avenues include:
- Integrating single-view geometry estimation into the intrinsic inference stream
- Leveraging multi-view, view-consistent data to improve physically plausible disentanglement
- Extending probing methods to other inverse graphics tasks (e.g., explicit reflectance editing, HDR relighting)
- Systematically clarifying which visual priors optimally support downstream generative tasks
ALI establishes that maximal semantic abstraction is not always compatible with photometric fidelity. Its hybrid approach—merging pixel-aligned visual priors with hierarchical latent intrinsics under self-supervised, multi-stage optimization—offers a robust template for physically grounded generative modeling, particularly in regimes characterized by view-dependent, specular, or complex materials (Xing et al., 1 Feb 2026).