Augmented Latent Intrinsics (ALI)
- Augmented Latent Intrinsics (ALI) is a framework for image relighting that disentangles scene-intrinsic properties from illumination using dense, pixel-aligned priors and self-supervised objectives.
- It employs a two-stream encoder and a staged self-supervised training protocol to balance semantic abstraction with photometric fidelity on complex materials.
- Empirical results demonstrate that ALI outperforms previous methods, achieving improved RMSE, SSIM, and PSNR metrics, particularly on specular and non-diffuse surfaces.
Augmented Latent Intrinsics (ALI) constitutes a framework for image-to-image relighting, aiming to disentangle scene-intrinsic properties from illumination using dense pixel-aligned priors and self-supervised objectives. Unlike classical inverse-graphics pipelines that seek explicit recovery of scene albedo, normals, and shading, or purely latent-intrinsic methods that operate over entangled feature spaces, ALI achieves photometrically faithful image manipulations by fusing hierarchically structured visual priors with learned latent representations. Empirical studies of ALI reveal a fundamental trade-off between semantic abstraction and photometric fidelity: leveraging high-level semantic encoders, a common strategy in vision-language and contrastive representation learning, can degrade performance in physically grounded tasks such as relighting, as critical fine-grained photometric cues are lost. ALI overcomes this by integrating dense, pixel-level feature backbones and applying a staged, self-supervised refinement protocol, achieving robust improvements on complex, view-dependent materials (Xing et al., 1 Feb 2026).
1. Problem Formulation and Motivation
Image relighting, in the context of ALI, seeks to generate a new image $\hat{I}_t$ of a scene under a target illumination $L_t$, given a source image $I_s$ captured under illumination $L_s$. The pipeline is formalized as

$$\hat{I}_t = D\big(E_{\text{int}}(I_s),\ \ell_t\big), \qquad \ell_t = E_{\text{light}}(I_{\text{ref}}),$$

where $E_{\text{int}}$ produces lighting-invariant "intrinsic" features, $E_{\text{light}}$ outputs a global lighting code (here extracted from a reference image $I_{\text{ref}}$ exhibiting the target illumination), and $D$ denotes the learned decoder.
In contrast to traditional inverse graphics, which reconstructs explicit maps (albedo $A$, normal $N$, shading $S$), latent-intrinsic methods instead use hierarchical latent features $Z$ (intrinsics) and a vector $\ell$ (lighting), such that

$$I \approx D(Z, \ell).$$
Failure modes of purely latent-intrinsic approaches, especially with limited real multi-illumination data, are severe on scenes with strong view-dependent reflectance (e.g., metals, glass): specularities are often misattributed or blurred. The naive hypothesis that stronger semantic encoders (e.g., DINO, CLIP) would resolve these ambiguities is not supported by evidence—in fact, these features induce a loss of photometric granularity crucial for relighting (Xing et al., 1 Feb 2026).
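To make the contrast concrete, the two pipelines can be schematized as follows; the function names and signatures are illustrative, not the paper's API:

```python
# Explicit inverse graphics: recover physical maps, then re-render.
def relight_explicit(decompose, render, img_src, light_tgt):
    albedo, normals, shading = decompose(img_src)  # explicit A, N, S maps
    return render(albedo, normals, light_tgt)      # physically-based re-render

# Latent-intrinsic (ALI-style): swap a learned lighting code instead.
def relight_latent(E_int, E_light, D, img_src, img_ref):
    z_src = E_int(img_src)    # lighting-invariant intrinsics Z
    l_tgt = E_light(img_ref)  # global lighting code from a reference image
    return D(z_src, l_tgt)    # decode under the target illumination
```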
2. Mathematical Structure and Training Objectives
ALI is structured around two encoder streams and a diffusion decoder. Given an input image $I$, the feature decomposition is:
- Intrinsic features $Z = E_{\text{int}}(I)$ (albedo/geometry-like)
- Lighting code $\ell = E_{\text{light}}(I)$

The relighting function is

$$\hat{I}_{s \to t} = D\big(E_{\text{int}}(I_s),\ E_{\text{light}}(I_t)\big) = D(Z_s, \ell_t).$$
Training involves minimizing:
- Reconstruction fidelity: $\mathcal{L}_{\text{rec}} = \lVert D(Z, \ell) - I \rVert$, so that decoding an image's own intrinsics and lighting code recovers the image
- Lighting invariance: $\mathcal{L}_{\text{inv}} = \lVert Z_s - Z_t \rVert$ for two images of the same scene under different illuminations
- Hyperspherical regularization: a uniformity term $\mathcal{L}_{\text{sphere}}$ on the normalized latent codes, to enforce uniform feature coverage of the unit hypersphere
These constraints are orchestrated to anchor the intrinsic representation while ensuring relighting generalizes across lighting conditions.
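A minimal PyTorch sketch of how these terms could be combined for a training pair (two images of the same scene under different illuminations) follows; the module interfaces, the omitted loss weights, and the choice to apply the uniformity term to lighting codes are assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def ali_objectives(E_int, E_light, D, img_a, img_b, t_unif=2.0):
    """Self-supervised losses for two images of the same scene under
    different illuminations. E_int, E_light, D stand in for the intrinsic
    encoder, lighting encoder, and decoder."""
    z_a, z_b = E_int(img_a), E_int(img_b)      # intrinsic features Z
    l_a, l_b = E_light(img_a), E_light(img_b)  # global lighting codes l

    # Reconstruction fidelity: decoding an image's own (Z, l) recovers it.
    loss_rec = F.l1_loss(D(z_a, l_a), img_a) + F.l1_loss(D(z_b, l_b), img_b)

    # Lighting invariance: intrinsics should agree across illuminations.
    loss_inv = F.mse_loss(z_a, z_b)

    # Hyperspherical uniformity (Wang & Isola-style pairwise potential),
    # applied here to the normalized lighting codes (illustrative choice).
    l_all = F.normalize(torch.cat([l_a, l_b], dim=0), dim=-1)
    loss_unif = torch.pdist(l_all).pow(2).mul(-t_unif).exp().mean().log()

    return loss_rec + loss_inv + loss_unif
```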
3. Architecture and Pixel-Aligned Feature Fusion
ALI maintains a two-stream encoder architecture. The "semantic" stream is a frozen, pixel-aligned visual backbone (RADIOv2.5H or MAE), from which a hierarchy of feature maps $\{F_k\}_{k=1}^{K}$ is extracted. Each feature map is upsampled to input resolution and concatenated into a per-pixel hypercolumn

$$H(p) = \big[\,\mathrm{up}(F_1)(p);\ \dots;\ \mathrm{up}(F_K)(p)\,\big].$$

A lightweight projection module $\phi$ performs additive fusion into the original intrinsic features:

$$Z' = Z + \phi(H).$$
This mechanism injects dense semantic and photometric information directly at the pixel level, balancing the contextual coverage of the backbone against preservation of high-frequency image structure. Experiments show that while contrastive/semantic encoders (CLIP, DINO) offer at most marginal benefit to downstream performance, dense reconstructive priors (RADIO, MAE) yield significantly superior relighting accuracy on photometrically challenging surfaces (Xing et al., 1 Feb 2026).
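The fusion step admits a compact sketch; the feature dimensions, the 1×1-convolution projection head, and the module names below are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAlignedFusion(nn.Module):
    """Upsample a hierarchy of frozen-backbone feature maps to input
    resolution, concatenate them into per-pixel hypercolumns H, and
    additively fuse a projection into the intrinsics: Z' = Z + phi(H)."""

    def __init__(self, backbone_dims, intrinsic_dim):
        super().__init__()
        # Lightweight projection phi: hypercolumn -> intrinsic feature space.
        self.proj = nn.Conv2d(sum(backbone_dims), intrinsic_dim, kernel_size=1)

    def forward(self, feats, z, out_hw):
        # feats: list of (B, C_k, H_k, W_k) maps from the frozen backbone
        # z:     (B, C_z, H, W) intrinsic features from the ALI encoder
        ups = [F.interpolate(f, size=out_hw, mode="bilinear",
                             align_corners=False) for f in feats]
        hypercolumn = torch.cat(ups, dim=1)  # stack per-pixel features
        return z + self.proj(hypercolumn)    # additive fusion

# Hypothetical usage: four backbone levels fused into 128-dim intrinsics.
# fusion = PixelAlignedFusion([256, 512, 1024, 1024], intrinsic_dim=128)
# z_fused = fusion(feats, z, out_hw=(256, 256))
```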
4. Self-Supervised Refinement and Staged Training
ALI employs a three-stage training protocol, mitigating the scarcity of paired real-world relighting data:
- Stage I: Train encoder fusion (freeze the visual backbone and the decoder; learn the intrinsic encoder and the projection $\phi$)
- Stage II: Decoder alignment (freeze encoders; fine-tune diffusion decoder)
- Stage III: Self-supervised fine-tuning using a "Lighting Zoo" of synthetic pseudo-pairs sampled from batches, in which the model's own relighting serves as pseudo-ground truth. The denoising score-matching objective is used:

$$\mathcal{L}_{\text{DSM}} = \mathbb{E}_{t,\,\epsilon}\Big[\big\lVert \epsilon_\theta(x_t, t \mid Z, \ell) - \epsilon \big\rVert^2\Big],$$

where $x_t$ is the noised pseudo-target and $\epsilon_\theta$ the decoder's noise predictor. Occasional identity-relighting steps are mixed in to preserve scene content.
Key datasets include MIT MIIW (985 scenes × 25 illuminations) and BigTime (460 scenes × 20–50 illuminations).
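A schematic of one Stage III update, under explicit assumptions (an epsilon-prediction diffusion decoder with a linear DDPM schedule, a hypothetical conditioning signature `D_eps(x_t, t, z, l)`, and an assumed identity-mixing probability):

```python
import torch
import torch.nn.functional as F

T = 1000
BETAS = torch.linspace(1e-4, 0.02, T)  # illustrative linear schedule
ABAR = torch.cumprod(1.0 - BETAS, dim=0)

def add_noise(x0, eps, t):
    # Forward diffusion q(x_t | x_0) under the schedule above.
    abar_t = ABAR.to(x0.device)[t].view(-1, 1, 1, 1)
    return abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * eps

@torch.no_grad()
def sample_relit(D_eps, z, l, shape):
    # Minimal ancestral DDPM sampler conditioned on intrinsics z and
    # lighting l; produces the pseudo-ground-truth relighting.
    x = torch.randn(shape, device=z.device)
    for i in reversed(range(T)):
        t = torch.full((shape[0],), i, dtype=torch.long, device=z.device)
        eps_hat = D_eps(x, t, z, l)
        x = (x - BETAS[i] / (1 - ABAR[i]).sqrt() * eps_hat) / (1 - BETAS[i]).sqrt()
        if i > 0:
            x = x + BETAS[i].sqrt() * torch.randn_like(x)
    return x

def stage3_step(E_int, E_light, D_eps, imgs, p_identity=0.1):
    """One 'Lighting Zoo' step: swap lighting codes within the batch,
    take the model's own relighting as pseudo-ground truth, and train
    the noise predictor with denoising score matching."""
    B = imgs.size(0)
    with torch.no_grad():
        z, l = E_int(imgs), E_light(imgs)
        perm = torch.randperm(B)
        keep = torch.rand(B) < p_identity  # occasional identity relighting
        perm[keep] = torch.arange(B)[keep]
        pseudo_gt = sample_relit(D_eps, z, l[perm], imgs.shape)

    t = torch.randint(0, T, (B,), device=imgs.device)
    eps = torch.randn_like(pseudo_gt)
    x_t = add_noise(pseudo_gt, eps, t)
    # L_DSM = E_{t,eps} || eps_theta(x_t, t | Z, l) - eps ||^2
    return F.mse_loss(D_eps(x_t, t, z, l[perm]), eps)
```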
5. Empirical Results and Quantitative Analysis
ALI achieves state-of-the-art results in unsupervised relighting benchmarks, especially on scenes with non-diffuse, specular, or metallic materials—categories where semantic context and dense priors are critical. On the MIIW cross-scene benchmark:
- RMSE: 0.294 (improved over LumiNet's 0.310)
- SSIM: 0.464 (vs. LumiNet's 0.440)
In in-scene relighting:
- PSNR: 18.87
- RMSE: 0.119
- LPIPS: 0.213
- SSIM: 0.671
Material-wise breakdown indicates an approximate 6% improvement in SSIM for non-diffuse categories (metal/glass). Qualitative assessments confirm sharper specular highlights, improved caustics, and more physically consistent shadow placement than prior art (SA-AE, Latent-Intrinsics, RGB↔X, LumiNet). Ablations demonstrate:
- Minor or negative impact from high-level semantic encoders (CLIP, DINO)
- Significant performance gain from dense reconstructive priors (RADIOv2.5H, MAE), with RADIOv2.5H giving the best scores (e.g., PSNR↑18.34, SSIM↑0.596, RMSE↓0.126)
- Stage-wise training sequentially improves geometric fidelity, the quality of view-dependent specular effects, and artifact removal on in-the-wild inputs
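The reported metrics follow their standard full-reference definitions; for concreteness, a sketch using scikit-image (the paper's exact evaluation protocol, e.g., color space, resolution, or scaling, is not specified here):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def relighting_metrics(pred, gt):
    """Full-reference metrics for float images in [0, 1], shape (H, W, 3).
    LPIPS additionally requires a learned perceptual model (e.g., the
    `lpips` package) and is omitted here."""
    rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    return {"RMSE": rmse, "PSNR": psnr, "SSIM": ssim}
```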
6. Analysis of the Semantic–Photometric Trade-Off
Experimental results reveal a counter-intuitive phenomenon: increasing the strength of semantic encoder priors degrades relighting performance. Semantic encoders, optimized for invariance and abstraction (e.g., CLIP, DINO), tend to remove the very pixel-level photometric structures necessary for physically plausible relighting. Dense reconstructive backbones such as RADIO and MAE, in contrast, preserve pixel-aligned cues vital for reconstructing directional shadows, specularities, and subtle caustic phenomena. This trade-off, established through both quantitative and ablation analyses, argues against reflexive application of large semantic vision encoders for generative inverse problems involving fine-grained physics (Xing et al., 1 Feb 2026).
7. Limitations and Prospective Directions
Current ALI models are limited by reliance on learned priors rather than explicit 3D geometry. Subtle global effects—caustics, interreflections, or fine-scale albedo variation—may be blurred or misattributed under challenging conditions. Further, ALI can confuse minor albedo differences with illumination, especially with highly atypical materials. Future research avenues include:
- Integrating single-view geometry estimation into the intrinsic inference stream
- Leveraging multi-view, view-consistent data to improve physically plausible disentanglement
- Extending probing methods to other inverse graphics tasks (e.g., explicit reflectance editing, HDR relighting)
- Systematically clarifying which visual priors optimally support downstream generative tasks
ALI establishes that maximal semantic abstraction is not always compatible with photometric fidelity. Its hybrid approach—merging pixel-aligned visual priors with hierarchical latent intrinsics under self-supervised, multi-stage optimization—offers a robust template for physically grounded generative modeling, particularly in regimes characterized by view-dependent, specular, or complex materials (Xing et al., 1 Feb 2026).