- The paper presents a unified model that jointly estimates albedo and synthesizes relit frames, eliminating error accumulation from two-stage pipelines.
- It leverages video diffusion models with hybrid training on synthetic and real-world data to improve scene understanding and material rendering.
- Quantitative and qualitative results demonstrate superior realism and temporal consistency, along with greater computational efficiency, compared with existing methods.
UniRelight: A Novel Approach for Video Relighting Through Joint Intrinsic Decomposition and Synthesis
The paper "UniRelight: Learning Joint Decomposition and Synthesis for Video Relighting" introduces a method for the challenging task of video relighting, which necessitates precise scene understanding and sophisticated light synthesis. This paper addresses the limitations of current end-to-end relighting models that are hindered by a scarcity of diverse paired multi-illumination data, impeding their ability to generalize effectively. Traditional two-stage pipelines, which separate inverse and forward rendering, curtail data requirements but suffer from error accumulation and unrealistic outputs in complex lighting scenarios.
The proposed solution is a general-purpose model that estimates albedo and synthesizes relit frames in a single pass, built on video diffusion models. Producing both outputs jointly strengthens the model's scene understanding, improving albedo estimation and the depiction of intricate material interactions such as shadows and reflections.
Methodology and Data Strategy
The UniRelight framework adopts a joint modeling paradigm built on a video diffusion model. Unlike two-stage pipelines, it does not pass explicit intermediate representations between separately trained stages, avoiding the inaccuracies that accumulate when inverse and forward rendering are modeled independently. Instead, a single unified conditional generative model performs relighting and intrinsic decomposition together.
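To make the joint formulation concrete, the following is a minimal sketch of how a single denoiser could predict albedo and relit latents together, conditioned on the input video and a target-lighting embedding. The tensor layout, module shapes, and conditioning scheme are illustrative assumptions, not the authors' architecture:

```python
import torch
import torch.nn as nn

class JointRelightDenoiser(nn.Module):
    """Sketch of a joint denoiser: one forward pass predicts the noise on
    concatenated [albedo | relit] video latents, conditioned on the clean
    input-video latents and a target-lighting embedding. All layer sizes
    are placeholders, not the paper's configuration."""

    def __init__(self, latent_ch=4, cond_dim=256, hidden=128):
        super().__init__()
        # Channel-wise input: noisy albedo+relit latents plus video latents.
        self.in_conv = nn.Conv3d(3 * latent_ch, hidden, kernel_size=3, padding=1)
        self.mid_conv = nn.Conv3d(hidden, hidden, kernel_size=3, padding=1)
        self.out_conv = nn.Conv3d(hidden, 2 * latent_ch, kernel_size=3, padding=1)
        self.light_proj = nn.Linear(cond_dim, hidden)  # e.g. an HDR-map embedding
        self.act = nn.SiLU()

    def forward(self, noisy_joint, video_latent, light_emb):
        # noisy_joint:  (B, 2*C, T, H, W) noisy albedo+relit latents
        # video_latent: (B,   C, T, H, W) latents of the input video
        # light_emb:    (B, cond_dim)     target-lighting embedding
        x = torch.cat([noisy_joint, video_latent], dim=1)
        h = self.act(self.in_conv(x))
        h = h + self.light_proj(light_emb)[:, :, None, None, None]
        h = self.act(self.mid_conv(h))
        return self.out_conv(h)  # predicted noise for both outputs at once

# Shape check on random tensors.
model = JointRelightDenoiser()
noise_pred = model(
    torch.randn(1, 8, 4, 16, 16),  # noisy joint latents
    torch.randn(1, 4, 4, 16, 16),  # input video latents
    torch.randn(1, 256),           # lighting embedding
)
print(noise_pred.shape)  # torch.Size([1, 8, 4, 16, 16])
```

Because both outputs share one backbone and one sampling trajectory, there is no intermediate albedo image for downstream errors to compound on.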
Training follows a hybrid strategy that combines high-quality synthetic data with a large corpus of automatically labeled real-world videos, compensating for the scarcity of multi-illumination datasets. The synthetic data teaches the model to reproduce complex lighting effects, while the real-world data provides robustness and generalization across scene domains. The synthetic dataset covers varied scenes with randomized lighting conditions and materials drawn from large 3D-object and HDR environment-map libraries, ensuring diverse training examples.
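A hedged sketch of how such a hybrid sampling scheme might be organized; the 50/50 mixing ratio, field names, and dataset interface are assumptions for illustration, not the paper's setup:

```python
import random
from torch.utils.data import Dataset

class HybridRelightingDataset(Dataset):
    """Mixes fully supervised synthetic clips (paired multi-illumination
    renders with ground-truth albedo) and automatically labeled real-world
    clips, so each batch sees both precise lighting supervision and
    real-scene appearance statistics."""

    def __init__(self, synthetic_clips, real_clips, synthetic_ratio=0.5):
        # Each clip is assumed to be a dict with 'video', 'albedo', and
        # 'lighting' entries; real-clip albedo comes from an automatic labeler.
        self.synthetic_clips = synthetic_clips
        self.real_clips = real_clips
        self.synthetic_ratio = synthetic_ratio

    def __len__(self):
        return len(self.synthetic_clips) + len(self.real_clips)

    def __getitem__(self, idx):
        # Sample the domain first, then a clip within it.
        if random.random() < self.synthetic_ratio:
            return dict(random.choice(self.synthetic_clips), domain="synthetic")
        return dict(random.choice(self.real_clips), domain="real")
```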
Results and Comparisons
UniRelight outperforms existing methods on synthetic and real-world benchmarks, including the MIT multi-illumination dataset, in both visual fidelity and temporal consistency. It handles shadows, specular highlights, and complex material properties more faithfully, producing more realistic relighting. Quantitative evaluation with PSNR, SSIM, and LPIPS, complemented by qualitative comparisons, confirms its effectiveness across diverse settings.
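For reference, per-frame PSNR and SSIM for a relit clip can be computed as below (a sketch using scikit-image; LPIPS additionally requires a learned-feature model such as the `lpips` package, omitted here):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def clip_metrics(pred_frames, gt_frames):
    """Average PSNR/SSIM over the frames of a relit clip.
    Both inputs are float arrays in [0, 1] with shape (T, H, W, 3)."""
    psnr_vals, ssim_vals = [], []
    for pred, gt in zip(pred_frames, gt_frames):
        psnr_vals.append(peak_signal_noise_ratio(gt, pred, data_range=1.0))
        ssim_vals.append(
            structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
        )
    return float(np.mean(psnr_vals)), float(np.mean(ssim_vals))

# Example on random stand-in data for ground-truth and relit frames.
gt = np.random.rand(8, 64, 64, 3).astype(np.float32)
pred = np.clip(gt + 0.05 * np.random.randn(*gt.shape), 0.0, 1.0).astype(np.float32)
print(clip_metrics(pred, gt))
```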
Moreover, the model is computationally efficient, offering a faster alternative to traditional pipelines that require multiple rendering passes. This efficiency makes UniRelight suitable for practical applications that demand real-time or near-real-time processing.
Implications and Future Directions
The theoretical contribution of UniRelight lies in its joint learning framework, which can potentially be adapted to other image synthesis tasks beyond relighting. Practically, its applicability spans creative industries, real-time graphics applications, and fields requiring robust scene representation and rendering.
Future research could explore extensions such as integrating textual conditioning for semantic scene modifications or improving adaptability to dynamic, emissive scenes not currently addressed by the model. Furthermore, enhancing the model's resilience to biases inherent in training data and investigating its application in diverse real-world contexts remain essential for broader adoption and impact.