- The paper proposes Light-A-Video, a training-free pipeline that combines pretrained image relighting models with video diffusion models to achieve temporally consistent video relighting without additional training.
- The Consistent Light Attention module uses a dual-stream attention mechanism to integrate frame-wise and temporally averaged features, stabilizing background lighting across video frames.
- The Progressive Light Fusion method linearly blends video-consistent and relight targets during denoising, guiding the diffusion process toward the desired illumination while preserving motion consistency.
The paper addresses the challenge of video relighting: applying image relighting models frame by frame, even state-of-the-art ones such as IC-Light, produces temporal flickering and inconsistent illumination. To resolve these issues without incurring additional training costs, the authors propose a training-free pipeline that integrates pretrained image relighting capabilities with video diffusion models, thereby capitalizing on the latter's intrinsic motion priors.
The method is built around two technical contributions:
Consistent Light Attention (CLA):
- The CLA module is embedded within the self-attention layers of the image relighting network.
- Instead of processing each frame individually (which leads to sharp temporal discontinuities), the module introduces a dual-stream attention mechanism: one stream processes the original frame-wise features, preserving high-frequency details, while the second attends to temporally averaged features to suppress high-frequency flicker (a minimal sketch follows this list).
- The two streams are merged using a trade-off parameter γ, yielding an output that stabilizes the background lighting source across frames.
- This design ensures inter-frame contextualization so that the relit illumination aligns more closely with the temporal dynamics of the source video.
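A minimal PyTorch sketch of this dual-stream merge, assuming (frames, tokens, channels)-shaped features and hypothetical projection weights; it illustrates the mechanism described above rather than the authors' implementation:

```python
import torch
import torch.nn.functional as F

def consistent_light_attention(x, w_qkv, w_out, gamma=0.5):
    """Dual-stream self-attention over per-frame relighting features.

    x     : (F, N, C) tokens for F frames, N tokens per frame, C channels
    w_qkv : (C, 3C) joint query/key/value projection (hypothetical)
    w_out : (C, C) output projection (hypothetical)
    gamma : trade-off between the temporally averaged and frame-wise streams
    """
    q, k, v = (x @ w_qkv).chunk(3, dim=-1)

    # Stream 1: ordinary per-frame self-attention, preserving high-frequency detail.
    frame_out = F.scaled_dot_product_attention(q, k, v)

    # Stream 2: attention against temporally averaged keys/values, which
    # suppresses frame-to-frame flicker in the inferred lighting.
    k_avg = k.mean(dim=0, keepdim=True).expand_as(k)
    v_avg = v.mean(dim=0, keepdim=True).expand_as(v)
    avg_out = F.scaled_dot_product_attention(q, k_avg, v_avg)

    # Merge the two streams with the trade-off parameter gamma.
    return (gamma * avg_out + (1.0 - gamma) * frame_out) @ w_out
```

Here `gamma` plays the role of the trade-off parameter γ: γ = 0 recovers plain frame-wise relighting, while larger values pull the inferred background lighting toward its temporal average.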
Progressive Light Fusion (PLF):
- Leveraging the principle of light transport linearity, the method expresses an image's appearance as the product of a light transport matrix and the environment illumination, i.e., $I_L = T \cdot L$.
- Two denoising targets are computed per time step: the inherent video-consistent target (with environment illumination $L_t^v$) from the video diffusion model and the relight target (with illumination $L_t^r$) generated via CLA.
- A linear blending strategy is used to fuse these targets progressively. Specifically, at each denoising step the pixel-space appearance is defined as $I_t^p = I_t^v + \lambda_t \, (I_t^r - I_t^v)$, where $\lambda_t$ is a fusion weight that decays over the denoising steps (with $t/T_m$ being the fraction of the denoising process completed). By light transport linearity, this blend of appearances is equivalent to blending the corresponding environment illuminations, $I_t^p = T \, (L_t^v + \lambda_t (L_t^r - L_t^v))$; a minimal sketch of the fusion loop follows this list.
- This progressive injection of relighting information guides the diffusion process gradually toward the desired lighting condition while preserving motion consistency.
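A minimal sketch of the progressive fusion loop, written as a deterministic DDIM-style update in the x0 parameterization; the two `predict_*` callables stand in for the video diffusion model and the CLA-stabilized relighting branch, and every interface name here is an assumption rather than the paper's code:

```python
import torch

def progressive_light_fusion(latents, alphas_cumprod, timesteps,
                             predict_video_target, predict_relight_target,
                             lambda_max=1.0):
    """Blend the relight target into the video-consistent target while denoising.

    latents                : (F, C, H, W) noisy video latents
    alphas_cumprod         : 1-D tensor of cumulative alphas indexed by timestep
    timesteps              : descending list of integer timesteps
    predict_video_target   : (latents, t) -> clean estimate I_t^v (motion-consistent)
    predict_relight_target : (latents, t) -> clean estimate I_t^r (CLA relit)
    """
    T = len(timesteps)
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < T else torch.tensor(1.0)

        I_v = predict_video_target(latents, t)    # video-consistent target
        I_r = predict_relight_target(latents, t)  # relight target from CLA

        # Fusion weight decays over the denoising steps (i / T is the fraction
        # of the process completed).
        lam = lambda_max * (1.0 - i / T)

        # Linear appearance blend: I_t^p = I_t^v + lam * (I_t^r - I_t^v).
        I_p = I_v + lam * (I_r - I_v)

        # Deterministic DDIM-style step toward the fused target I_p.
        eps = (latents - a_t.sqrt() * I_p) / (1.0 - a_t).sqrt()
        latents = a_prev.sqrt() * I_p + (1.0 - a_prev).sqrt() * eps
    return latents
```

Under this schedule the relight target is injected most strongly in the early, noisy steps and its influence fades as denoising completes, matching the decaying $\lambda_t$ above; the paper's exact weighting may differ.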
Additional Aspects and Quantitative Findings:
- The pipeline is compatible with multiple video diffusion backbones, including both UNet-based and DiT-based architectures.
- In experiments, the method improves temporal coherence (measured by CLIP score and optical-flow-based motion preservation metrics) and achieves a lower FID (approximately 29.63) than baselines such as frame-by-frame application of the relighting model or SDEdit-based approaches.
- Moreover, the approach seamlessly supports video editing tasks like foreground relighting with synchronous background generation, thereby broadening its applicability in scenarios where both illumination and contextual consistency are required.
- The paper also provides ablation studies that quantify the contributions of both the CLA and PLF modules, demonstrating that removal of either component leads to notable degradation in both temporal consistency and image quality.
In summary, the approach leverages a zero-shot, training-free framework that combines the strengths of image relighting models with the robust temporal priors of video diffusion models to achieve coherent illumination control over video sequences. The design choices—particularly the use of dual-stream attention for cross-frame stabilization and the progressive guidance strategy during denoising—offer a technically sound mechanism for integrating pixel-level lighting adjustments with motion-consistent video generation.