- The paper proposes Light-A-Video, a training-free pipeline that combines pretrained image relighting models with video diffusion models to achieve temporally consistent video relighting without additional training.
- The Consistent Light Attention module uses a dual-stream attention mechanism to integrate frame-wise and temporally averaged features, stabilizing background lighting across video frames.
- The Progressive Light Fusion method linearly blends video-consistent and relight targets during denoising, guiding the diffusion process toward the desired illumination while preserving motion consistency.
The paper addresses the challenge of video relighting: applying image relighting models frame by frame, even state-of-the-art ones such as IC-Light, produces temporal flickering and inconsistent illumination. To resolve these issues without incurring additional training costs, the authors propose a training-free pipeline that integrates pretrained image relighting capabilities with video diffusion models, thereby capitalizing on the latter's intrinsic motion priors.
The method is built around two technical contributions:
Consistent Light Attention (CLA):
- The CLA module is embedded within the self-attention layers of the image relighting network.
- Instead of processing each frame individually (which leads to sharp temporal discontinuities), the module introduces a dual-stream attention mechanism: one stream processes the original frame-wise features, preserving high-frequency details, while the second attends to temporally averaged features to suppress high-frequency flicker (a minimal sketch follows this list).
- The two streams are merged using a trade-off parameter γ, yielding an output that stabilizes the background lighting source across frames.
- This design ensures inter-frame contextualization so that the relit illumination aligns more closely with the temporal dynamics of the source video.
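A minimal PyTorch sketch of this dual-stream merge, assuming (frames, tokens, channels)-shaped features and hypothetical projection weights; it illustrates the mechanism described above rather than the authors' implementation:

```python
import torch
import torch.nn.functional as F

def consistent_light_attention(x, w_qkv, w_out, gamma=0.5):
    """Dual-stream self-attention over per-frame relighting features.

    x     : (F, N, C) tokens for F frames, N tokens per frame, C channels
    w_qkv : (C, 3C) joint query/key/value projection (hypothetical)
    w_out : (C, C) output projection (hypothetical)
    gamma : trade-off between the temporally averaged and frame-wise streams
    """
    q, k, v = (x @ w_qkv).chunk(3, dim=-1)

    # Stream 1: ordinary per-frame self-attention, preserving high-frequency detail.
    frame_out = F.scaled_dot_product_attention(q, k, v)

    # Stream 2: attention against temporally averaged keys/values, which
    # suppresses frame-to-frame flicker in the inferred lighting.
    k_avg = k.mean(dim=0, keepdim=True).expand_as(k)
    v_avg = v.mean(dim=0, keepdim=True).expand_as(v)
    avg_out = F.scaled_dot_product_attention(q, k_avg, v_avg)

    # Merge the two streams with the trade-off parameter gamma.
    return (gamma * avg_out + (1.0 - gamma) * frame_out) @ w_out
```

Here `gamma` plays the role of the trade-off parameter γ: γ = 0 recovers plain frame-wise relighting, while larger values pull the inferred background lighting toward its temporal average.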
Progressive Light Fusion (PLF):
- Leveraging the principle of light transport linearity, the method expresses an image's appearance as the product of a light transport matrix and the environment illumination, i.e., $I_L = T \cdot L$.
- Two denoising targets are computed per time step: the inherent video-consistent target (with environment illumination $L_t^v$) from the video diffusion model and the relight target (with illumination $L_t^r$) generated via CLA.
- A linear blending strategy is used to fuse these targets progressively. Specifically, at each denoising step the pixel-space appearance is defined as $I_t^p = I_t^v + \lambda_t \, (I_t^r - I_t^v)$, where $\lambda_t$ is a fusion weight that decays over the denoising steps (with $t/T_m$ being the fraction of the denoising process completed). By light transport linearity, this blend of appearances is equivalent to blending the corresponding environment illuminations, $I_t^p = T \, (L_t^v + \lambda_t (L_t^r - L_t^v))$; a minimal sketch of the fusion loop follows this list.
- This progressive injection of relighting information guides the diffusion process gradually toward the desired lighting condition while preserving motion consistency.
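A minimal sketch of the progressive fusion loop, written as a deterministic DDIM-style update in the x0 parameterization; the two `predict_*` callables stand in for the video diffusion model and the CLA-stabilized relighting branch, and every interface name here is an assumption rather than the paper's code:

```python
import torch

def progressive_light_fusion(latents, alphas_cumprod, timesteps,
                             predict_video_target, predict_relight_target,
                             lambda_max=1.0):
    """Blend the relight target into the video-consistent target while denoising.

    latents                : (F, C, H, W) noisy video latents
    alphas_cumprod         : 1-D tensor of cumulative alphas indexed by timestep
    timesteps              : descending list of integer timesteps
    predict_video_target   : (latents, t) -> clean estimate I_t^v (motion-consistent)
    predict_relight_target : (latents, t) -> clean estimate I_t^r (CLA relit)
    """
    T = len(timesteps)
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < T else torch.tensor(1.0)

        I_v = predict_video_target(latents, t)    # video-consistent target
        I_r = predict_relight_target(latents, t)  # relight target from CLA

        # Fusion weight decays over the denoising steps (i / T is the fraction
        # of the process completed).
        lam = lambda_max * (1.0 - i / T)

        # Linear appearance blend: I_t^p = I_t^v + lam * (I_t^r - I_t^v).
        I_p = I_v + lam * (I_r - I_v)

        # Deterministic DDIM-style step toward the fused target I_p.
        eps = (latents - a_t.sqrt() * I_p) / (1.0 - a_t).sqrt()
        latents = a_prev.sqrt() * I_p + (1.0 - a_prev).sqrt() * eps
    return latents
```

Under this schedule the relight target is injected most strongly in the early, noisy steps and its influence fades as denoising completes, matching the decaying $\lambda_t$ above; the paper's exact weighting may differ.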
Additional Aspects and Quantitative Findings:
- The pipeline is compatible with multiple video diffusion backbones, including both UNet-based and DiT-based architectures.
- In experiments, the method improves temporal coherence (measured by CLIP score and optical-flow-based motion preservation metrics) and achieves a lower FID (approximately 29.63) than baselines such as frame-by-frame application of the relighting model or SDEdit-based approaches.
- Moreover, the approach seamlessly supports video editing tasks like foreground relighting with synchronous background generation, thereby broadening its applicability in scenarios where both illumination and contextual consistency are required.
- The paper also provides ablation studies that quantify the contributions of both the CLA and PLF modules, demonstrating that removal of either component leads to notable degradation in both temporal consistency and image quality.
In summary, the approach leverages a zero-shot, training-free framework that combines the strengths of image relighting models with the robust temporal priors of video diffusion models to achieve coherent illumination control over video sequences. The design choices—particularly the use of dual-stream attention for cross-frame stabilization and the progressive guidance strategy during denoising—offer a technically sound mechanism for integrating pixel-level lighting adjustments with motion-consistent video generation.