Video-to-Video Relighting Diffusion Model
- Video-to-video relighting diffusion models are diffusion-based operators that relight videos while preserving scene structure and temporal consistency.
- They leverage latent-space processing with explicit lighting controls, such as environment maps and multi-plane light images, for fine-grained illumination adjustments.
- This approach achieves high-fidelity relighting with competitive metrics (PSNR, SSIM, LPIPS) and offers significant benefits for visual effects, cinematography, and XR applications.
A video-to-video relighting diffusion model is a class of generative architectures that, given an input video sequence and a user-specified lighting configuration, produces a relit version of the video that exhibits physically plausible lighting modifications while preserving original scene structure, temporal consistency, and identity. These models employ conditional diffusion processes in latent (often VAE-compressed) video space, using direct or learned lighting cues—such as environment maps, multi-plane light representations, or even high-level prompts—to inject fine-grained illumination control throughout the generative process. Recent advances include the introduction of explicit lighting representations, hybrid modalities, synthetic relighting datasets, and specialized conditioning/injection mechanisms for cross-modal adaptability. This approach achieves high-fidelity, temporally coherent, and controllable relighting for dynamic scenes, with significant applications in visual effects, cinematography, and XR.
1. Problem Formulation and Motivation
Video-to-video relighting seeks to synthesize, from an input video $V$, an output video $\hat{V}$ such that the target sequence appears illuminated under user-specified lighting $L$, while maintaining the content, structure, and temporal coherence of the original input. Formally, the mapping is $\hat{V} = \mathcal{F}_\theta(V, L)$, where $\mathcal{F}_\theta$ is a diffusion-based operator learned from data, and $L$ may be an explicit descriptor (e.g., environment map, light configuration) or an implicit prompt (e.g., text).
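As a minimal sketch of this interface (hypothetical names, not any specific paper's API), the mapping can be expressed as follows:

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class LightingCondition:
    """Hypothetical container for the lighting descriptor L; typically only
    one field is set, depending on the model's conditioning modality."""
    env_map: Optional[np.ndarray] = None   # (He, We, 3) HDR panorama
    point_lights: Optional[list] = None    # [(position_xyz, intensity, rgb), ...]
    text_prompt: Optional[str] = None      # e.g. "soft candlelight from the left"

def relight_video(video: np.ndarray, lighting: LightingCondition) -> np.ndarray:
    """Abstract mapping V_hat = F_theta(V, L): takes a (T, H, W, 3) clip and a
    lighting descriptor, and returns a relit clip of the same shape with
    content, structure, and motion preserved."""
    raise NotImplementedError  # realized by the concrete architectures in Section 2
```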
Motivation for video-to-video relighting via diffusion:
- Frame-by-frame relighting is prone to temporal inconsistencies such as flicker and lighting drift, due to the lack of shared context across frames (Zhou et al., 12 Feb 2025).
- Precise illumination control is underconstrained in text-based or environment-map-conditioned models, limiting their use in professional or scientific workflows that demand exact light placement or color (Bian et al., 9 Nov 2025).
- Collecting large paired multi-illumination datasets for real sequences is challenging, necessitating models that generalize from hybrid synthetic/real data (He et al., 18 Jun 2025, Mei et al., 18 Mar 2025, Zeng et al., 18 Aug 2025).
- Applications: video post-production, visual effects, scene manipulation in virtual/augmented reality, dataset augmentation, and research in scene understanding.
2. Model Architectures and Conditioning Strategies
Video-to-video relighting diffusion models employ a variety of architectures, mostly based on latent video diffusion frameworks such as DiT (Diffusion Transformer), VideoUNet, or blended U-Net/Transformer hybrids (Bian et al., 9 Nov 2025, Liang et al., 30 Jan 2025, Wang et al., 28 Sep 2025). The key components are as follows:
- Latent Space Processing: Videos are encoded via a (video-specific) VAE encoder $\mathcal{E}$, mapping an input $V \in \mathbb{R}^{T \times H \times W \times 3}$ to a compact latent $z = \mathcal{E}(V)$, to enable computationally tractable global spatiotemporal operations (Bian et al., 9 Nov 2025, He et al., 18 Jun 2025).
- Diffusion Process: Noise is incrementally added to $z$ across $T$ steps (via a DDPM or continuous flow-matching SDE), followed by learned denoising using neural networks conditioned on lighting cues. A common choice is the DDPM forward process $z_t = \sqrt{\bar{\alpha}_t}\, z + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
- Lighting Condition Injection: Several strategies are prominent:
- Explicit visual cues: Multi-Plane Light Image (MPLI), a stack of 2D light images parametrized by source position, intensity, color, and mapped to discrete scene depths (Bian et al., 9 Nov 2025).
- Environment maps and HDR cues: Rotationally-aligned HDR panoramas or spherical harmonics representing physically plausible environment lighting (Lin et al., 3 Jun 2025, Liang et al., 30 Jan 2025, Mei et al., 18 Mar 2025).
- Hybrid mechanisms: Direct injection into diffusion backbones via cross-attention or linear adapters that fuse light features with latent tokens at every layer (Bian et al., 9 Nov 2025, Mei et al., 18 Mar 2025, Wang et al., 28 Sep 2025); see the cross-attention sketch after this list.
- Semantic or modal cues: Conditioning on joint sets of albedo, normals, roughness, metallic, geometric, and high-level semantic maps, as in multimodal frameworks (Xi et al., 26 Nov 2025).
- Adapter and Fine-Tuning Strategies: Lightweight adapters such as LoRA and dedicated attention-injection blocks are used for efficient fine-tuning atop large frozen backbones, mitigating catastrophic forgetting (Bian et al., 9 Nov 2025, Zeng et al., 18 Aug 2025).
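To make the injection mechanism concrete, here is a minimal PyTorch sketch of cross-attention conditioning in the spirit described above. Lighting cues (from an MPLI, environment-map, or text encoder) are assumed to be pre-encoded as a token sequence; all module names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class LightCrossAttention(nn.Module):
    """Fuse lighting tokens into video latent tokens via cross-attention:
    queries come from the latent, keys/values from the lighting encoding."""

    def __init__(self, dim: int = 512, light_dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_kv = nn.Linear(light_dim, dim)  # project light features to latent width
        self.norm = nn.LayerNorm(dim)

    def forward(self, z: torch.Tensor, light: torch.Tensor) -> torch.Tensor:
        # z:     (B, N, dim)       flattened spatiotemporal latent tokens
        # light: (B, M, light_dim) tokens from an MPLI / env-map / text encoder
        kv = self.to_kv(light)
        out, _ = self.attn(self.norm(z), kv, kv)
        return z + out  # residual injection, repeated at every backbone layer
```

The residual connection lets the frozen backbone's behavior dominate early in fine-tuning, with lighting influence growing as the adapter trains.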
A summary table of representative architectures:
| Model | Conditioning Signal | Adapter/Injection |
|---|---|---|
| RelightMaster | MPLI (multi-plane) | Light Image Adapter |
| RelightVid | Env map + text | 3D U-Net, cross-attn |
| UniRelight | RGB + HDR lighting | DiT, cross-modal attn |
| CtrlVDiff | Intrinsics, semantics | Multimodal 3D-UNet |
| Lumen | Text + masked source | DiT + LoRA adapter |
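The adapter column above refers to lightweight fine-tuning schemes such as LoRA. The following is a minimal, generic sketch of the idea, not any specific paper's implementation; rank and scale values are hypothetical.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: freeze a pretrained linear layer and learn a
    low-rank residual (alpha/r) * up(down(x)), so only ~2*r*d weights train."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # frozen backbone weights
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)         # start as an identity edit
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))
```

Wrapping only the attention projections of a frozen video backbone in this way adapts it to relighting while mitigating catastrophic forgetting of its generative prior.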
3. Lighting Representations and Control
Effective relighting in video sequences requires not only physically meaningful representations of lighting but also tractable mechanisms for dynamic, precise, and possibly multi-source control.
- Multi-Plane Light Image (MPLI): In RelightMaster (Bian et al., 9 Nov 2025), MPLI models lighting as a stack of 2D images, each at a different depth plane, encoding irradiance by linearly superposing inverse-square-law contributions from all light sources. Because light superposes linearly, depth-aligned MPLIs generalize to unseen or multi-source setups; a numerical sketch follows this list.
- Environment Maps and HDR Encodings: Many systems use spherical panoramic HDR images (e.g., per-frame 32×32 angular grids) as lighting descriptors, which capture spatially varying illumination and can be derived from synthetic scenes or real light-probe captures (Lin et al., 3 Jun 2025, Mei et al., 18 Mar 2025).
- Textual and Hybrid Cues: Some models accept textual prompts for high-level lighting (“soft candlelight,” “backlit neon”), encoded via CLIP or similar models, and fuse these with visual lighting cues for nuanced control (Fang et al., 27 Jan 2025, Liang et al., 30 Jan 2025, Xi et al., 26 Nov 2025).
- Semantic/Intrinsic Decomposition: Models such as CtrlVDiff (Xi et al., 26 Nov 2025) and DiffusionRenderer (Liang et al., 30 Jan 2025) explicitly decompose each frame into intrinsic layers (albedo, normals, roughness, segmentation, etc.). Relighting is then cast as re-rendering over these fixed intrinsics under new lighting, enabling physically inspired control and editability.
- Decoupled/Modular Approaches: ReLumix (Wang et al., 28 Sep 2025) enables artist-driven or external relighting for a reference frame, then propagates that style to the entire sequence via video diffusion, decoupling relighting from temporal coherence.
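As a concrete illustration of the MPLI bullet above, the following numerical sketch builds a multi-plane light stack under simplifying assumptions (isotropic point lights, pure inverse-square falloff, a pinhole-style plane parameterization). All names and defaults are hypothetical rather than taken from RelightMaster.

```python
import numpy as np

def mpli_stack(lights, plane_depths, hw=(64, 64), fov_scale=1.0):
    """Build a Multi-Plane Light Image: one RGB irradiance map per depth plane.

    lights: list of (position_xyz, intensity, rgb_color) tuples
    plane_depths: 1D array of plane distances along the camera z-axis
    Returns (num_planes, H, W, 3); contributions from multiple sources add
    linearly, mirroring the superposition property of light.
    """
    H, W = hw
    # Pixel grid in normalized camera coordinates, shared across planes.
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    stack = np.zeros((len(plane_depths), H, W, 3))
    for k, depth in enumerate(plane_depths):
        # 3D position of every pixel on this depth plane.
        pts = np.stack([xs * depth * fov_scale, ys * depth * fov_scale,
                        np.full_like(xs, depth)], axis=-1)
        for pos, intensity, color in lights:
            d2 = np.sum((pts - np.asarray(pos)) ** 2, axis=-1)  # squared distance
            falloff = intensity / np.maximum(d2, 1e-6)          # inverse-square law
            stack[k] += falloff[..., None] * np.asarray(color)  # linear superposition
    return stack

# Example: one warm light left of the camera, one cool light behind the subject.
planes = mpli_stack(
    lights=[((-1.0, 0.0, 2.0), 5.0, (1.0, 0.8, 0.6)),
            ((0.5, 0.5, 6.0), 8.0, (0.6, 0.7, 1.0))],
    plane_depths=np.array([1.0, 2.0, 4.0, 8.0]),
)
```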
4. Training Data, Datasets, and Loss Functions
Paired video relighting data is extremely limited, so most state-of-the-art models employ hybrid datasets and novel data simulation or augmentation pipelines.
- Synthetic Data Generation: Unreal Engine or custom offline path tracers are used to create controlled datasets (e.g., RelightVideo (Bian et al., 9 Nov 2025), LightAtlas (Fang et al., 27 Jan 2025)), in which dynamic content is rendered under systematically varied light positions, colors, intensities, and environment maps.
- Hybrid Datasets: Models such as UniRelight (He et al., 18 Jun 2025) and Lux Post Facto (Mei et al., 18 Mar 2025) combine synthetic controlled-light videos with real-world “auto-labeled” sequences. Intrinsic properties or coarse lighting are inferred using pre-trained inverse rendering networks.
- Evaluation Metrics: Quantitative evaluation typically includes PSNR, SSIM, and LPIPS for per-frame quality; temporal metrics such as motion smoothness, flow consistency, and tPSNR; and user/rater studies for perceptual realism and consistency (Fang et al., 27 Jan 2025, He et al., 18 Jun 2025, Mei et al., 18 Mar 2025).
- Loss Formulations: The standard denoising score-matching loss in latent space predominates; a perceptual loss in VGG feature space is optionally added. Some models employ cycle-consistency or specialized temporal losses to reinforce coherence across frames (Zeng et al., 18 Aug 2025, Wang et al., 3 Apr 2025).
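For reference, here is a minimal sketch of the predominant latent-space denoising objective, assuming a generic epsilon-prediction denoiser and a VAE exposing an `encode` method; the interfaces are hypothetical.

```python
import torch
import torch.nn.functional as F

def denoising_loss(model, vae, video, light_cond, alphas_cumprod):
    """Latent-space score-matching loss: noise the VAE latent of the clean
    video, then regress the injected noise conditioned on the lighting."""
    with torch.no_grad():
        z0 = vae.encode(video)                   # (B, C, T', H', W') latent
    B = z0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=z0.device)
    noise = torch.randn_like(z0)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1, 1)
    z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * noise  # DDPM forward process
    pred = model(z_t, t, light_cond)             # lighting-conditioned denoiser
    return F.mse_loss(pred, noise)               # epsilon-prediction objective
```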
5. Temporal Consistency and Cross-Frame Dynamics
Enforcing temporal stability is central to video-to-video relighting diffusion models.
- Spatiotemporal Attention: 3D self-attention or convolutional kernels across time and space dimensions allow each frame to attend dynamically to its temporal neighbors, mitigating flicker (Bian et al., 9 Nov 2025, He et al., 18 Jun 2025).
- Ensemble- and Flow-Augmentation: The Illumination-Invariant Ensemble (IIE) of RelightVid (Fang et al., 27 Jan 2025) regularizes per-frame predictions across brightness augmentations. Optical-flow-guided warping stabilizes latent feature blending and aligns motion between frames (Wang et al., 3 Apr 2025, Jüttner et al., 27 Oct 2025); see the warping sketch after this list.
- Adapter/Layer Freezing: Only temporal or cross-attention layers are fine-tuned atop frozen spatial backbones, leveraging strong priors while permitting adaptation to multi-frame dynamics (Bian et al., 9 Nov 2025, Fang et al., 27 Jan 2025).
- Hybrid and Modular Systems: “Edit-propagate” frameworks (ReLumix (Wang et al., 28 Sep 2025)) enforce global consistency by anchoring relighting to an explicit boundary frame and leveraging spatiotemporal U-Nets for style propagation.
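The flow-guided warping referenced above can be sketched as follows, assuming a dense backward optical flow field is available (e.g., from an off-the-shelf estimator such as RAFT); this is a generic PyTorch implementation, not any specific paper's code.

```python
import torch
import torch.nn.functional as F

def warp_by_flow(prev: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp `prev` (B, C, H, W) toward the current frame using backward
    optical flow `flow` (B, 2, H, W), given in pixels. Used both to blend
    latent features across frames and to score temporal consistency."""
    B, _, H, W = prev.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=prev.device),
                            torch.arange(W, device=prev.device), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float() + flow   # per-pixel sample locations
    # Normalize to [-1, 1] for grid_sample's coordinate convention.
    gx = 2.0 * grid[:, 0] / (W - 1) - 1.0
    gy = 2.0 * grid[:, 1] / (H - 1) - 1.0
    norm_grid = torch.stack([gx, gy], dim=-1)            # (B, H, W, 2)
    return F.grid_sample(prev, norm_grid, align_corners=True)
```

The same operation doubles as an evaluation tool: warped PSNR compares `warp_by_flow(prev_frame, flow)` against the current frame, with occluded pixels masked out.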
6. Experimental Results and Comparative Analysis
Major models demonstrate substantial improvements in lighting control, consistency, and fidelity relative to prior techniques:
- RelightMaster (Bian et al., 9 Nov 2025): Enables precise, physically plausible relighting across light trajectories, with content preservation and shadow/highlight fidelity surpassing text+envmap baselines such as Light-A-Video.
- RelightVid (Fang et al., 27 Jan 2025): Achieves background-conditioned PSNR = 18.79 (vs. 18.26 for per-frame IC-Light) and SSIM = 0.7832, with improved motion-smoothness scores. The text-conditioned setting yields the best CLIP alignment and user preference.
- IllumiCraft (Lin et al., 3 Jun 2025): Outperforms Light-A-Video and IC-Light baselines on FVD (–45%, –37%), LPIPS, PSNR, and temporal consistency, especially under challenging text-guided or background-guided relighting.
- CtrlVDiff (Xi et al., 26 Nov 2025): Surpasses DiffusionRenderer in aesthetic and imaging quality on VBench under relighting settings (Aesthetic 58.7 vs 52.4, Imaging 71.6 vs 70.9).
- Lumen (Zeng et al., 18 Aug 2025): Reports foreground region PSNR up to 23.06 and LPIPS as low as 0.065 on unpaired realistic videos, with strong subjective scores on preservation and lighting harmonization.
- Qualitative Outcomes: All leading models show natural shadows, specular highlights, color casts, robust response to moving and multi-source lights, and preservation of background content.
7. Limitations, Current Challenges, and Future Directions
While significant progress has been made, limitations persist:
- Domain Gap: Current models trained on synthetic data or auto-labeled real videos still exhibit reduced realism and control on real-world complex scenes, particularly with unusual textures/materials or occlusions (Bian et al., 9 Nov 2025, He et al., 18 Jun 2025).
- Temporal Granularity: Some models process overlapping windows or have a minimum granularity of four frames, potentially over-smoothing lighting changes faster than that window (Bian et al., 9 Nov 2025, Mei et al., 18 Mar 2025).
- Resolution and Inference Speed: Scalability to 4K or real-time remains limited by memory and processing time, especially for transformer-based or hybrid physically-motivated approaches (Lin et al., 3 Jun 2025, Zeng et al., 18 Aug 2025, Jüttner et al., 27 Oct 2025).
- Hybrid/3D Integration: Physically-based relighting combined with neural priors (e.g., combining diffusion-based intrinsics with mesh proxy/shadow rendering (Jüttner et al., 27 Oct 2025)) represents a promising direction for production- and XR-grade relighting.
- Extensions: Active research includes multi-modal and user-controllable interfaces (interactive virtual light placement), incorporation of richer lighting and geometry sources (e.g., multi-view light stage data), and learned joint depth/MPI representations for non-static scenes (Bian et al., 9 Nov 2025, Xi et al., 26 Nov 2025).
- Artistic and Scientific Editing: Modular, decoupled approaches (ReLumix (Wang et al., 28 Sep 2025)) further generalize the paradigm, enabling frame-level artistic relighting and its temporally coherent propagation.
In summary, video-to-video relighting diffusion models constitute a foundational advance for video editing, animation, and post-production, merging generative video priors, precise and interpretable lighting control, and robust attention to temporal and physical coherence (Bian et al., 9 Nov 2025, Fang et al., 27 Jan 2025, Lin et al., 3 Jun 2025, Mei et al., 18 Mar 2025, He et al., 18 Jun 2025, Liang et al., 30 Jan 2025, Xi et al., 26 Nov 2025, Zeng et al., 18 Aug 2025, Guo et al., 27 Feb 2025, Zhou et al., 12 Feb 2025, Jüttner et al., 27 Oct 2025, Wang et al., 28 Sep 2025, Wang et al., 3 Apr 2025).