DreamLight: Diffusion-Based Relighting Model
- DreamLight is a diffusion-based image relighting model that unifies foreground relighting and compositing using both image and text inputs.
- It employs a Position-Guided Light Adapter to transfer directional lighting cues and a Spectral Foreground Fixer to preserve high-frequency details.
- On a 600-pair Arnold-rendered test set, it reports PSNR 22.15, SSIM 0.783, and LPIPS 0.158, each better than the best prior method.
DreamLight is a diffusion-based image relighting model designed for seamless, universal, and contextually harmonious compositing of foreground subjects into new backgrounds. Its architecture supports both image-based and text-based relighting, aiming for consistent photometric realism and aesthetic unification of the foreground and background, addressing limitations in prior harmonization and relighting pipelines. DreamLight introduces the Position-Guided Light Adapter (PGLA) for explicit modeling of background light directionality and the Spectral Foreground Fixer (SFF) for post-processing adaptive frequency alignment of subject and context, and leverages semantic priors from a pretrained diffusion model to facilitate plausible results across diverse real and synthetic images (Liu et al., 17 Jun 2025).
1. Unified Problem Formulation and Input Encoding
DreamLight approaches image relighting as a universal compositing task. The inputs are:
- A foreground portrait,
- Either a background image or a text prompt describing the desired background/lighting (in the text case, the background is set to a null/black image).
Foreground segmentation is performed via an automated model (e.g., RMBG-1.4) to obtain a binary mask; the original background is then masked out, leaving only the foreground subject.
Latent codes for the masked foreground and the background are produced using a VAE encoder (compatible with Stable Diffusion v1.5). These are concatenated with a random noise latent to form the input to the denoising network. Both image and text-based conditions are passed to the central diffusion U-Net, providing a unified representation for relighting control (Liu et al., 17 Jun 2025).
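The input assembly described above can be sketched in NumPy. This is a minimal illustration, not DreamLight's code: the function names are hypothetical, random arrays stand in for the VAE encoder output, and concatenation along the channel axis (with noise first) is an assumption about how the latents are joined. The only fixed fact used is that Stable Diffusion v1.5 latents have 4 channels at 1/8 of the image resolution.

```python
import numpy as np

def mask_foreground(image, mask):
    """Zero out background pixels. image: (H, W, 3) float, mask: (H, W) in {0, 1}."""
    return image * mask[..., None]

def background_condition(shape, background=None):
    """Return the given background, or an all-black image for the text-only case."""
    if background is None:
        return np.zeros(shape, dtype=np.float32)
    return background

def build_unet_input(fg_latent, bg_latent, rng):
    """Concatenate noise, foreground, and background latents on the channel axis.

    The channel-wise layout and ordering are assumptions made for this sketch.
    """
    noise = rng.standard_normal(fg_latent.shape).astype(fg_latent.dtype)
    return np.concatenate([noise, fg_latent, bg_latent], axis=0)

rng = np.random.default_rng(0)
image = rng.random((512, 512, 3)).astype(np.float32)
mask = np.zeros((512, 512), dtype=np.float32)
mask[128:384, 128:384] = 1.0          # hypothetical subject region

fg = mask_foreground(image, mask)
bg = background_condition(image.shape)  # text-driven case: black background

# Stand-ins for VAE encoding: 4 x 64 x 64 latents for 512 x 512 inputs.
fg_latent = rng.standard_normal((4, 64, 64)).astype(np.float32)
bg_latent = np.zeros((4, 64, 64), dtype=np.float32)

unet_input = build_unet_input(fg_latent, bg_latent, rng)
print(unet_input.shape)
```

In a real pipeline the two stand-in latents would come from encoding `fg` and `bg` with the VAE, and the U-Net's first convolution would need to accept the widened channel count.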
2. Position-Guided Light Adapter (PGLA)
PGLA is the mechanism by which DreamLight condenses spatial light information from the background and imposes it on the foreground region, allowing for directionally-aware lighting harmonization.
- Low-frequency enhancement: Background features are extracted with a CLIP-based visual encoder, then spectrally filtered with a frequency cutoff to emphasize low-frequency (global illumination) components. The filtering applies an FFT, a Gaussian filter, and an inverse FFT, and the filtered result is added back to the original CLIP features as a residual.
- Directional light query construction: Four query sets are defined, one per direction, each with its own learnable embeddings. Attention bias masks, decaying along axis-opposing directions, are used with softmax cross-attention to pool directional lighting cues. The results of all four query sets are concatenated into a single set of light queries.
- Foreground injection: In the U-Net’s bottleneck and decoder stages, the standard cross-attention key/value sets are augmented by the concatenated light queries, and attention is restricted to foreground pixels using the foreground mask to prevent background cues from leaking.
This approach allows explicit, learnable modulation of the foreground’s lighting based on contextual cues from the background, critical for achieving scene-consistent relighting (Liu et al., 17 Jun 2025).
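The first two PGLA stages can be sketched as follows. This is a hedged NumPy approximation under several assumptions that do not come from the source: the direction names, the `sigma` and `strength` values, the linear shape of the bias decay, and all function names are illustrative, and a random array stands in for the CLIP feature grid. The masked injection into the U-Net is not covered.

```python
import numpy as np

def low_frequency_enhance(feat, sigma=0.25):
    """Gaussian low-pass in the frequency domain, added back as a residual.

    feat: (H, W, C) feature grid. sigma is in cycles per sample (assumed value).
    """
    h, w, _ = feat.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    gauss = np.exp(-(fx ** 2 + fy ** 2) / (2 * sigma ** 2))
    spec = np.fft.fft2(feat, axes=(0, 1))
    low = np.fft.ifft2(spec * gauss[..., None], axes=(0, 1)).real
    return feat + low

def direction_bias(h, w, direction, strength=4.0):
    """Additive attention bias: 0 at the named edge, -strength at the opposite edge."""
    ys, xs = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    dist = {"left": xs, "right": 1 - xs, "top": ys, "bottom": 1 - ys}[direction]
    return (-strength * dist).reshape(-1)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def directional_light_queries(feat, queries, strength=4.0):
    """Pool features with one biased cross-attention per direction, then concatenate.

    queries: dict mapping direction name -> (Q, C) embeddings (learnable in practice).
    """
    h, w, c = feat.shape
    tokens = feat.reshape(h * w, c)
    pooled = []
    for direction, q in queries.items():
        logits = q @ tokens.T / np.sqrt(c) + direction_bias(h, w, direction, strength)
        pooled.append(softmax(logits) @ tokens)
    return np.concatenate(pooled, axis=0)

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 16, 32))            # stand-in for CLIP features
queries = {d: rng.standard_normal((4, 32)) for d in ("left", "right", "top", "bottom")}
light_queries = directional_light_queries(low_frequency_enhance(feat), queries)
print(light_queries.shape)
```

Because the bias is added to the logits before the softmax, each query set still attends over the whole background but weights positions near its own edge more heavily.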
3. Spectral Foreground Fixer (SFF)
After core relighting, SFF operates as a localized, wavelet-based frequency rebalancer for the subject region:
- A one-level Haar wavelet transform decomposes the original foreground and the preliminary relit output into low-frequency (LQ) and high-frequency (HQ) bands.
- A modulation network takes bands from the original foreground and from the preliminary relit output and predicts per-pixel weights, which are used to compute the fixed high-frequency components.
- The final foreground is recomposed by the inverse wavelet transform and placed into the foreground region of the preliminary relit output, yielding the final image.
SFF thus selectively preserves high-frequency facial details while harmonizing low-frequency tone with the relighted background, improving perceptual and aesthetic foreground-background unity (Liu et al., 17 Jun 2025).
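A one-level Haar decomposition and a per-pixel recombination of bands can be sketched as below. The transform itself is the standard orthonormal Haar transform. Everything about the recombination is an assumption for illustration: the weight map is passed in directly instead of being predicted by a modulation network, the convex blend of high bands is a guess at the form of the modulation, and taking the low band from the relit image follows the stated goal of keeping the relit tone. The sketch handles a single-channel image.

```python
import numpy as np

def haar_dwt2(x):
    """One-level orthonormal 2D Haar transform of an (H, W) array with even H and W."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, (lh, hl, hh)

def haar_idwt2(ll, high):
    """Inverse of haar_dwt2."""
    lh, hl, hh = high
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

def spectral_fix(original, relit, weights):
    """Keep the relit low band; blend high bands per pixel (assumed blend form).

    weights: (H/2, W/2) map in [0, 1]; 1 keeps the original detail, 0 keeps the relit detail.
    """
    _, high_orig = haar_dwt2(original)
    low_relit, high_relit = haar_dwt2(relit)
    fixed = tuple(weights * ho + (1 - weights) * hr
                  for ho, hr in zip(high_orig, high_relit))
    return haar_idwt2(low_relit, fixed)

rng = np.random.default_rng(0)
original = rng.random((64, 64))
relit = rng.random((64, 64))
weights = np.ones((32, 32))          # hypothetical: fully trust original detail
fixed_fg = spectral_fix(original, relit, weights)
print(fixed_fg.shape)
```

With all weights at 1 the result carries the original image's fine detail on top of the relit image's coarse tone; with all weights at 0 it reproduces the relit image exactly.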
4. Diffusion Model Architecture and Training
DreamLight uses a Stable Diffusion v1.5 backbone (U-Net + VAE) as the generative prior. Text prompts are encoded via Stable Diffusion’s text encoder and, together with PGLA-derived light queries, condition the denoising U-Net at each timestep.
- Each training step optimizes the denoising objective of the diffusion process, plus a multi-term spectral fixer loss made up of an output MSE term, a VGG-based perceptual loss, and supervision on the HQ band.
- Total training loss: the combination of the diffusion loss and the spectral fixer loss.
- Training data: 600k LoRA-generated pairs, 150k 3D Arnold/OLAT pairs, and 300k IC-Light synthetic pairs, trained with batch size 512 (Liu et al., 17 Jun 2025).
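For orientation, the diffusion term is presumably the standard noise-prediction objective used by latent diffusion models such as Stable Diffusion v1.5. The form below is that textbook objective written with assumed notation: z_t is the noisy latent at timestep t, and c stands for all conditioning (text embedding, light queries, and the concatenated foreground and background latents). The paper's own symbols and any loss weights are not reproduced here.

```latex
\mathcal{L}_{\text{diff}}
  = \mathbb{E}_{z_0,\, c,\, \epsilon \sim \mathcal{N}(0, I),\, t}
    \left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t, c) \right\rVert_2^2 \right]
```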
5. Evaluation and Comparative Analysis
On a 600-pair Arnold-rendered test set, DreamLight outperforms the best previous harmonization and relighting methods:
- PSNR: 22.15 vs. best prior 20.66
- SSIM: 0.783 vs. 0.771
- LPIPS: 0.158 vs. 0.177
- CLIP-IS: 0.908 vs. 0.896
For text-based relighting, DreamLight also achieves higher CLIP similarity (0.644), aesthetic score (6.32), and ImageReward score (3.47) on the test set. In user studies, at least 80% of participants preferred DreamLight outputs to those of IC-Light and harmonization baselines (Liu et al., 17 Jun 2025).
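To give the PSNR figures some scale, the sketch below defines PSNR for images in [0, 1] and converts the reported gap into an error ratio. The helper names are illustrative, and the conversion treats each reported score as if it came from a single mean squared error, which is only an approximation when scores are averaged per image.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for arrays scaled to [0, max_value]."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

def mse_from_psnr(psnr_db, max_value=1.0):
    """Invert the PSNR definition to recover the implied mean squared error."""
    return max_value ** 2 / 10 ** (psnr_db / 10)

gain_db = 22.15 - 20.66
mse_ratio = mse_from_psnr(22.15) / mse_from_psnr(20.66)
print(f"gain: {gain_db:.2f} dB, implied MSE ratio: {mse_ratio:.2f}")
```

Under that approximation, the 1.49 dB gain corresponds to an implied mean squared error of roughly 0.71 times that of the best prior method.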
6. Limitations and Extensions
DreamLight assumes accurate foreground-background segmentation and typically works with 512×512 images due to model and training scale. Its explicit frequency-modeling approach is tailored to foreground-face compositing and may not trivially generalize to full-body or multi-object relighting.
The PGLA’s directional attention mechanism is currently limited to four cardinal axes. A plausible extension is to use denser or adaptive, context-sensitive directional queries. SFF currently operates only on the foreground, which suggests extending it to border/blending regions for fine-grained consistency. Incorporating temporal consistency, learned 3D geometry priors, or object-specific adaptation remains a direction for future research (Liu et al., 17 Jun 2025).
7. Usage Workflow
Operational deployment follows these steps:
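- Run segmentation to extract the foreground mask and the masked foreground.
- Encode latents for the masked foreground and the background (a black image in the text case).
- Sample a noise latent and concatenate it with the encoded latents to form the U-Net input.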
- Forward through U-Net with PGLA: extract background light features, form directional queries, inject via masked cross-attention.
- Denoise via diffusion steps to obtain the preliminary relit output.
- Apply SFF to foreground region, recombine.
- Composite onto the original (or new) background.
This end-to-end procedure enables text- or image-driven relighting for single portraits, supporting consistent photometric integration and controllable harmonization in diverse settings (Liu et al., 17 Jun 2025).
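The last step of the workflow, compositing the fixed foreground onto a background, reduces to a mask-weighted blend. The sketch below shows that step alone, assuming float images in [0, 1] and a binary or soft mask; the function name is illustrative and this is not taken from the DreamLight code.

```python
import numpy as np

def composite(foreground, background, mask):
    """Blend foreground over background. Images: (H, W, 3); mask: (H, W) in [0, 1]."""
    alpha = np.clip(mask, 0.0, 1.0)[..., None]
    return alpha * foreground + (1.0 - alpha) * background

foreground = np.full((4, 4, 3), 0.8)
background = np.full((4, 4, 3), 0.2)
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                 # hypothetical subject region

result = composite(foreground, background, mask)
print(result[0, 0], result[1, 1])
```

A soft (non-binary) mask gives feathered edges with the same function, which is one simple way to reduce visible seams at the subject boundary.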