DreamLight: Diffusion-Based Relighting Model

Updated 12 March 2026
  • DreamLight is a diffusion-based image relighting model that unifies foreground relighting and compositing using both image and text inputs.
  • It employs a Position-Guided Light Adapter to transfer directional lighting cues and a Spectral Foreground Fixer to preserve high-frequency details.
  • The system demonstrates superior photometric performance, with improvements in PSNR (22.15), SSIM (0.783), and LPIPS (0.158) over previous methods.

DreamLight is a diffusion-based image relighting model designed for seamless, universal, and contextually harmonious compositing of foreground subjects into new backgrounds. Its architecture supports both image-based and text-based relighting and targets photometric realism and aesthetic unity between foreground and background, addressing limitations of prior harmonization and relighting pipelines. DreamLight introduces the Position-Guided Light Adapter (PGLA), which explicitly models the direction of background light, and the Spectral Foreground Fixer (SFF), a post-processing module that adaptively aligns the frequency content of the subject with its context. It also leverages semantic priors from a pretrained diffusion model to produce plausible results across diverse real and synthetic images (Liu et al., 17 Jun 2025).

1. Unified Problem Formulation and Input Encoding

DreamLight approaches image relighting as a universal compositing task. The inputs are:

  • A foreground portrait, $I_{\text{fg}} \in \mathbb{R}^{H\times W\times 3}$
  • Either a background image $I_{\text{bg}}$ or a text prompt $p$ describing the desired background/lighting (setting $I_{\text{bg}}$ to a null/black image for text cases).

Foreground segmentation is performed via an automated model (e.g., RMBG-1.4) to obtain a binary mask $M_{\text{fg}}$; the original background is masked out, forming $I_{\text{fg}}^m(x) = I_{\text{fg}}(x)\cdot M_{\text{fg}}(x)$.
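The masking step can be illustrated with a minimal NumPy sketch. It assumes the mask has already been produced by a segmentation model; the function name `mask_foreground` and the toy values are hypothetical and not taken from the DreamLight code.

```python
import numpy as np

def mask_foreground(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Return I_fg^m = I_fg * M_fg, zeroing every pixel outside the mask.

    image: float array of shape (H, W, 3)
    mask:  binary array of shape (H, W), 1 = foreground, 0 = background
    """
    if image.shape[:2] != mask.shape:
        raise ValueError("image and mask must share the same height and width")
    # Add a trailing channel axis so the mask broadcasts across RGB.
    return image * mask[..., None].astype(image.dtype)

# Toy 2x2 image with a diagonal foreground mask (illustrative values only).
image = np.ones((2, 2, 3), dtype=np.float32)
mask = np.array([[1, 0], [0, 1]])
masked = mask_foreground(image, mask)
```

Pixels where the mask is 0 become black, which matches the convention of feeding a black image when no background is supplied.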

Latent codes for $I_{\text{fg}}^m$ and $I_{\text{bg}}$ are produced using a VAE encoder (compatible with Stable Diffusion v1.5), yielding $z_{\text{fg}}$ and $z_{\text{bg}}$. These are concatenated with a random noise latent, resulting in $Z_0 = \text{concat}(z_{\text{noise}}, z_{\text{fg}}, z_{\text{bg}})$. Both image and text-based conditions are passed to the central diffusion U-Net, providing a unified representation for relighting control (Liu et al., 17 Jun 2025).
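The shapes involved in forming $Z_0$ can be sketched as follows. This is an illustration under assumptions: `toy_encode` is a hypothetical stand-in that only mimics the 4-channel, 8x-downsampled latent shape of the Stable Diffusion v1.5 VAE, and concatenation along the channel axis is assumed, since the text only says the latents are concatenated.

```python
import numpy as np

LATENT_CHANNELS = 4   # latent channels of the Stable Diffusion v1.5 VAE
DOWNSAMPLE = 8        # spatial reduction factor of the Stable Diffusion v1.5 VAE

def toy_encode(image: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the VAE encoder (shapes only, not a real VAE).

    Average-pools an (H, W, 3) image by 8x and pads RGB to 4 channels.
    """
    h, w, c = image.shape
    pooled = image.reshape(
        h // DOWNSAMPLE, DOWNSAMPLE, w // DOWNSAMPLE, DOWNSAMPLE, c
    ).mean(axis=(1, 3))
    chw = pooled.transpose(2, 0, 1)               # (3, H/8, W/8)
    extra = chw.mean(axis=0, keepdims=True)       # fake fourth channel
    return np.concatenate([chw, extra], axis=0)   # (4, H/8, W/8)

def build_input_latent(fg_masked, bg, rng):
    """Form Z_0 = concat(z_noise, z_fg, z_bg) along the channel axis (assumed)."""
    z_fg = toy_encode(fg_masked)
    z_bg = toy_encode(bg)
    z_noise = rng.standard_normal(z_fg.shape).astype(np.float32)
    return np.concatenate([z_noise, z_fg, z_bg], axis=0)

rng = np.random.default_rng(0)
fg_masked = rng.random((512, 512, 3), dtype=np.float32)
bg_black = np.zeros((512, 512, 3), dtype=np.float32)  # text-driven case: black background
z0 = build_input_latent(fg_masked, bg_black, rng)
print(z0.shape)  # (12, 64, 64)
```

In the text-driven case the background slot is still present in the input, so the U-Net sees the same tensor layout for both conditioning modes.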

2. Position-Guided Light Adapter (PGLA)

PGLA is the mechanism by which DreamLight condenses spatial light information from the background and imposes it on the foreground region, allowing for directionally-aware lighting harmonization.

  • Low-frequency enhancement: Background features are processed through a CLIP-based visual encoder, then spectrally filtered with a frequency cutoff ($\sigma$) to emphasize low-frequency (global illumination) components via FFT, Gaussian filtering, and IFFT with residual addition:

$$f_{\text{bl}} = \text{IFFT}(\text{ReLU}(\text{Conv}(F_{\text{lf}}))) + f_b$$

where $f_b$ are the CLIP features and $F_{\text{lf}}$ is their frequency-filtered spectrum.

  • Directional light query construction: Four query sets are defined for directions $\{\text{left}, \text{right}, \text{top}, \text{down}\}$, each with learnable embeddings $f_Q^d$. Attention bias masks $M^d$, decaying along axis-opposing directions, are used with softmax cross-attention to pool directional lighting cues:

$$A^d = \text{softmax}\left((Q^d K_b^\top/\sqrt{C}) + M^d\right), \quad f_L^d = A^d V_b$$

All query results are concatenated as $f_L$.

  • Foreground injection: In the U-Net’s bottleneck and decoder stages, the standard cross-attention key/value sets are augmented with $f_L$, and attention is restricted to foreground pixels using $M_{\text{fg}}$ to prevent leakage of background cues.
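The low-frequency enhancement step in the list above can be sketched in NumPy. This is a simplified illustration, not the paper's module: it assumes the CLIP features have been reshaped to a (C, H, W) grid, uses a fixed Gaussian low-pass with width `sigma`, and omits the learned Conv and ReLU layers that appear in the formula.

```python
import numpy as np

def gaussian_lowpass_residual(features: np.ndarray, sigma: float) -> np.ndarray:
    """Return IFFT(G_sigma * FFT(f)) + f for a real (C, H, W) feature map.

    Simplification: the learned Conv/ReLU on the filtered spectrum is omitted.
    """
    _, h, w = features.shape
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies, cycles per sample
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies, cycles per sample
    # Gaussian weight: 1 at zero frequency, decaying towards high frequencies.
    gauss = np.exp(-(fx ** 2 + fy ** 2) / (2.0 * sigma ** 2))
    spectrum = np.fft.fft2(features, axes=(-2, -1))
    low = np.fft.ifft2(spectrum * gauss, axes=(-2, -1)).real
    return low + features             # residual addition

# A constant map is pure low frequency, so it passes the filter unchanged.
flat = np.ones((1, 8, 8))
flat_out = gaussian_lowpass_residual(flat, sigma=0.05)

# A checkerboard is pure high frequency, so only the residual survives.
checker = (np.indices((8, 8)).sum(axis=0) % 2 * 2.0 - 1.0)[None]
checker_out = gaussian_lowpass_residual(checker, sigma=0.05)
```

With a small `sigma`, smooth illumination-like content is doubled by the residual path while fine texture is left roughly as it was, which is the intended emphasis on global lighting.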

This approach allows explicit, learnable modulation of the foreground’s lighting based on contextual cues from the background, critical for achieving scene-consistent relighting (Liu et al., 17 Jun 2025).
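A minimal NumPy sketch of the directional pooling $A^d = \text{softmax}(Q^d K_b^\top/\sqrt{C} + M^d)$, $f_L^d = A^d V_b$ is given below. The linear form of the bias in `direction_bias`, its `strength` parameter, and the zero-valued queries are assumptions made for illustration; in the model the queries are learned embeddings and the exact decay of $M^d$ is not specified here.

```python
import numpy as np

DIRECTIONS = ("left", "right", "top", "down")

def direction_bias(h, w, direction, strength=4.0):
    """Hypothetical additive bias M^d over an h x w token grid.

    The bias is 0 at the named edge and falls linearly to -strength at the
    opposite edge, so tokens near the named edge receive more attention.
    """
    ys, xs = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    dist = {"left": xs, "right": 1 - xs, "top": ys, "down": 1 - ys}[direction]
    return (-strength * dist).reshape(-1)          # (h*w,)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def directional_light_features(queries, keys, values, h, w):
    """queries: dict direction -> (Nq, C); keys, values: (h*w, C).

    Returns f_L, the four pooled results concatenated, shape (4*Nq, C).
    """
    c = keys.shape[-1]
    pooled = []
    for d in DIRECTIONS:
        logits = queries[d] @ keys.T / np.sqrt(c) + direction_bias(h, w, d)
        pooled.append(softmax(logits) @ values)
    return np.concatenate(pooled, axis=0)

h, w, c, nq = 4, 4, 8, 2
rng = np.random.default_rng(0)
keys = rng.standard_normal((h * w, c))
values = np.zeros((h * w, c))
values[:, 0] = np.tile(np.linspace(0, 1, w), h)    # channel 0 stores each token's x position
queries = {d: np.zeros((nq, c)) for d in DIRECTIONS}  # zero queries isolate the bias effect
f_L = directional_light_features(queries, keys, values, h, w)
```

With zero queries the attention weights depend only on the bias, so the "left" rows of `f_L` pool mostly from tokens with small x and the "right" rows from tokens with large x.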

3. Spectral Foreground Fixer (SFF)

After core relighting, SFF operates as a localized, wavelet-based frequency rebalancer for the subject region:

  • A one-level Haar wavelet transform decomposes $I_{\text{fg}}$ and the preliminary relit output $I_{\text{out}}$ into low-frequency (LQ) and high-frequency (HQ) bands.
  • A modulation network $\mathcal{M}$ takes $HQ_{\text{in}}$ (from $I_{\text{fg}}$) and $LQ_{\text{out}}$ (from $I_{\text{out}}$) and predicts per-pixel weights $\alpha, \beta$; the fixed high-frequency components are then

$$HQ' = \alpha \odot HQ_{\text{in}} + \beta$$

  • The final foreground $I'_{\text{fg}}$ is recomposed by the inverse wavelet transform and pasted into the foreground region of $I_{\text{out}}$, yielding $I'_{\text{out}}$.

SFF thus selectively preserves high-frequency facial details while harmonizing low-frequency tone with the relit background, improving perceptual and aesthetic foreground-background unity (Liu et al., 17 Jun 2025).
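The band-swapping idea behind SFF can be sketched with a hand-written one-level Haar transform in NumPy. This is an illustration under assumptions: images are single-channel with even height and width, and `alpha` and `beta` are passed in as fixed values, whereas in DreamLight they are predicted per pixel by the modulation network. The function names are hypothetical.

```python
import numpy as np

def haar_dwt2(x):
    """One-level orthonormal 2D Haar transform of an (H, W) array, H and W even.

    Returns the low band and a tuple of three high bands.
    """
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    low = (a + b + c + d) / 2
    d_col = (a - b + c - d) / 2    # difference across columns
    d_row = (a + b - c - d) / 2    # difference across rows
    d_diag = (a - b - c + d) / 2   # diagonal difference
    return low, (d_col, d_row, d_diag)

def haar_idwt2(low, highs):
    """Inverse of haar_dwt2."""
    d_col, d_row, d_diag = highs
    h, w = low.shape
    x = np.empty((2 * h, 2 * w), dtype=low.dtype)
    x[0::2, 0::2] = (low + d_col + d_row + d_diag) / 2
    x[0::2, 1::2] = (low - d_col + d_row - d_diag) / 2
    x[1::2, 0::2] = (low + d_col - d_row - d_diag) / 2
    x[1::2, 1::2] = (low - d_col - d_row + d_diag) / 2
    return x

def spectral_fix(fg_in, relit_out, alpha, beta):
    """Keep LQ from the relit output and use HQ' = alpha * HQ_in + beta."""
    _, hq_in = haar_dwt2(fg_in)
    lq_out, _ = haar_dwt2(relit_out)
    hq_fixed = tuple(alpha * band + beta for band in hq_in)
    return haar_idwt2(lq_out, hq_fixed)

rng = np.random.default_rng(0)
fg_in = rng.random((8, 8))
relit_out = 0.5 * fg_in + 0.1      # toy "relit" image: darker, with an offset
fixed = spectral_fix(fg_in, relit_out, alpha=1.0, beta=0.0)
```

With `alpha=1` and `beta=0`, the result carries the tone of the relit image in its low band and the original detail in its high bands; a learned modulation would adjust that detail per pixel instead of copying it.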

4. Diffusion Model Architecture and Training

DreamLight uses a Stable Diffusion v1.5 backbone (U-Net + VAE) as the generative prior. Text prompts $p$ are encoded via Stable Diffusion’s text encoder and, together with PGLA-derived light queries, condition the denoising U-Net at each timestep.

  • Each training step optimizes:

$$L_{\text{diff}} = \mathbb{E}_{t, z_0, \epsilon} \left[ \| \epsilon_\theta(z_t, p, I_{\text{bg}}, I_{\text{fg}}, t) - \epsilon \|_2^2 \right]$$

for the diffusion process, plus a multi-term spectral fixer loss:

$$L_{\text{sff}} = \lambda_1 L_{\text{mse}} + \lambda_2 L_{\text{perc}} + \lambda_3 L_{HQ}$$

where $L_{\text{mse}}$ is the output MSE, $L_{\text{perc}}$ is a VGG-based perceptual loss, and $L_{HQ}$ is the HQ band supervision.

  • Total training loss: $L_{\text{total}} = L_{\text{diff}} + \mu\,L_{\text{sff}}$
  • Training data: 600k LoRA-generated pairs, 150k 3D Arnold/OLAT pairs, 300k IC-Light synthetic pairs, with batch size 512, learning rate $5\times10^{-5}$ (Liu et al., 17 Jun 2025).
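How the loss terms above combine can be sketched in a few lines of NumPy. The weights $\lambda_1, \lambda_2, \lambda_3$ and $\mu$ are not given in this summary, so the defaults of 1.0 are placeholders. The noising function assumes the standard DDPM forward process, and the mean squared error used here equals the squared L2 norm up to a constant factor.

```python
import numpy as np

def add_noise(z0, eps, alpha_bar_t):
    """Standard DDPM forward process (assumed): z_t = sqrt(a) z_0 + sqrt(1 - a) eps."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps

def diffusion_loss(eps_pred, eps):
    """Noise-prediction loss L_diff as a mean squared error."""
    return float(np.mean((eps_pred - eps) ** 2))

def sff_loss(l_mse, l_perc, l_hq, lambdas=(1.0, 1.0, 1.0)):
    """L_sff = lambda_1 L_mse + lambda_2 L_perc + lambda_3 L_HQ (placeholder weights)."""
    l1, l2, l3 = lambdas
    return l1 * l_mse + l2 * l_perc + l3 * l_hq

def total_loss(l_diff, l_sff, mu=1.0):
    """L_total = L_diff + mu * L_sff (placeholder mu)."""
    return l_diff + mu * l_sff

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 64, 64))
eps = rng.standard_normal(z0.shape)
z_t = add_noise(z0, eps, alpha_bar_t=0.5)

# A predictor that returns zeros stands in for the U-Net here.
l_diff = diffusion_loss(np.zeros_like(eps), eps)
l_total = total_loss(l_diff, sff_loss(0.1, 0.2, 0.3), mu=2.0)
```

The scalar inputs to `sff_loss` are arbitrary example values; in training each would be computed from the SFF output and its ground truth.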

5. Evaluation and Comparative Analysis

On a 600-pair Arnold-rendered test set, DreamLight outperforms previous harmonization and relighting methods:

  • PSNR: 22.15 vs. best prior 20.66
  • SSIM: 0.783 vs. 0.771
  • LPIPS: 0.158 vs. 0.177
  • CLIP-IS: 0.908 vs. 0.896

For text-based relighting, DreamLight also delivers superior CLIP similarity (0.644), aesthetic (6.32), and ImageReward (3.47) scores on the test set. In user studies, ≥80% of participants preferred DreamLight outputs to those of IC-Light and harmonization baselines (Liu et al., 17 Jun 2025).
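For readers less familiar with the metric, PSNR can be computed as below. This is the standard definition, not code from the paper, and the last line rests on an assumption: if both reported scores were derived from a single pooled MSE (rather than averaged per image), the 1.49 dB gap between 22.15 and 20.66 would correspond to an MSE ratio of roughly 0.71.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

# A uniform error of 0.1 on a [0, 1] image gives MSE = 0.01, i.e. about 20 dB.
reference = np.zeros((4, 4))
estimate = np.full((4, 4), 0.1)
score = psnr(reference, estimate)

# MSE ratio implied by a PSNR gap, under the pooled-MSE assumption stated above.
mse_ratio = 10 ** (-(22.15 - 20.66) / 10)
```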

6. Limitations and Extensions

DreamLight assumes accurate foreground-background segmentation and typically works with 512×512 images due to model and training scale. Its explicit frequency-modeling approach is tailored to foreground-face compositing and may not trivially generalize to full-body or multi-object relighting.

The PGLA’s directional attention mechanism is currently limited to four cardinal directions. A plausible extension is to use denser or adaptive, context-sensitive directional queries. SFF currently operates only on the foreground, which suggests extending it to border and blending regions for fine-grained consistency. Temporal consistency, learned 3D geometry priors, and object-specific adaptation remain directions for future research (Liu et al., 17 Jun 2025).

7. Usage Workflow

Operational deployment follows these steps:

  1. Run segmentation to extract $M_{\text{fg}}$ and form the masked foreground $I_{\text{fg}}^m$.
  2. Encode latents: $z_{\text{fg}} = \mathcal{E}(I_{\text{fg}}^m)$, $z_{\text{bg}} = \mathcal{E}(I_{\text{bg}})$ (using a black image as $I_{\text{bg}}$ in the text case).
  3. Sample noise and form $Z_0$.
  4. Forward through the U-Net with PGLA: extract background light features, form directional queries, inject via masked cross-attention.
  5. Denoise via diffusion steps to obtain $I_{\text{out}}$.
  6. Apply SFF to the foreground region and recombine.
  7. Composite $I'_{\text{out}}$ onto the original (or new) background.

This end-to-end procedure enables text- or image-driven relighting for single portraits, supporting consistent photometric integration and controllable harmonization in diverse settings (Liu et al., 17 Jun 2025).
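The final compositing step can be sketched as a mask-weighted blend. The formula below is the usual alpha-over operation and is assumed rather than taken from the paper; the function name `composite` is hypothetical, and a soft (non-binary) mask works the same way.

```python
import numpy as np

def composite(foreground, background, mask):
    """Blend two (H, W, 3) images: out = M * foreground + (1 - M) * background.

    mask: (H, W) array with values in [0, 1]; 1 selects the foreground.
    """
    m = mask[..., None].astype(np.float32)   # broadcast the mask over RGB
    return m * foreground + (1.0 - m) * background

# Toy example: white relit subject over a black background, left column is subject.
relit_fg = np.ones((2, 2, 3), dtype=np.float32)
new_bg = np.zeros((2, 2, 3), dtype=np.float32)
mask = np.array([[1.0, 0.0], [1.0, 0.0]])
result = composite(relit_fg, new_bg, mask)
```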
