
SynthLight: Portrait Relighting with Diffusion Model by Learning to Re-render Synthetic Faces (2501.09756v1)

Published 16 Jan 2025 in cs.CV and cs.GR

Abstract: We introduce SynthLight, a diffusion model for portrait relighting. Our approach frames image relighting as a re-rendering problem, where pixels are transformed in response to changes in environmental lighting conditions. Using a physically-based rendering engine, we synthesize a dataset to simulate this lighting-conditioned transformation with 3D head assets under varying lighting. We propose two training and inference strategies to bridge the gap between the synthetic and real image domains: (1) multi-task training that takes advantage of real human portraits without lighting labels; (2) an inference time diffusion sampling procedure based on classifier-free guidance that leverages the input portrait to better preserve details. Our method generalizes to diverse real photographs and produces realistic illumination effects, including specular highlights and cast shadows, while preserving the subject's identity. Our quantitative experiments on Light Stage data demonstrate results comparable to state-of-the-art relighting methods. Our qualitative results on in-the-wild images showcase rich and unprecedented illumination effects. Project Page: \url{https://vrroom.github.io/synthlight/}

Summary

  • The paper introduces SynthLight, a diffusion-based model that re-renders synthetic portraits under varied HDR lighting without explicit inverse rendering.
  • It employs a multi-task training strategy combining a large synthetic dataset with real-world images to balance identity preservation and lighting effects.
  • Evaluations using metrics like SSIM, PSNR, and user studies demonstrate superior lighting realism and detail retention compared to existing methods.

The paper introduces SynthLight, a portrait relighting diffusion model that learns to re-render synthetic faces under varying lighting conditions. The core idea involves training a diffusion model to perform pixel transformations conditioned on lighting, effectively bypassing explicit inverse rendering.

To achieve this, the authors render a synthetic dataset of 3D human heads with diverse appearances and expressions in Blender, using the Cycles renderer. The dataset contains roughly 1.26 million images rendered at $512 \times 512$ resolution. Each sample is rendered with 10 random high dynamic range (HDR) environment maps, each rotated 36 times.
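The rendering scripts are not part of this summary, but the environment-map rotation loop described above could look roughly like the following Blender Python (bpy) sketch. The asset paths, output naming, and the simplified world-shader setup are assumptions for illustration only.

```python
import math
import bpy

# Hypothetical paths; the actual asset and HDR map locations are not given in the paper.
HDR_PATH = "/data/hdri/example_env.exr"
OUT_DIR = "/data/renders"

scene = bpy.context.scene
scene.render.engine = 'CYCLES'
scene.render.resolution_x = 512
scene.render.resolution_y = 512

# Light the scene with the HDR environment map via the world shader.
world = bpy.data.worlds.new("EnvWorld")
world.use_nodes = True
scene.world = world
nodes, links = world.node_tree.nodes, world.node_tree.links

env = nodes.new("ShaderNodeTexEnvironment")
env.image = bpy.data.images.load(HDR_PATH)
mapping = nodes.new("ShaderNodeMapping")
coords = nodes.new("ShaderNodeTexCoord")
links.new(coords.outputs["Generated"], mapping.inputs["Vector"])
links.new(mapping.outputs["Vector"], env.inputs["Vector"])
links.new(env.outputs["Color"], nodes["Background"].inputs["Color"])

# Render the same head under 36 azimuthal rotations of the environment map.
for i in range(36):
    mapping.inputs["Rotation"].default_value[2] = math.radians(i * 10.0)
    scene.render.filepath = f"{OUT_DIR}/rot_{i:02d}.png"
    bpy.ops.render.render(write_still=True)
```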

The method builds upon Stable Diffusion, incorporating the input portrait $I$ and target environment map $E$ into the network's input. The training samples consist of tuples $(I, E, T, I_R)$, where $T$ is a text prompt obtained via image captioning and $I_R$ is the target relit portrait. The HDR environment map is converted to a low dynamic range (LDR) representation via tone-mapping. The input portrait, LDR environment map, and target portrait are encoded using Stable Diffusion's VAE, resulting in latent representations $\hat{I_i}$, $\hat{E^{LDR}_j}$, and $\hat{I_j}$, respectively. Gaussian noise $\epsilon$ is added to the relit-image latent $\hat{I_j}$ at timestep $t$ to obtain a noised latent $\hat{I^t_j}$. The U-Net $\epsilon_\theta$ is trained with the DDPM objective:

$$\min_{\theta} \; \mathbb{E}_{x = \mathrm{Enc}(I_R),\, t,\, \epsilon \sim \mathcal{N}(0, I)} \left\| \epsilon_\theta(x_t, I, E, T) - \epsilon \right\|$$

where:

  • $\theta$ are the parameters of the U-Net
  • $\mathrm{Enc}$ is the encoder of Stable Diffusion's VAE
  • $I_R$ is the relit portrait
  • $x_t$ is the noised latent representation of the relit image at time $t$
  • $I$ is the input portrait
  • $E$ is the environment map
  • $T$ is the text prompt
  • $\epsilon$ is Gaussian noise
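A minimal PyTorch-style sketch of one such training step is shown below. It is an illustration under a generic latent-diffusion setup: the `vae`, `unet`, and `noise_scheduler` objects follow diffusers-style interfaces, and conditioning by channel concatenation of the latents is an assumption, not the released implementation.

```python
import torch
import torch.nn.functional as F

def training_step(vae, unet, noise_scheduler, batch):
    """One DDPM training step on an (input, env map, caption, relit target) tuple.

    Assumes `unet` has been modified to accept the input-portrait and
    environment-map latents concatenated to the noisy latent along the channel
    dimension, and that `batch["caption_emb"]` holds precomputed text embeddings.
    """
    with torch.no_grad():
        # 0.18215 is the standard Stable Diffusion latent scaling factor.
        lat_in  = vae.encode(batch["input_portrait"]).latent_dist.sample() * 0.18215
        lat_env = vae.encode(batch["env_map_ldr"]).latent_dist.sample() * 0.18215
        lat_tgt = vae.encode(batch["relit_portrait"]).latent_dist.sample() * 0.18215

    # Sample a timestep and noise the target (relit) latent.
    noise = torch.randn_like(lat_tgt)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (lat_tgt.shape[0],), device=lat_tgt.device)
    noisy = noise_scheduler.add_noise(lat_tgt, noise, t)

    # Condition on the input portrait and environment map latents.
    model_in = torch.cat([noisy, lat_in, lat_env], dim=1)
    pred = unet(model_in, t, encoder_hidden_states=batch["caption_emb"]).sample

    # DDPM objective: predict the added noise.
    return F.mse_loss(pred, noise)
```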

To bridge the domain gap between synthetic and real images, the authors employ a multi-task training strategy, incorporating real-world portrait images from the LAION dataset without ground truth relighting information. During training, the sampling ratios of the synthetic dataset versus the real dataset are empirically set to 0.7 and 0.3, respectively.
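A sketch of how such a mixed sampling schedule might be implemented is given below. The 0.7/0.3 split comes from the paper; the loader objects and the treatment of real images as a generic auxiliary objective without lighting labels are assumptions.

```python
import random

SYNTH_RATIO = 0.7  # empirically chosen split: 0.7 synthetic / 0.3 real

def sample_training_batch(synthetic_loader, real_loader):
    """Draw the next batch from either dataset according to the multi-task split.

    How the real-image branch is supervised is not spelled out in this summary;
    here it is labeled as a generic auxiliary task (an assumption).
    """
    if random.random() < SYNTH_RATIO:
        return next(synthetic_loader), "relighting"   # (I, E, T, I_R) tuples
    return next(real_loader), "auxiliary"             # real portraits, no lighting labels
```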

During inference, an adaptation scheme balances identity preservation and relighting strength. At each diffusion sampling step, the score estimate is composed from conditional and unconditional predictions of the U-Net. The final score estimate is computed as:

$$\epsilon_t = \epsilon_\theta(x_{t+1}, \phi, E, \phi) + \lambda_T \left(\epsilon_\theta(x_{t+1}, I, E, T) - \epsilon_\theta(x_{t+1}, I, E, \phi)\right) + \lambda_I \left(\epsilon_\theta(x_{t+1}, I, E, \phi) - \epsilon_\theta(x_{t+1}, \phi, E, \phi)\right)$$

where:

  • $\epsilon_t$ is the final score estimate
  • $\epsilon_\theta$ is the U-Net
  • $x_{t+1}$ is the latent at timestep $t+1$
  • $\phi$ is a null input (i.e., dropping the input image or text prompt)
  • $I$ is the input image
  • $E$ is the environment map
  • $T$ is the text prompt
  • $\lambda_T$ is the text prompt guidance parameter
  • $\lambda_I$ is the input portrait guidance parameter

The guidance parameters $\lambda_T$ and $\lambda_I$ control the influence of the text prompt and input portrait, respectively. The authors empirically find that setting $\lambda_I \in [2, 3]$ achieves a balance between detail preservation and effective relighting.
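To illustrate the guidance composition, the sketch below writes a single denoising step against a hypothetical `unet` callable; the keyword-argument interface and the null-condition placeholders are assumptions, and the default $\lambda_T$ value is only a conventional text-guidance scale, not a number from the paper.

```python
def guided_noise_estimate(unet, x_next, portrait, env_map, text_emb,
                          null_portrait, null_text_emb, t,
                          lambda_T=7.5, lambda_I=2.5):
    """Compose the final score from conditional and unconditional predictions.

    `null_portrait` / `null_text_emb` stand in for the dropped (phi) conditions;
    lambda_I in [2, 3] is the range the authors report as a good balance.
    """
    # eps(x, phi, E, phi): neither input portrait nor text prompt.
    eps_uncond = unet(x_next, portrait=null_portrait, env=env_map,
                      text=null_text_emb, timestep=t)
    # eps(x, I, E, phi): input portrait only.
    eps_img = unet(x_next, portrait=portrait, env=env_map,
                   text=null_text_emb, timestep=t)
    # eps(x, I, E, T): input portrait and text prompt.
    eps_full = unet(x_next, portrait=portrait, env=env_map,
                    text=text_emb, timestep=t)

    return (eps_uncond
            + lambda_T * (eps_full - eps_img)
            + lambda_I * (eps_img - eps_uncond))
```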

The method is evaluated quantitatively on a test set of synthetic faces and a Light Stage dataset, using metrics such as SSIM, PSNR and LPIPS, as well as FaceNet distance for identity preservation. Qualitative results on in-the-wild images showcase detailed illumination effects, including cast shadows and specular highlights.
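For reference, the standard image metrics listed above can be computed with scikit-image and the `lpips` package, as in the generic sketch below; the FaceNet identity distance would require an additional face-embedding model and is omitted.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_model = lpips.LPIPS(net="alex")  # perceptual metric; lower is better

def image_metrics(pred: np.ndarray, target: np.ndarray) -> dict:
    """Compute SSIM, PSNR, and LPIPS between two HxWx3 uint8 images."""
    ssim = structural_similarity(pred, target, channel_axis=-1)
    psnr = peak_signal_noise_ratio(target, pred)

    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    with torch.no_grad():
        lp = lpips_model(to_tensor(pred), to_tensor(target)).item()

    return {"SSIM": ssim, "PSNR": psnr, "LPIPS": lp}
```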

A user study was conducted to assess perceptual lighting accuracy, identity preservation, and overall image quality, comparing SynthLight to competing methods. The results indicate that SynthLight is preferred across all evaluated aspects.

Ablation studies demonstrate the contributions of the multi-task training and the inference-time adaptation, showing that both components help the model retain rich facial priors.

The paper acknowledges limitations including a lack of diversity in camera poses, facial accessories, and clothing textures in the synthetic training data.
