- The paper introduces SynthLight, a diffusion-based portrait relighting model trained to re-render synthetic faces under varied HDR environment-map lighting, sidestepping explicit inverse rendering.
- It employs a multi-task training strategy combining a large synthetic dataset with real-world images to balance identity preservation and lighting effects.
- Evaluations using metrics like SSIM, PSNR, and user studies demonstrate superior lighting realism and detail retention compared to existing methods.
The paper introduces SynthLight, a portrait relighting diffusion model that learns to re-render synthetic faces under varying lighting conditions. The core idea involves training a diffusion model to perform pixel transformations conditioned on lighting, effectively bypassing explicit inverse rendering.
To achieve this, the authors render a synthetic dataset of 3D human heads with diverse appearances and expressions in Blender, using the Cycles renderer. The dataset contains roughly 1.26 million images rendered at 512×512 resolution. Each sample is rendered with 10 random high dynamic range (HDR) environment maps, each rotated 36 times.
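If each rendered subject configuration is paired with 10 HDR maps, each rotated 36 times, then 1.26 million images correspond to roughly 3,500 configurations. A minimal sketch of how such a render queue could be enumerated, with placeholder identifiers and an assumed 10° rotation increment (only the 10-HDRIs × 36-rotations structure comes from the paper):

```python
import itertools
import random

# Placeholder identifiers; only the 10-HDRIs x 36-rotations structure is from the paper.
# 1.26M images / (10 * 36) implies roughly 3,500 rendered subject configurations.
SUBJECTS = [f"head_{i:04d}" for i in range(3500)]
HDRI_POOL = [f"env_{i:03d}.hdr" for i in range(500)]      # hypothetical HDRI library
ROTATIONS_DEG = [k * 10 for k in range(36)]               # 36 rotations, assumed 10 degrees apart

def render_jobs(rng=None):
    """Yield one render job per (subject, HDRI, rotation) combination."""
    rng = rng or random.Random(0)
    for subject in SUBJECTS:
        chosen = rng.sample(HDRI_POOL, 10)                # 10 random environment maps per subject
        for hdri, angle in itertools.product(chosen, ROTATIONS_DEG):
            yield {"subject": subject, "hdri": hdri, "rotation_deg": angle}
```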
The method builds upon Stable Diffusion, incorporating the input portrait I and target environment map E into the network's input. The training samples consist of tuples (I, E, T, I_R), where T is a text prompt obtained via image captioning and I_R is the target relit portrait. The HDR environment map is converted to a low dynamic range (LDR) representation via tone-mapping. The LDR environment map, the input portrait, and the target portrait are encoded with Stable Diffusion's VAE, yielding latent representations $\hat{I}_i$, $\hat{E}^{\mathrm{LDR}}_j$, and $\hat{I}_j$, respectively. Gaussian noise $\epsilon$ is added to the relit-image latent $\hat{I}_j$ at timestep $t$ to obtain the noised latent $\hat{I}^t_j$. The U-Net $\epsilon_\theta$ is trained with the DDPM objective:
$$\min_\theta \; \mathbb{E}_{x \sim \mathrm{Enc}(I_R),\, t,\, \epsilon \sim \mathcal{N}(0, I)} \big\| \epsilon_\theta(x_t, I, E, T) - \epsilon \big\|$$
where:
- θ are the parameters of the U-Net
- Enc is the encoder of Stable Diffusion's VAE
- IR is the relit portrait
- xt is the noised latent representation of the relit image at time t
- I is the input portrait
- E is the environment map
- T is the text prompt
- ϵ is Gaussian noise
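A minimal PyTorch-style sketch of one training step under these definitions, using diffusers-style `vae`, `unet`, `scheduler`, and `text_encoder` objects; the Reinhard tone-mapping operator, the channel-wise concatenation of the conditioning latents, and all helper names are assumptions rather than the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def tonemap(hdr):
    # Reinhard-style tone mapping to bring the HDR environment map into LDR range (assumed operator).
    return hdr / (1.0 + hdr)

def training_step(vae, unet, scheduler, text_encoder, batch):
    # batch: input portrait I, HDR environment map E, caption tokens for T, relit target I_R
    I, E_hdr, T_tokens, I_R = batch["input"], batch["env_hdr"], batch["tokens"], batch["relit"]

    with torch.no_grad():
        z_I  = vae.encode(I).latent_dist.sample() * 0.18215               # latent of input portrait
        z_E  = vae.encode(tonemap(E_hdr)).latent_dist.sample() * 0.18215  # latent of LDR env map
        z_IR = vae.encode(I_R).latent_dist.sample() * 0.18215             # latent of relit target
        text = text_encoder(T_tokens)[0]                                  # text embedding of prompt T

    # DDPM forward process: noise the relit latent at a random timestep t.
    t = torch.randint(0, scheduler.config.num_train_timesteps, (z_IR.shape[0],), device=z_IR.device)
    eps = torch.randn_like(z_IR)
    x_t = scheduler.add_noise(z_IR, eps, t)

    # Condition the U-Net on the noised latent plus the input-portrait and env-map latents
    # (channel-wise concatenation is one plausible conditioning scheme, not necessarily the authors').
    eps_pred = unet(torch.cat([x_t, z_I, z_E], dim=1), t, encoder_hidden_states=text).sample

    # DDPM objective: || eps_theta(x_t, I, E, T) - eps ||
    return F.mse_loss(eps_pred, eps)
```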
To bridge the domain gap between synthetic and real images, the authors employ a multi-task training strategy, incorporating real-world portrait images from the LAION dataset without ground truth relighting information. During training, the sampling ratios of the synthetic dataset versus the real dataset are empirically set to 0.7 and 0.3, respectively.
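The paper states only the 0.7/0.3 sampling ratios; one simple way this mixing could be realized during batch construction is sketched below (the dataset wrappers, task labels, and per-example random draw are assumptions):

```python
import random

def sample_batch(synthetic_ds, real_ds, batch_size, p_synthetic=0.7, rng=None):
    """Draw each training example from the synthetic or real pool with the stated 0.7/0.3 ratio.

    Synthetic samples carry a relit ground-truth target; real LAION samples do not, so they
    are tagged for the auxiliary task used by the multi-task strategy (details per the paper).
    """
    rng = rng or random.Random(0)
    batch = []
    for _ in range(batch_size):
        if rng.random() < p_synthetic:
            batch.append(("relight", synthetic_ds[rng.randrange(len(synthetic_ds))]))
        else:
            batch.append(("real_no_gt", real_ds[rng.randrange(len(real_ds))]))
    return batch
```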
During inference, an adaptation scheme balances identity preservation against relighting strength. At each diffusion step, the final noise estimate is composed, in classifier-free-guidance fashion, from unconditional, image-conditional, and image-and-text-conditional U-Net outputs. The final score estimate is computed as:
$$\epsilon_t = \epsilon_\theta(x_{t+1}, \phi, E, \phi) + \lambda_T \big( \epsilon_\theta(x_{t+1}, I, E, T) - \epsilon_\theta(x_{t+1}, I, E, \phi) \big) + \lambda_I \big( \epsilon_\theta(x_{t+1}, I, E, \phi) - \epsilon_\theta(x_{t+1}, \phi, E, \phi) \big)$$
where:
- ϵt is the final score estimate
- ϵθ is the U-Net
- xt+1 is the latent at timestep t+1
- ϕ is a null input (i.e., dropping the input image or text prompt)
- I is the input image
- E is the environment map
- T is the text prompt
- λT is the text prompt guidance parameter
- λI is the input portrait guidance parameter
The guidance parameters λT and λI control the influence of the text prompt and input portrait, respectively. The authors empirically find that setting λI∈[2,3] achieves a balance between detail preservation and effective relighting.
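A minimal sketch of this compound guidance at a single sampling step, reusing the channel-concatenation conditioning assumed in the training sketch above; the null-conditioning placeholders and the default λT value are assumptions, while λI is set inside the [2, 3] range reported by the authors:

```python
import torch

def guided_noise_estimate(unet, x, t, z_I, z_E, text_emb, null_img, null_text,
                          lam_T=7.5, lam_I=2.5):
    """Compose the final noise estimate from unconditional, image-conditional, and
    image+text-conditional U-Net outputs, following the guidance equation above.
    lam_I in [2, 3] trades detail preservation against relighting strength;
    lam_T = 7.5 is a typical text-guidance default, assumed here."""
    # epsilon_theta(x, phi, E, phi): image and text dropped, environment map kept.
    eps_uncond   = unet(torch.cat([x, null_img, z_E], dim=1), t,
                        encoder_hidden_states=null_text).sample
    # epsilon_theta(x, I, E, phi): image kept, text dropped.
    eps_img      = unet(torch.cat([x, z_I, z_E], dim=1), t,
                        encoder_hidden_states=null_text).sample
    # epsilon_theta(x, I, E, T): image and text kept.
    eps_img_text = unet(torch.cat([x, z_I, z_E], dim=1), t,
                        encoder_hidden_states=text_emb).sample

    return eps_uncond + lam_T * (eps_img_text - eps_img) + lam_I * (eps_img - eps_uncond)
```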
The method is evaluated quantitatively on a test set of synthetic faces and a Light Stage dataset, using metrics such as SSIM, PSNR and LPIPS, as well as FaceNet distance for identity preservation. Qualitative results on in-the-wild images showcase detailed illumination effects, including cast shadows and specular highlights.
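A hedged sketch of how such per-image metrics might be computed with common open-source tooling (scikit-image, the `lpips` package, and `facenet-pytorch`); this illustrates the metric definitions rather than the authors' evaluation code, and the input normalization details are assumptions:

```python
import torch
import lpips                                    # pip install lpips
from skimage.metrics import structural_similarity, peak_signal_noise_ratio
from facenet_pytorch import InceptionResnetV1   # pip install facenet-pytorch

lpips_fn = lpips.LPIPS(net="alex")                          # perceptual distance network
facenet = InceptionResnetV1(pretrained="vggface2").eval()   # face-embedding network

def relighting_metrics(pred, target):
    """pred/target: float32 HxWx3 arrays in [0, 1]; aligned face crops are assumed."""
    ssim = structural_similarity(pred, target, channel_axis=-1, data_range=1.0)
    psnr = peak_signal_noise_ratio(target, pred, data_range=1.0)

    # LPIPS and FaceNet expect NCHW tensors; both are fed values rescaled to roughly [-1, 1]
    # (an approximation of the standardization each model was trained with).
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0).float() * 2 - 1
    with torch.no_grad():
        lp = lpips_fn(to_t(pred), to_t(target)).item()
        emb_pred, emb_tgt = facenet(to_t(pred)), facenet(to_t(target))
        face_dist = torch.norm(emb_pred - emb_tgt, dim=1).item()   # identity-preservation distance

    return {"ssim": ssim, "psnr": psnr, "lpips": lp, "facenet_dist": face_dist}
```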
A user study was conducted to assess perceptual lighting accuracy, identity preservation, and overall image quality, comparing SynthLight to other methods. The results indicate that SynthLight is preferred across all evaluated aspects.
Ablation studies demonstrate the contributions of the multi-task training and inference-time adaptation techniques, showing that both help the model retain rich facial priors.
The paper acknowledges limitations including a lack of diversity in camera poses, facial accessories, and clothing textures in the synthetic training data.