- The paper introduces SynthLight, a diffusion-based portrait relighting model trained to re-render synthetic faces under varied HDR environment-map lighting, sidestepping explicit inverse rendering.
- It employs a multi-task training strategy combining a large synthetic dataset with real-world images to balance identity preservation and lighting effects.
- Evaluations using metrics like SSIM, PSNR, and user studies demonstrate superior lighting realism and detail retention compared to existing methods.
The paper introduces SynthLight, a portrait relighting diffusion model that learns to re-render synthetic faces under varying lighting conditions. The core idea involves training a diffusion model to perform pixel transformations conditioned on lighting, effectively bypassing explicit inverse rendering.
To achieve this, the authors render a synthetic dataset of 3D human heads with diverse appearances and expressions in Blender, using the Cycles renderer. The dataset contains roughly 1.26 million images rendered at 512×512 resolution. Each sample is rendered with 10 random high dynamic range (HDR) environment maps, each rotated 36 times.
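If each rendered subject configuration is paired with 10 HDR maps, each rotated 36 times, then 1.26 million images correspond to roughly 3,500 configurations. A minimal sketch of how such a render queue could be enumerated, with placeholder identifiers and an assumed 10° rotation increment (only the 10-HDRIs × 36-rotations structure comes from the paper):

```python
import itertools
import random

# Placeholder identifiers; only the 10-HDRIs x 36-rotations structure is from the paper.
# 1.26M images / (10 * 36) implies roughly 3,500 rendered subject configurations.
SUBJECTS = [f"head_{i:04d}" for i in range(3500)]
HDRI_POOL = [f"env_{i:03d}.hdr" for i in range(500)]      # hypothetical HDRI library
ROTATIONS_DEG = [k * 10 for k in range(36)]               # 36 rotations, assumed 10 degrees apart

def render_jobs(rng=None):
    """Yield one render job per (subject, HDRI, rotation) combination."""
    rng = rng or random.Random(0)
    for subject in SUBJECTS:
        chosen = rng.sample(HDRI_POOL, 10)                # 10 random environment maps per subject
        for hdri, angle in itertools.product(chosen, ROTATIONS_DEG):
            yield {"subject": subject, "hdri": hdri, "rotation_deg": angle}
```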
The method builds upon Stable Diffusion, incorporating the input portrait I and target environment map E into the network's input. The training samples consist of tuples (I, E, T, I_R), where T is a text prompt obtained via image captioning and I_R is the target relit portrait. The HDR environment map is converted to a low dynamic range (LDR) representation via tone-mapping. The LDR environment map, the input portrait, and the target portrait are encoded with Stable Diffusion's VAE, yielding latent representations $\hat{I}_i$, $\hat{E}^{\mathrm{LDR}}_j$, and $\hat{I}_j$, respectively. Gaussian noise $\epsilon$ is added to the relit-image latent $\hat{I}_j$ at timestep $t$ to obtain the noised latent $\hat{I}^t_j$. The U-Net $\epsilon_\theta$ is trained with the DDPM objective:
$$\min_\theta \; \mathbb{E}_{x \sim \mathrm{Enc}(I_R),\, t,\, \epsilon \sim \mathcal{N}(0, I)} \big\| \epsilon_\theta(x_t, I, E, T) - \epsilon \big\|$$
where:
- θ are the parameters of the U-Net
- Enc is the encoder of Stable Diffusion's VAE
- IR is the relit portrait
- xt is the noised latent representation of the relit image at time t
- I is the input portrait
- E is the environment map
- T is the text prompt
- ϵ is Gaussian noise
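A minimal PyTorch-style sketch of one training step under these definitions, using diffusers-style `vae`, `unet`, `scheduler`, and `text_encoder` objects; the Reinhard tone-mapping operator, the channel-wise concatenation of the conditioning latents, and all helper names are assumptions rather than the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def tonemap(hdr):
    # Reinhard-style tone mapping to bring the HDR environment map into LDR range (assumed operator).
    return hdr / (1.0 + hdr)

def training_step(vae, unet, scheduler, text_encoder, batch):
    # batch: input portrait I, HDR environment map E, caption tokens for T, relit target I_R
    I, E_hdr, T_tokens, I_R = batch["input"], batch["env_hdr"], batch["tokens"], batch["relit"]

    with torch.no_grad():
        z_I  = vae.encode(I).latent_dist.sample() * 0.18215               # latent of input portrait
        z_E  = vae.encode(tonemap(E_hdr)).latent_dist.sample() * 0.18215  # latent of LDR env map
        z_IR = vae.encode(I_R).latent_dist.sample() * 0.18215             # latent of relit target
        text = text_encoder(T_tokens)[0]                                  # text embedding of prompt T

    # DDPM forward process: noise the relit latent at a random timestep t.
    t = torch.randint(0, scheduler.config.num_train_timesteps, (z_IR.shape[0],), device=z_IR.device)
    eps = torch.randn_like(z_IR)
    x_t = scheduler.add_noise(z_IR, eps, t)

    # Condition the U-Net on the noised latent plus the input-portrait and env-map latents
    # (channel-wise concatenation is one plausible conditioning scheme, not necessarily the authors').
    eps_pred = unet(torch.cat([x_t, z_I, z_E], dim=1), t, encoder_hidden_states=text).sample

    # DDPM objective: || eps_theta(x_t, I, E, T) - eps ||
    return F.mse_loss(eps_pred, eps)
```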
To bridge the domain gap between synthetic and real images, the authors employ a multi-task training strategy, incorporating real-world portrait images from the LAION dataset without ground truth relighting information. During training, the sampling ratios of the synthetic dataset versus the real dataset are empirically set to 0.7 and 0.3, respectively.
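The paper states only the 0.7/0.3 sampling ratios; one simple way this mixing could be realized during batch construction is sketched below (the dataset wrappers, task labels, and per-example random draw are assumptions):

```python
import random

def sample_batch(synthetic_ds, real_ds, batch_size, p_synthetic=0.7, rng=None):
    """Draw each training example from the synthetic or real pool with the stated 0.7/0.3 ratio.

    Synthetic samples carry a relit ground-truth target; real LAION samples do not, so they
    are tagged for the auxiliary task used by the multi-task strategy (details per the paper).
    """
    rng = rng or random.Random(0)
    batch = []
    for _ in range(batch_size):
        if rng.random() < p_synthetic:
            batch.append(("relight", synthetic_ds[rng.randrange(len(synthetic_ds))]))
        else:
            batch.append(("real_no_gt", real_ds[rng.randrange(len(real_ds))]))
    return batch
```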
During inference, an adaptation scheme balances identity preservation against relighting strength. At each diffusion step, the final noise estimate is composed, in classifier-free-guidance fashion, from unconditional, image-conditional, and image-and-text-conditional U-Net outputs. The final score estimate is computed as:
$$\epsilon_t = \epsilon_\theta(x_{t+1}, \phi, E, \phi) + \lambda_T \big( \epsilon_\theta(x_{t+1}, I, E, T) - \epsilon_\theta(x_{t+1}, I, E, \phi) \big) + \lambda_I \big( \epsilon_\theta(x_{t+1}, I, E, \phi) - \epsilon_\theta(x_{t+1}, \phi, E, \phi) \big)$$
where:
- ϵt is the final score estimate
- ϵθ is the U-Net
- xt+1 is the latent at timestep t+1
- ϕ is a null input (i.e., dropping the input image or text prompt)
- I is the input image
- E is the environment map
- T is the text prompt
- λT is the text prompt guidance parameter
- λI is the input portrait guidance parameter
The guidance parameters λT and λI control the influence of the text prompt and input portrait, respectively. The authors empirically find that setting λI∈[2,3] achieves a balance between detail preservation and effective relighting.
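A minimal sketch of this compound guidance at a single sampling step, reusing the channel-concatenation conditioning assumed in the training sketch above; the null-conditioning placeholders and the default λT value are assumptions, while λI is set inside the [2, 3] range reported by the authors:

```python
import torch

def guided_noise_estimate(unet, x, t, z_I, z_E, text_emb, null_img, null_text,
                          lam_T=7.5, lam_I=2.5):
    """Compose the final noise estimate from unconditional, image-conditional, and
    image+text-conditional U-Net outputs, following the guidance equation above.
    lam_I in [2, 3] trades detail preservation against relighting strength;
    lam_T = 7.5 is a typical text-guidance default, assumed here."""
    # epsilon_theta(x, phi, E, phi): image and text dropped, environment map kept.
    eps_uncond   = unet(torch.cat([x, null_img, z_E], dim=1), t,
                        encoder_hidden_states=null_text).sample
    # epsilon_theta(x, I, E, phi): image kept, text dropped.
    eps_img      = unet(torch.cat([x, z_I, z_E], dim=1), t,
                        encoder_hidden_states=null_text).sample
    # epsilon_theta(x, I, E, T): image and text kept.
    eps_img_text = unet(torch.cat([x, z_I, z_E], dim=1), t,
                        encoder_hidden_states=text_emb).sample

    return eps_uncond + lam_T * (eps_img_text - eps_img) + lam_I * (eps_img - eps_uncond)
```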
The method is evaluated quantitatively on a test set of synthetic faces and a Light Stage dataset, using metrics such as SSIM, PSNR and LPIPS, as well as FaceNet distance for identity preservation. Qualitative results on in-the-wild images showcase detailed illumination effects, including cast shadows and specular highlights.
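A hedged sketch of how such per-image metrics might be computed with common open-source tooling (scikit-image, the `lpips` package, and `facenet-pytorch`); this illustrates the metric definitions rather than the authors' evaluation code, and the input normalization details are assumptions:

```python
import torch
import lpips                                    # pip install lpips
from skimage.metrics import structural_similarity, peak_signal_noise_ratio
from facenet_pytorch import InceptionResnetV1   # pip install facenet-pytorch

lpips_fn = lpips.LPIPS(net="alex")                          # perceptual distance network
facenet = InceptionResnetV1(pretrained="vggface2").eval()   # face-embedding network

def relighting_metrics(pred, target):
    """pred/target: float32 HxWx3 arrays in [0, 1]; aligned face crops are assumed."""
    ssim = structural_similarity(pred, target, channel_axis=-1, data_range=1.0)
    psnr = peak_signal_noise_ratio(target, pred, data_range=1.0)

    # LPIPS and FaceNet expect NCHW tensors; both are fed values rescaled to roughly [-1, 1]
    # (an approximation of the standardization each model was trained with).
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0).float() * 2 - 1
    with torch.no_grad():
        lp = lpips_fn(to_t(pred), to_t(target)).item()
        emb_pred, emb_tgt = facenet(to_t(pred)), facenet(to_t(target))
        face_dist = torch.norm(emb_pred - emb_tgt, dim=1).item()   # identity-preservation distance

    return {"ssim": ssim, "psnr": psnr, "lpips": lp, "facenet_dist": face_dist}
```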
A user study was conducted to assess perceptual lighting accuracy, identity preservation, and overall image quality, comparing SynthLight to other methods. The results indicate that SynthLight is preferred across all evaluated aspects.
Ablation studies demonstrate the contributions of the multi-task training and inference-time adaptation techniques, showing that both help the model retain rich facial priors.
The paper acknowledges limitations including a lack of diversity in camera poses, facial accessories, and clothing textures in the synthetic training data.