DifFRelight: Diffusion-Based Facial Performance Relighting (2410.08188v1)

Published 10 Oct 2024 in cs.CV, cs.AI, and cs.GR

Abstract: We present a novel framework for free-viewpoint facial performance relighting using diffusion-based image-to-image translation. Leveraging a subject-specific dataset containing diverse facial expressions captured under various lighting conditions, including flat-lit and one-light-at-a-time (OLAT) scenarios, we train a diffusion model for precise lighting control, enabling high-fidelity relit facial images from flat-lit inputs. Our framework includes spatially-aligned conditioning of flat-lit captures and random noise, along with integrated lighting information for global control, utilizing prior knowledge from the pre-trained Stable Diffusion model. This model is then applied to dynamic facial performances captured in a consistent flat-lit environment and reconstructed for novel-view synthesis using a scalable dynamic 3D Gaussian Splatting method to maintain quality and consistency in the relit results. In addition, we introduce unified lighting control by integrating a novel area lighting representation with directional lighting, allowing for joint adjustments in light size and direction. We also enable high dynamic range imaging (HDRI) composition using multiple directional lights to produce dynamic sequences under complex lighting conditions. Our evaluations demonstrate the model's efficiency in achieving precise lighting control and generalizing across various facial expressions while preserving detailed features such as skin texture and hair. The model accurately reproduces complex lighting effects like eye reflections, subsurface scattering, self-shadowing, and translucency, advancing photorealism within our framework.

Summary

  • The paper presents a diffusion-based framework that relights dynamic facial performances from flat-lit captures with high fidelity.
  • It employs spatially-aligned conditioning and dynamic 3D Gaussian Splatting to ensure temporal consistency and effective novel-view synthesis.
  • Unified lighting control combining area and directional lighting enables HDRI composition and outperforms baselines on key quantitative metrics.

The paper introduces DifFRelight, a novel framework for facial performance relighting using diffusion-based image-to-image translation. The method leverages a subject-specific dataset with varied facial expressions under different lighting conditions, including flat-lit and one-light-at-a-time (OLAT) scenarios. A diffusion model is trained to enable high-fidelity relit facial images from flat-lit inputs. The framework integrates spatially-aligned conditioning of flat-lit captures and random noise, along with lighting information for global control, utilizing prior knowledge from the pre-trained Stable Diffusion model. Dynamic facial performances captured in a consistent flat-lit environment are reconstructed for novel-view synthesis using a scalable dynamic 3D Gaussian Splatting (3DGS) method. The paper also introduces unified lighting control by integrating a novel area lighting representation with directional lighting, allowing for joint adjustments in light size and direction, and enables high dynamic range imaging (HDRI) composition using multiple directional lights to produce dynamic sequences under complex lighting conditions.

The paper's contributions are:

  • A framework for facial performance relighting that trains on subject-specific datasets and generalizes to novel lighting, enabling relighting from free viewpoints and unseen facial expressions.
  • A diffusion-based relighting model that spatially conditions the flat-lit input image, utilizing lighting information as global control to generate high-quality relit results.
  • A scalable dynamic 3DGS technique for reconstructing long sequences, ensuring temporal consistency in flat-lit inputs for coherent inference by the relighting model.
  • A unified lighting control that combines a new area lighting representation with directional lighting, offering versatile lighting controls and enabling the composition of complex environment lighting.

Non-Diffusion-Based Relighting Methods:

The paper notes that parametric reflectance modeling struggles with dynamic subjects and complex materials. Image-based relighting is effective but costly and challenging for dynamic subjects. Intrinsic image relighting faces challenges with complex shading effects and detail preservation. Neural relighting approaches using Neural Radiance Fields (NeRFs) face challenges in novel poses and detail preservation, while 3DGS is used for relighting due to its fine detail reconstruction.

Diffusion-Based Relighting Methods:

Conditional diffusion models are used for relighting by learning complex light interactions and leveraging a generalizable generative prior pre-trained on a large dataset of images under various lighting conditions.

The method consists of two main components: dynamic 3D performance reconstruction and diffusion-based relighting. The approach uses multi-view performance sequences in a flat-lit environment. Dynamic 3DGS is used to reconstruct deformable 3D Gaussians to render the sequence from novel perspectives. Then the diffusion-based relighting model generates new lighting for the rendered image sequence based on a specified lighting direction. This model is trained with subject-specific paired data of flat-lit and OLAT images captured using a customized LED-panel stage.
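As a rough illustration of this two-stage pipeline, the sketch below wires the reconstruction and relighting steps together; the function names and signatures are hypothetical, since the paper does not describe a public API.

```python
# Minimal sketch of the two-stage pipeline (hypothetical interfaces):
# stage 1 fits a dynamic 3DGS model to flat-lit multi-view frames and
# renders a novel view; stage 2 relights each rendered frame with the
# diffusion model for a chosen lighting direction.
from typing import Callable, Sequence


def relight_performance(
    multiview_frames: Sequence,      # flat-lit multi-view captures per frame
    fit_dynamic_3dgs: Callable,      # frames -> renderable dynamic 3DGS model
    render_novel_view: Callable,     # (model, frame_idx, camera) -> flat-lit image
    diffusion_relight: Callable,     # (flat_lit_image, light_dir) -> relit image
    camera,                          # novel viewpoint
    light_dir,                       # target lighting direction
):
    gaussians = fit_dynamic_3dgs(multiview_frames)              # stage 1
    return [
        diffusion_relight(render_novel_view(gaussians, t, camera), light_dir)  # stage 2
        for t in range(len(multiview_frames))
    ]
```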

Facial Data Capture:

The capture stage is a multi-view camera array placed within a capped cylinder of LED panels. By turning on different LED panels, the same subject is captured under varying lighting conditions. The reflectance field is captured as OLAT images, using each panel in sequence as a single light source at 24 frames per second in sync with the camera array.
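The OLAT protocol amounts to cycling through the panels one per frame while all cameras fire in sync; the loop below sketches that schedule with hypothetical panel and camera interfaces.

```python
# Illustrative OLAT capture pass (hypothetical hardware interfaces): each
# LED panel is the sole light source for exactly one synchronized multi-view
# exposure, so one full reflectance-field pass takes len(led_panels) frames
# at the 24 fps capture rate.
def capture_olat_pass(led_panels, cameras):
    olat_frames = []
    for panel in led_panels:
        panel.on()                                   # single active light source
        views = [cam.capture() for cam in cameras]   # synchronized exposure across the array
        panel.off()
        olat_frames.append(views)
    return olat_frames                               # len(led_panels) x len(cameras) images
```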

Diffusion-Based Relighting:

The objective is to train a personalized model capable of relighting the flat-lit image of the same subject under novel views, novel lightings, and novel expressions/poses. The pre-trained Stable Diffusion model is fine-tuned with paired data conditioned on lighting information. The light direction $\mathbf{d}$ is encoded using Spherical Harmonics (SH): $\mathbf{s}_d = \mathbf{0} \oplus \mathcal{Y}(\mathbf{d})$, where $\oplus$ denotes concatenation and $\mathcal{Y}$ denotes the SH encoding. The training objective is $\mathcal{L}_\text{diffusion} = \|\hat{\epsilon}(\mathbf{z}_{OLAT}^{(t)} \oplus \mathcal{E}(\mathbf{I}_{FlatLit}); \mathbf{s}_d, t) - \epsilon\|_2^2$, where $\mathbf{z}_{OLAT}^{(t)}$ is a noisy ground-truth latent at diffusion time step $t$, combining the ground-truth latent $\mathcal{E}(\mathbf{I}_{OLAT})$ and a random noise map $\epsilon$: $\mathbf{z}_{OLAT}^{(t)} = \sqrt{\alpha_t}\, \mathcal{E}(\mathbf{I}_{OLAT}) + \sqrt{1 - \alpha_t}\, \epsilon$, with $\alpha_t$ a predefined value that schedules the diffusion process. Pyramid noise is used to improve the color consistency between the prediction and the ground truth.
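The training objective can be written compactly in code. The sketch below is illustrative rather than the authors' implementation: it assumes a VAE encoder and a noise-prediction U-Net with the signatures shown, uses a low-order SH encoding, and substitutes standard Gaussian noise where the paper uses pyramid noise.

```python
import torch


def sh_encode_direction(d: torch.Tensor) -> torch.Tensor:
    # Real spherical-harmonic encoding of a unit light direction (bands 0-1
    # shown; higher bands follow the same pattern with standard SH constants).
    x, y, z = d[..., 0], d[..., 1], d[..., 2]
    sh = torch.stack([
        0.282095 * torch.ones_like(x),   # Y_0^0
        0.488603 * y,                    # Y_1^{-1}
        0.488603 * z,                    # Y_1^0
        0.488603 * x,                    # Y_1^1
    ], dim=-1)
    # The paper prepends a zero vector, s_d = 0 (+) Y(d); its size is
    # assumed equal to the SH encoding here for simplicity.
    return torch.cat([torch.zeros_like(sh), sh], dim=-1)


def diffusion_loss(unet, vae_encode, I_olat, I_flatlit, d, t, alpha_bar):
    # One training step of the relighting objective. `unet` is assumed to
    # predict the noise from the channel-concatenated latents plus the SH
    # lighting embedding; plain Gaussian noise stands in for pyramid noise.
    z_olat = vae_encode(I_olat)                                  # E(I_OLAT)
    eps = torch.randn_like(z_olat)
    z_t = alpha_bar[t].sqrt() * z_olat + (1 - alpha_bar[t]).sqrt() * eps
    cond = torch.cat([z_t, vae_encode(I_flatlit)], dim=1)        # spatially aligned conditioning
    s_d = sh_encode_direction(d)                                 # global lighting control
    return ((unet(cond, s_d, t) - eps) ** 2).mean()              # ||eps_hat - eps||_2^2
```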

Dynamic Performance Relighting:

A scalable method based on deformable 3DGS is introduced to optimize 3D Gaussians over lengthy performance sequences. The sequence is partitioned into small segments with an equal number of frames, allowing the Gaussians to vary across segments. A two-stage training strategy minimizes temporal inconsistency at the transition frame between segments and preserves a similar level of reconstruction detail across segments.
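A minimal sketch of the segmenting idea follows, assuming a one-frame overlap at each transition so the boundary frame is shared between neighboring segments; the exact segment length and overlap scheme are not specified in this summary.

```python
# Sketch: split a long performance into (near-)equal segments, each with its
# own set of deformable Gaussians, sharing the boundary frame with the next
# segment so the second training stage can enforce consistency there.
# The one-frame overlap is an illustrative assumption.
def partition_sequence(num_frames: int, segment_len: int):
    segments, start = [], 0
    while start < num_frames - 1:
        end = min(start + segment_len, num_frames)
        segments.append(range(start, end))
        start = end - 1                   # reuse the transition frame
    return segments

# partition_sequence(100, 25) -> frames 0-24, 24-48, 48-72, 72-96, 96-99
```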

Unified Lighting Control:

A novel area lighting representation, encoding both lighting direction and a variable light size, is proposed and integrated with directional lighting into a unified lighting control that guides the diffusion-based relighting model. Building on single-lighting inference, HDRI environment lighting is reconstructed by compositing multiple single-lighting inferences.
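Compositing of this kind follows the linearity of light transport: each directional-light inference is weighted by the environment radiance sampled along its direction and summed. The sketch below is a generic version of that idea with hypothetical sampling and weighting, not the paper's exact compositing formula.

```python
import numpy as np


def composite_hdri(single_light_renders, light_dirs, sample_hdri):
    # Weighted sum of per-direction relit frames. `sample_hdri(d)` is assumed
    # to return the RGB radiance of the environment map toward direction d;
    # solid-angle normalization is omitted for brevity.
    out = np.zeros_like(single_light_renders[0], dtype=np.float64)
    for img, d in zip(single_light_renders, light_dirs):
        out += sample_hdri(d) * img.astype(np.float64)
    return out
```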

Experiments:

The effectiveness of the diffusion-based relighting method is showcased on the testing data of the four captured subjects, demonstrating realistic skin texture and reflectance, eye highlights, and fine hair structures while maintaining subject-specific identity features. Comparisons are performed against baselines built with different network structures: a diffusion-based model built on ControlNet and a U-Net-based model. Ablation studies verify various technical designs within the system, and a simple extension assesses model generalization to novel subjects.

Quantitative Results:

The proposed method achieves the best quantitative outcomes, significantly enhancing lighting accuracy, color fidelity, and overall image quality according to PSNR, SSIM, LPIPS, and FLIP metrics. The ControlNet-based method underperforms in preserving spatial details and color accuracy, while the U-Net-based method aligns better spatially with the flat-lit image but tends to produce blurrier results. On the Novel Light test set:

  • ControlNet-based baseline: PSNR 20.30, SSIM 0.6708, LPIPS 0.2281, FLIP 0.2304
  • U-Net-based baseline: PSNR 23.66, SSIM 0.7971, LPIPS 0.2988, FLIP 0.1586
  • Proposed method: PSNR 30.29, SSIM 0.8212, LPIPS 0.1750, FLIP 0.0825

Limitations:

The technique does not completely resolve the temporal consistency issue of image-based diffusion models due to the absence of video training data, and the optical flow post-processing occasionally introduces artifacts during rapid movements. The subject-specific training is not designed to generalize to unseen subjects and can alter identity features in the relit results.
