
Disentangled and Controllable Face Image Generation via 3D Imitative-Contrastive Learning (2004.11660v2)

Published 24 Apr 2020 in cs.CV

Abstract: We propose DiscoFaceGAN, an approach for face image generation of virtual people with disentangled, precisely-controllable latent representations for identity of non-existing people, expression, pose, and illumination. We embed 3D priors into adversarial learning and train the network to imitate the image formation of an analytic 3D face deformation and rendering process. To deal with the generation freedom induced by the domain gap between real and rendered faces, we further introduce contrastive learning to promote disentanglement by comparing pairs of generated images. Experiments show that through our imitative-contrastive learning, the factor variations are very well disentangled and the properties of a generated face can be precisely controlled. We also analyze the learned latent space and present several meaningful properties supporting factor disentanglement. Our method can also be used to embed real images into the disentangled latent space. We hope our method could provide new understandings of the relationship between physical properties and deep image synthesis.

Citations (317)

Summary

  • The paper presents a novel method that integrates 3D priors with GANs to achieve disentangled face image generation.
  • It employs imitative and contrastive learning to isolate and control key attributes such as identity, pose, and illumination.
  • Experiments demonstrate competitive FID and PPL scores, balancing image realism with precise attribute controllability.

Disentangled and Controllable Face Image Generation via 3D Imitative-Contrastive Learning

The paper "Disentangled and Controllable Face Image Generation via 3D Imitative-Contrastive Learning" presents DiscoFaceGAN, a method built on Generative Adversarial Networks (GANs) for generating realistic face images of non-existent individuals. Its particular focus is disentangling and precisely controlling four factors of variation: identity, expression, pose, and illumination.

Methodology and Theoretical Insights

The crux of the DiscoFaceGAN approach lies in combining 3D priors with adversarial learning to address the challenges of generating and controlling high-fidelity face images. By embedding a 3D Morphable Model (3DMM) into the learning process, the authors generate face images whose controllable properties are guided by an explicit parametric model. This incorporation of 3D priors lets the generator mimic the analytic face deformation and rendering pipeline, through a scheme the authors term imitative learning.
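The imitation idea can be sketched as a loss that pulls the generator's output toward the analytic 3DMM rendering inside the face region. This is a simplified, hypothetical pixel-wise variant (the paper also uses feature-level terms); the function and argument names are illustrative, not the authors' API.

```python
import numpy as np

def imitation_loss(generated, rendered, face_mask):
    """Simplified imitation loss: masked mean absolute difference between
    the GAN output and the analytic 3DMM rendering. Only the face region
    (where face_mask is 1) is compared, since the renderer produces no
    hair or background."""
    diff = np.abs(generated - rendered) * face_mask
    return diff.sum() / max(face_mask.sum(), 1.0)
```

In practice such a term is added to the usual adversarial loss, so the generator must simultaneously fool the discriminator and match the rendered geometry, albedo, and lighting.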

However, a domain gap naturally arises between photorealistic images and the rendered 3D face models. This gap is tackled using contrastive learning, which focuses on disentangling the variations of different facial features by carefully comparing pairs of generated images that share most but not all facial attributes. Through these comparative training schemes, the GAN is trained to identify the influence of each specific latent variable on the final image, enhancing the level of controllability over the generated content.
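The pairing scheme can be sketched as follows: for two images generated from latents that are identical except for one factor, features corresponding to every *other* factor should remain unchanged. The dictionary-of-features structure and names below are a hypothetical simplification, not the paper's exact formulation.

```python
import numpy as np

def contrastive_pair_loss(feats_a, feats_b, changed):
    """Given factor-wise feature vectors for a generated pair whose
    latents differ only in the factor `changed`, penalize any drift in
    the factors that were held fixed. feats_a / feats_b map factor
    names (e.g. 'identity', 'pose') to feature vectors."""
    loss = 0.0
    for name in feats_a:
        if name == changed:
            continue  # the varied factor is expected to differ
        loss += float(np.sum((feats_a[name] - feats_b[name]) ** 2))
    return loss
```

Minimizing such a loss over many pairs pushes each latent variable to affect only its designated attribute, which is the disentanglement effect the paper reports.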

Numerical Results and Evaluations

The efficacy of DiscoFaceGAN is evaluated through extensive experiments showing high-quality image generation results with significant disentanglement of facial attributes. The authors report quantitative metrics such as the Fréchet Inception Distance (FID) and Perceptual Path Length (PPL) to benchmark the generation quality compared to previous models like StyleGAN. The results indicate a successful trade-off between image realism and factor disentanglement, although with an inevitable slight increase in FID due to the additional constraints of imitation and contrastive learning.
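For reference, FID is the Fréchet distance between two Gaussians fitted to Inception feature statistics of real and generated images: FID = ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^½). A minimal numpy implementation (the matrix square root is computed via eigendecomposition of a symmetrized product):

```python
import numpy as np

def _sqrtm_psd(mat):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet Inception Distance between two Gaussians (mu, sigma).
    Uses Tr((S1 S2)^1/2) = Tr((S1^1/2 S2 S1^1/2)^1/2) so that only
    symmetric matrices need square-rooting."""
    s1_half = _sqrtm_psd(sigma1)
    covmean = _sqrtm_psd(s1_half @ sigma2 @ s1_half)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Lower is better; the slight FID increase the authors observe reflects the extra imitative and contrastive constraints restricting the generator relative to an unconstrained StyleGAN.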

Specifically, the disentangling scores demonstrate that varying one latent variable effectively isolates the corresponding facial property change without markedly affecting others. This capability is crucial for applications requiring finely tuned control over synthetic images.
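One way to quantify this isolation (a hypothetical simplification, not the paper's exact disentanglement score) is to perturb a single latent factor, measure the resulting change in each facial attribute, and report what fraction of the total change lands on the intended attribute:

```python
import numpy as np

def leakage_ratio(deltas, target):
    """Given per-attribute changes measured after perturbing one latent
    factor, return the fraction of total change on the intended
    attribute (1.0 = perfect isolation, i.e. no leakage)."""
    deltas = np.abs(np.asarray(deltas, dtype=float))
    total = deltas.sum()
    return float(deltas[target] / total) if total > 0 else 0.0
```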

Implications and Future Directions

Practically, DiscoFaceGAN provides a robust framework for generating diverse face datasets with finely adjustable attributes, which benefits many computer vision and graphics applications, including virtual reality, video games, and augmenting training datasets for machine learning models.

From a theoretical standpoint, this work contributes insights into the relationships between physical properties and deep image synthesis, suggesting potential in applying similar disentangled learning strategies to other domains of image and video generation. Future developments might focus on further refining the disentanglement process, exploring larger scale synthesis that incorporates other factors such as texture dynamics, and extending the model’s applications to areas like forensic analysis and anti-spoofing technologies.

In conclusion, DiscoFaceGAN represents an advancement in controlled image generation, underscoring the importance of both theoretical understanding and practical implementations of disentangled representation learning, while opening promising avenues for future research endeavors.