
Deep Appearance Models for Face Rendering (1808.00362v1)

Published 1 Aug 2018 in cs.GR and cs.CV

Abstract: We introduce a deep appearance model for rendering the human face. Inspired by Active Appearance Models, we develop a data-driven rendering pipeline that learns a joint representation of facial geometry and appearance from a multiview capture setup. Vertex positions and view-specific textures are modeled using a deep variational autoencoder that captures complex nonlinear effects while producing a smooth and compact latent representation. View-specific texture enables the modeling of view-dependent effects such as specularity. In addition, it can also correct for imperfect geometry stemming from biased or low resolution estimates. This is a significant departure from the traditional graphics pipeline, which requires highly accurate geometry as well as all elements of the shading model to achieve realism through physically-inspired light transport. Acquiring such a high level of accuracy is difficult in practice, especially for complex and intricate parts of the face, such as eyelashes and the oral cavity. These are handled naturally by our approach, which does not rely on precise estimates of geometry. Instead, the shading model accommodates deficiencies in geometry through the flexibility afforded by the neural network employed. At inference time, we condition the decoding network on the viewpoint of the camera in order to generate the appropriate texture for rendering. The resulting system can be implemented simply using existing rendering engines through dynamic textures with flat lighting. This representation, together with a novel unsupervised technique for mapping images to facial states, results in a system that is naturally suited to real-time interactive settings such as Virtual Reality (VR).

Citations (271)

Summary

  • The paper presents a deep appearance model using conditional VAEs to learn a joint representation of facial geometry and texture from multi-camera data.
  • It outperforms traditional graphics pipelines by accurately rendering subtle facial details and handling complex nonlinear variations in a compact latent space.
  • The model achieves real-time performance, enabling robust VR and interactive applications while mitigating the need for highly precise geometric data.

An Overview of Deep Appearance Models for Face Rendering

The paper "Deep Appearance Models for Face Rendering" introduces an approach for rendering human faces using a deep appearance model. It significantly departs from traditional graphics pipelines by integrating concepts from Active Appearance Models (AAMs) and variational autoencoders (VAEs). This work leverages data captured from a multi-camera setup to learn a joint representation of facial geometry and appearance, enhancing the realism of rendered faces, particularly in virtual reality (VR) settings.

Methodology

The core innovation lies in the use of deep conditional variational autoencoders (CVAEs) to create a compact, smooth latent representation that captures high-fidelity details of both facial geometry and view-dependent appearance. Traditional AAMs synthesize images from learned linear correlations between shape and appearance, and that linearity limits their expressive power. The authors replace this linear basis with a deep VAE, which can capture the complex nonlinear variations present in the data.
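To make the architecture concrete, below is a minimal sketch, assuming PyTorch, of a conditional VAE in this spirit: vertex positions and a texture descriptor are jointly encoded into a single latent code, and the decoder is conditioned on the camera view direction. The class name, layer sizes, and the use of a flat texture feature vector (the paper works with full view-specific texture maps) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DeepAppearanceCVAE(nn.Module):
    """Illustrative conditional VAE: jointly encodes vertex positions and a
    texture descriptor into a latent code z, then decodes them back with the
    decoder conditioned on the camera view direction."""

    def __init__(self, n_verts=1000, tex_dim=256, latent_dim=128):
        super().__init__()
        self.n_verts = n_verts
        # Encoder maps flattened geometry + texture descriptor to (mu, logvar).
        self.enc = nn.Sequential(
            nn.Linear(n_verts * 3 + tex_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 2 * latent_dim),
        )
        # Decoder is conditioned on a 3-D view direction so it can model
        # view-dependent effects such as specularity.
        self.dec = nn.Sequential(
            nn.Linear(latent_dim + 3, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, n_verts * 3 + tex_dim),
        )

    def forward(self, verts, tex_feat, view_dir):
        h = self.enc(torch.cat([verts.flatten(1), tex_feat], dim=1))
        mu, logvar = h.chunk(2, dim=1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        out = self.dec(torch.cat([z, view_dir], dim=1))
        verts_hat = out[:, : self.n_verts * 3].view_as(verts)
        tex_hat = out[:, self.n_verts * 3:]
        return verts_hat, tex_hat, mu, logvar
```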

Their system encodes facial geometry and view-specific texture into a shared latent code and decodes them back in a fully data-driven fashion. A multi-camera rig supplies detailed training data, allowing the network to synthesize realistic textures that capture view-dependent effects such as specularity while compensating for imperfect geometry. At inference time, the decoder is conditioned on the camera viewpoint to produce the appropriate texture for rendering, yielding more realistic results than traditional physically-inspired light transport pipelines without requiring highly accurate geometry.
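A training objective consistent with this description is sketched below, again assuming PyTorch: reconstruction losses on geometry and texture plus a KL term that keeps the latent space smooth and compact. The loss weighting, tensor shapes, and function name are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def vae_objective(verts, tex, verts_hat, tex_hat, mu, logvar, kl_weight=1e-3):
    """Reconstruct geometry and view-specific texture; the KL term regularizes
    the latent code toward a standard normal prior (weighting is illustrative)."""
    rec_geom = F.mse_loss(verts_hat, verts)
    rec_tex = F.mse_loss(tex_hat, tex)
    kl = -0.5 * torch.mean(1.0 + logvar - mu.pow(2) - logvar.exp())
    return rec_geom + rec_tex + kl_weight * kl

# Dummy example: batch of 2, 100 vertices, 256-D texture descriptor, 32-D latent.
B, V, T, Z = 2, 100, 256, 32
loss = vae_objective(torch.rand(B, V, 3), torch.rand(B, T),
                     torch.rand(B, V, 3), torch.rand(B, T),
                     torch.zeros(B, Z), torch.zeros(B, Z))
```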

Results and Implications

The deep appearance model surpasses traditional graphics techniques in adaptability and computational efficiency. The authors demonstrate robust, high-fidelity rendering of facial detail even when precise geometric data is unavailable. Because the learned view-dependent textures absorb and correct geometric inaccuracies, the approach substantially lowers the precision required of captured geometry, a common hurdle in conventional graphics workflows.

The model's real-time performance also suits it to interactive applications such as VR, where dynamic facial expressions and interactions are critical. The unsupervised technique introduced for mapping images captured under different modalities to the model's facial states points toward broader applications in AI-driven image processing.

Experimental results reflect the substantial benefits of combining deep learning with image-based rendering. By learning to associate data from different modalities, the authors sidestep the manual correspondence otherwise needed to model facial states across varied sensor inputs.

Future Directions

The findings prompt several avenues for further research and practical application. Because the methodology does not strictly rely on precise estimates of facial geometry, adapting the framework to broader use cases such as full-body rendering or other object classes could provide value in interactive and entertainment domains. Further work might also integrate lighting as a conditioning variable to enable relighting.

On the theoretical side, this research suggests a convergence point between traditional explicit modeling techniques and data-driven implicit learning processes. The implications for generative model design and their deployment in resource-constrained environments (like VR) offer rich prospects for expanding the boundary of interactive AI systems.

Overall, the paper provides a comprehensive blueprint for employing deep appearance models in complex real-time rendering scenarios, bridging the gap between photorealism and computational feasibility for animated avatars.