Learning to Regress 3D Face Shape and Expression from an Image without 3D Supervision (1905.06817v1)

Published 16 May 2019 in cs.CV

Abstract: The estimation of 3D face shape from a single image must be robust to variations in lighting, head pose, expression, facial hair, makeup, and occlusions. Robustness requires a large training set of in-the-wild images, which by construction, lack ground truth 3D shape. To train a network without any 2D-to-3D supervision, we present RingNet, which learns to compute 3D face shape from a single image. Our key observation is that an individual's face shape is constant across images, regardless of expression, pose, lighting, etc. RingNet leverages multiple images of a person and automatically detected 2D face features. It uses a novel loss that encourages the face shape to be similar when the identity is the same and different for different people. We achieve invariance to expression by representing the face using the FLAME model. Once trained, our method takes a single image and outputs the parameters of FLAME, which can be readily animated. Additionally we create a new database of faces 'not quite in-the-wild' (NoW) with 3D head scans and high-resolution images of the subjects in a wide variety of conditions. We evaluate publicly available methods and find that RingNet is more accurate than methods that use 3D supervision. The dataset, model, and results are available for research purposes at http://ringnet.is.tuebingen.mpg.de.

Summary

The paper introduces RingNet, a novel framework that regresses 3D face parameters using 2D images and a shape consistency loss.
It employs an encoder-decoder ring structure that encourages consistent face shape estimates for the same person across different expressions and poses.
Experimental results on the NoW dataset show that RingNet outperforms traditional 3D-supervised models in accuracy and robustness.

Overview of "Learning to Regress 3D Face Shape and Expression from an Image without 3D Supervision"

The paper presents RingNet, an approach to regressing 3D face shape and expression parameters directly from 2D images without requiring 3D ground-truth supervision. The method builds on the observation that an individual's face shape remains constant across images, regardless of variations in expression, pose, lighting, and occlusion. The goal is robust 3D face reconstruction suitable for applications in virtual reality, animation, and biometrics.

RingNet combines two key ideas: it uses 2D facial landmarks that are detected automatically, and it enforces a shape consistency loss across multiple images of the same individual. The model is trained with 2D face features only, so it avoids the need for paired 2D-to-3D training data and shows how far weakly supervised training can go in computer vision.

Methodology and Technical Details

The paper describes the RingNet architecture, which regresses from image pixels to the parameters of the FLAME model, a parametric 3D face model that captures facial shape and expression. Representing the face with FLAME is what gives the method its invariance to expression. Training uses a corpus of in-the-wild images and a loss function designed so that the regression network learns consistent face shapes for the same individual across different poses and expressions.
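The 2D landmark supervision mentioned in the overview can be illustrated with a small sketch. It assumes a weak-perspective (scaled orthographic) camera and an L1 penalty between projected model landmarks and detected landmarks; the function names, the camera parameterization, and the choice of L1 are assumptions made for illustration, not details confirmed by this summary.

```python
import numpy as np


def project_weak_perspective(points_3d, scale, translation):
    """Drop the depth coordinate, then scale and shift in the image plane.

    The weak-perspective camera is an assumption made for this sketch.
    """
    points_3d = np.asarray(points_3d, dtype=float)
    return scale * points_3d[:, :2] + np.asarray(translation, dtype=float)


def landmark_loss(model_landmarks_3d, detected_2d, scale, translation):
    """Mean per-landmark L1 distance between projected and detected landmarks."""
    projected = project_weak_perspective(model_landmarks_3d, scale, translation)
    detected_2d = np.asarray(detected_2d, dtype=float)
    per_landmark = np.abs(projected - detected_2d).sum(axis=1)
    return float(per_landmark.mean())


if __name__ == "__main__":
    # Two hypothetical 3D landmarks and their detected 2D positions.
    model = [[1.0, 2.0, 5.0], [0.0, 0.0, 1.0]]
    detected = [[3.0, 5.0], [1.0, 2.0]]
    print(landmark_loss(model, detected, scale=2.0, translation=[1.0, 1.0]))
```

In a real training loop the 3D landmarks would come from the face model's output for the predicted parameters, and the loss would be minimized by backpropagation rather than evaluated once.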

The RingNet architecture consists of a series of encoder-decoder pairs arranged in a ring, each of which computes 3D face parameters for one input image. Shape constancy is enforced through a shape consistency loss that pulls together the shape parameters estimated from images of the same person and pushes apart those estimated from different people. The authors generalize the usual triplet-loss setup to what they term a "shape ring," which they argue is essential for learning accurate 3D geometry without direct supervision.
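The shape consistency idea can be sketched as a triplet-style hinge over a ring of shape codes: every pair of codes from the same person should be closer than the pair formed with a different person, by at least a margin. The hinge form, the squared Euclidean distance, the averaging, and the margin value below are assumptions made for illustration; the paper's exact formulation may differ.

```python
import numpy as np


def shape_consistency_loss(same_shapes, other_shape, margin=0.5):
    """Triplet-style hinge over a ring of shape codes.

    same_shapes: array of shape (n, d), codes from n >= 2 images of one person.
    other_shape: array of shape (d,), code from an image of a different person.
    margin: illustrative value, not taken from the summary.
    """
    same_shapes = np.asarray(same_shapes, dtype=float)
    other_shape = np.asarray(other_shape, dtype=float)
    n = same_shapes.shape[0]
    if n < 2:
        raise ValueError("need at least two images of the same person")

    total, count = 0.0, 0
    for i in range(n):
        # Squared distance from anchor i to the different-identity code.
        d_neg = np.sum((same_shapes[i] - other_shape) ** 2)
        for j in range(n):
            if i == j:
                continue
            # Squared distance from anchor i to another same-identity code.
            d_pos = np.sum((same_shapes[i] - same_shapes[j]) ** 2)
            total += max(0.0, d_pos - d_neg + margin)
            count += 1
    return total / count


if __name__ == "__main__":
    same = [[0.0, 0.0], [1.0, 0.0]]
    print(shape_consistency_loss(same, [0.0, 0.0]))  # different person too close
    print(shape_consistency_loss(same, [3.0, 4.0]))  # different person far away
```

The loss is zero once every same-identity pair is closer than the different-identity pair by the margin, so it stops pushing shapes apart after identities are separated well enough.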

Dataset and Evaluation

A further contribution of this work is the NoW (Not quite in-the-Wild) dataset, built as a benchmark for 3D face reconstruction methods. It pairs high-resolution images with corresponding 3D head scans and comes with a standard evaluation metric, so that methods can be compared under variations in pose, lighting, and occlusion.
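The summary does not spell out the evaluation metric. A common choice for scan-based benchmarks is the distance from each ground-truth scan point to the reconstructed surface after rigid alignment. The sketch below is a simplified stand-in under stated assumptions: it presumes the scan and the predicted mesh are already aligned, and it measures distance to the nearest mesh vertex instead of the nearest point on the surface. The function name is hypothetical.

```python
import numpy as np


def scan_to_vertex_distances(scan_points, mesh_vertices):
    """Distance from each scan point to its nearest predicted mesh vertex.

    Assumes both point sets are already rigidly aligned in the same frame.
    Nearest-vertex distance is an upper bound on true point-to-surface distance.
    """
    scan = np.asarray(scan_points, dtype=float)
    verts = np.asarray(mesh_vertices, dtype=float)
    # Pairwise differences, shape (num_scan_points, num_vertices, 3).
    diffs = scan[:, None, :] - verts[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return dists.min(axis=1)


if __name__ == "__main__":
    scan = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
    verts = [[0.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
    d = scan_to_vertex_distances(scan, verts)
    print("median:", np.median(d), "mean:", d.mean(), "std:", d.std())
```

Summary statistics such as the median, mean, and standard deviation of these distances then give a single-number comparison between methods. For large scans, a spatial index such as a k-d tree would replace the dense pairwise computation.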

The authors demonstrate that RingNet achieves superior performance to existing models that rely on 3D supervision, verified through comprehensive quantitative evaluations on both the NoW dataset and the benchmark dataset by Feng et al. The results highlight not only improved accuracy but also enhanced robustness across diverse real-world conditions.

Implications and Future Work

The implications of this research are significant both theoretically and practically. The successful application of a model trained exclusively on 2D image features to 3D reconstruction challenges traditional reliance on 3D supervised datasets. It opens avenues for exploring similar architectures in other domains where acquiring paired 3D data is challenging. The method’s flexibility suggests potential applications in full-body 3D modeling, provided that body landmark data can be incorporated similarly to facial landmarks.

Future work could integrate additional data, such as 3D ear morphology, or extend the approach to full bodies using a comparable ring network framework. Combining weak supervision with limited 3D supervision, where it is available, could yield even more robust models. Adapting RingNet to incorporate texture or other visual cues might further improve the quality of reconstructed faces.

Overall, this paper lays substantial groundwork for weakly supervised learning in 3D computer vision, and it challenges the assumption that expensive, cumbersome 3D supervision is necessary for accurate reconstruction.