- The paper presents a real-time facial segmentation and performance capture method from RGB input using a CNN-based two-stage pipeline integrating segmentation and tracking.
- The methodology employs a two-stream deconvolution network for segmentation and a Displaced Dynamic Expression framework for tracking, utilizing augmented training data to improve robustness against occlusions.
- Evaluations show superior segmentation performance compared to state-of-the-art methods, and this significantly improves tracking robustness for real-time applications such as VR/AR and mobile interfaces.
The paper presents a novel approach for real-time facial performance capture from RGB input, built on convolutional neural networks (CNNs) for semantic facial segmentation. The framework integrates segmentation and tracking so that facial features are detected with high fidelity and tracked robustly, even under occlusion. Its key contributions and their evaluation are summarized below.
Methodology
The authors propose a two-stage processing pipeline: facial segmentation followed by facial tracking. The segmentation stage delineates visible facial regions from the rest of an RGB frame using a deep architecture designed for both efficiency and accuracy: a two-stream deconvolution network in which an FCN-style stream and a DeconvNet-style stream share a lower convolutional network. The FCN stream captures the overall face region while the DeconvNet stream recovers fine detail, and the two outputs are fused into a pixel-level face mask.
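To make the architecture concrete, the sketch below shows the shared-encoder, two-stream-decoder idea in PyTorch. The layer sizes, depths, and the simple additive fusion are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TwoStreamSegNet(nn.Module):
    """Illustrative shared-encoder / two-stream-decoder segmentation net.
    Channel counts and depths are placeholders, not the paper's exact layers."""
    def __init__(self):
        super().__init__()
        # Shared lower convolutional network (encoder).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        # FCN-style stream: coarse per-pixel scores, upsampled bilinearly.
        self.fcn_head = nn.Conv2d(128, 2, 1)
        # DeconvNet-style stream: learned deconvolutions recover fine detail.
        self.deconv_head = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 2, 2, stride=2),
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = self.encoder(x)
        coarse = nn.functional.interpolate(
            self.fcn_head(feats), size=(h, w), mode="bilinear", align_corners=False)
        fine = self.deconv_head(feats)
        # Fuse the two streams into a single face/background score map.
        return coarse + fine

mask_logits = TwoStreamSegNet()(torch.randn(1, 3, 128, 128))  # (1, 2, 128, 128)
```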
The paper emphasizes the importance of training data augmented with synthetically generated facial occlusions, such as random rectangles covering parts of the face and hand images superimposed onto facial areas, to improve robustness against unpredictable occlusions. Training draws on datasets such as LFW and FaceWarehouse, expanded with the augmented samples to mitigate overfitting.
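As a rough illustration of this augmentation, the sketch below pastes a random rectangle over the face and relabels the covered pixels as non-face; pasting resized hand crops follows the same pattern. The patch-size ranges and flat-colour fill are assumptions made for the example, not values from the paper.

```python
import numpy as np

def occlude(image, mask, rng=np.random.default_rng()):
    """Paste a random rectangle onto the image and relabel those pixels as
    non-face. image: (H, W, 3) uint8; mask: (H, W) with 1 = face, 0 = non-face."""
    h, w = mask.shape
    oh = rng.integers(h // 8, h // 3)   # occluder height (illustrative range)
    ow = rng.integers(w // 8, w // 3)   # occluder width (illustrative range)
    top = rng.integers(0, h - oh)
    left = rng.integers(0, w - ow)
    img_aug, mask_aug = image.copy(), mask.copy()
    # A hand crop resized to (oh, ow) could be pasted here instead of a flat colour.
    img_aug[top:top + oh, left:left + ow] = rng.integers(0, 256, size=3, dtype=np.uint8)
    mask_aug[top:top + oh, left:left + ow] = 0
    return img_aug, mask_aug
```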
For tracking, the paper employs the Displaced Dynamic Expression (DDE) framework, which regresses head pose and facial expression parameters directly from the video frame, and augments it with the segmented facial data. The segmentation mask excludes non-facial pixels, simplifying the regression by removing variability introduced by background clutter and occluders.
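A minimal sketch of how a segmentation mask can gate the tracker's input is shown below. This is one simple way to use the mask, assuming the regressor receives a frame with non-face pixels suppressed; the fill value and function interface are illustrative, not the paper's exact design.

```python
import numpy as np

def apply_face_mask(frame, face_mask, fill=0):
    """Suppress pixels the segmentation labels as non-face before handing the
    frame to the tracker's regressor, so background clutter and occluders
    cannot influence the regressed pose/expression parameters.
    frame: (H, W, 3) uint8; face_mask: (H, W) boolean, True = face."""
    out = frame.copy()
    out[~face_mask] = fill
    return out
```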
Results and Implications
The proposed method is evaluated against state-of-the-art segmentation techniques, including structured forests and segmentation-aware part models, and demonstrates superior performance in most scenarios on metrics such as intersection over union (IoU) and pixel-wise accuracy.
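For reference, both metrics can be computed for binary masks as in this short sketch; the function name and boolean-mask interface are assumptions for illustration.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """IoU and pixel-wise accuracy for binary masks.
    pred, gt: (H, W) boolean arrays where True marks face pixels."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union else 1.0
    accuracy = (pred == gt).mean()
    return iou, accuracy
```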
Segmentation significantly improves tracking performance by removing occluded and background pixels before regression, making the tracker robust to occlusions and high facial variability. Practical applications are most relevant where real-time facial animation and manipulation are required, such as VR/AR and mobile interfaces, where low latency and robust tracking are critical.
Future Directions
The paper lays the groundwork for extending segmentation to other regions of interest, such as hands and bodies, and for integrating temporal information to stabilize segmentation across consecutive frames. These improvements could support more complex compositing tasks and other real-time applications.
In conclusion, the paper makes a substantial contribution by introducing an efficient and robust framework for real-time facial segmentation and tracking. It directly addresses the challenges posed by occlusions and unconstrained environments, and it underscores the potential of CNNs to improve performance in digital interfaces and interactive systems.