
Face2Face: Real-time Face Capture and Reenactment of RGB Videos (2007.14808v1)

Published 29 Jul 2020 in cs.CV

Abstract: We present Face2Face, a novel approach for real-time facial reenactment of a monocular target video sequence (e.g., Youtube video). The source sequence is also a monocular video stream, captured live with a commodity webcam. Our goal is to animate the facial expressions of the target video by a source actor and re-render the manipulated output video in a photo-realistic fashion. To this end, we first address the under-constrained problem of facial identity recovery from monocular video by non-rigid model-based bundling. At run time, we track facial expressions of both source and target video using a dense photometric consistency measure. Reenactment is then achieved by fast and efficient deformation transfer between source and target. The mouth interior that best matches the re-targeted expression is retrieved from the target sequence and warped to produce an accurate fit. Finally, we convincingly re-render the synthesized target face on top of the corresponding video stream such that it seamlessly blends with the real-world illumination. We demonstrate our method in a live setup, where Youtube videos are reenacted in real time.

Authors (5)
  1. Justus Thies (62 papers)
  2. Michael Zollhöfer (51 papers)
  3. Marc Stamminger (31 papers)
  4. Christian Theobalt (251 papers)
  5. Matthias Nießner (177 papers)
Citations (1,812)

Summary

  • The paper introduces a real-time facial reenactment system that transfers live expressions from a source actor to a target video using a novel low-dimensional expression transfer function.
  • It employs non-rigid model-based bundling and dense photometric tracking to accurately capture target facial geometry and blend expressions with high temporal consistency.
  • The method achieves approximately 27.6 Hz on commodity hardware, enabling robust applications in VR/AR, teleconferencing, and film dubbing.

Face2Face: Real-time Face Capture and Reenactment of RGB Videos

The paper presents a novel approach to real-time facial reenactment that uses only monocular RGB video streams. Specifically, it describes a framework in which the facial expressions of a live source actor are used to manipulate and re-render the expressions of a target video, typically a pre-recorded clip such as those found on platforms like YouTube. The work is significant in that it achieves realistic and temporally consistent facial reenactment in real time on commodity hardware.

Methodology

The approach begins with facial identity recovery from the target video using non-rigid model-based bundling, which addresses the under-constrained nature of monocular video. This preprocessing step accurately captures the identity and facial geometry of the target actor. At run time, the facial expressions of both the source and target actors are tracked with a dense photometric consistency measure. Deformation transfer, which maps expressions from the source onto the target, is achieved via a novel transfer function that operates in a low-dimensional expression space, significantly improving computational efficiency.
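To make the low-dimensional transfer step concrete, the sketch below assumes the transfer can be posed as a small linear least-squares solve over 76-dimensional expression coefficients, which is the spirit of the paper's sub-space deformation transfer; the matrix names and shapes are illustrative placeholders, not the paper's notation.

```python
import numpy as np

def transfer_expression(delta_source, A_source, B_target):
    """Map source expression coefficients to target expression coefficients.

    Illustrative only: we assume deformation transfer has been precomputed so
    that, at run time, it reduces to a small linear least-squares solve in the
    76-dimensional expression space. A_source and B_target are hypothetical
    precomputed matrices relating expression coefficients to deformations.
    """
    # Deformation induced by the source expression (stand-in quantity).
    rhs = A_source @ delta_source
    # Target expression coefficients that best reproduce that deformation.
    delta_target, *_ = np.linalg.lstsq(B_target, rhs, rcond=None)
    return delta_target

# Toy usage with random stand-in data.
rng = np.random.default_rng(0)
delta_src = rng.normal(size=76)    # source expression coefficients
A = rng.normal(size=(300, 76))     # hypothetical source deformation map
B = rng.normal(size=(300, 76))     # hypothetical target deformation map
delta_tgt = transfer_expression(delta_src, A, B)
print(delta_tgt.shape)             # (76,)
```

Because the solve involves only a 76-dimensional unknown rather than per-vertex deformations, it stays cheap enough for real-time use, which is the key point of operating in the expression subspace.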

To synthesize an accurate mouth interior, the method retrieves and warps mouth regions from the target sequence itself, ensuring that the output remains photorealistic and consistent with the target face's appearance. The final synthesized face is then blended into the target video under the estimated environmental illumination so that it matches the real-world lighting.
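A minimal sketch of the retrieval idea follows. It assumes mouth frames are selected by nearest-neighbor search over the stored target expressions, with a simple penalty that discourages jumping far from the previously chosen frame; the scoring and the `jump_penalty` weight are illustrative simplifications of the paper's retrieval strategy, not its actual implementation.

```python
import numpy as np

def retrieve_mouth_frame(stored_exprs, query_expr, prev_idx=None, jump_penalty=0.05):
    """Return the index of the stored target frame whose mouth best matches.

    Simplified stand-in: frames are scored by Euclidean distance between
    their expression coefficients and the re-targeted expression, plus a
    temporal-coherence penalty proportional to how far the candidate is
    from the previously selected frame.
    """
    dists = np.linalg.norm(stored_exprs - query_expr, axis=1)
    if prev_idx is not None:
        frame_ids = np.arange(len(stored_exprs))
        dists = dists + jump_penalty * np.abs(frame_ids - prev_idx)
    return int(np.argmin(dists))

# Toy usage: 500 stored target frames, 76-dimensional expression vectors.
rng = np.random.default_rng(1)
stored = rng.normal(size=(500, 76))
query = rng.normal(size=76)
best = retrieve_mouth_frame(stored, query, prev_idx=120)
```

Reusing mouth pixels from the target sequence, rather than rendering teeth and tongue from a model, is what keeps the interior photorealistic; the temporal penalty stands in for the coherence constraints a real system would need.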

Results

The paper provides extensive results showcasing the reenactment of various YouTube videos in real time, with the full pipeline running at approximately 27.6 Hz. Because the system operates on RGB video alone, it avoids the depth sensors required by previous methods. Moreover, comparisons with state-of-the-art techniques show that the approach achieves comparable or superior tracking accuracy and reenactment quality.

Implications

The proposed method has broad implications for numerous applications. In VR and AR, this technology can enable more interactive and realistic virtual environments or avatars. In teleconferencing, it can be used to adapt video feeds to match the facial movements of a translator, enhancing communication in multilingual contexts. Additionally, this system opens new possibilities for dubbing films or videos, where actors can be made to convincingly "speak" in another language, aligning their lip movements with a foreign language track.

Future Developments

Future work might focus on addressing the limitations imposed by the Lambertian surface assumption and extending the scene lighting model to handle more complex lighting variations, including hard shadows and specular highlights. There is also potential to capture fine-scale transient surface details by enhancing the expression model beyond the current 76-dimensional space. Handling scenes with partial occlusions due to hair or beards could further increase the robustness of the reenactment system. Finally, exploring hardware acceleration to reduce the latency introduced by the webcam and the computational pipeline could refine the real-time interactive capabilities.
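For context, the Lambertian assumption above refers to the illumination model used during tracking: incident light is approximated by a low-order spherical harmonics (SH) expansion and surfaces are treated as purely diffuse. A hedged sketch of such a model (the symbols below are illustrative, not the paper's exact notation) is:

```latex
% Per-vertex appearance under distant, smooth illumination and a Lambertian
% (purely diffuse) surface: albedo times an SH expansion of the irradiance,
% with basis functions H_b and per-frame lighting coefficients gamma_b.
\[
  C(\mathbf{n}, \mathbf{a}, \boldsymbol{\gamma})
    \;=\; \mathbf{a} \cdot \sum_{b=1}^{B^2} \gamma_b \, H_b(\mathbf{n}),
  \qquad B = 3,
\]
% where n is the surface normal and a the diffuse albedo. Hard shadows and
% specular highlights cannot be represented in this form, which is why the
% lighting model is singled out as a target for future extensions.
```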

In conclusion, this paper lays a solid foundation for real-time facial reenactment using monocular RGB data alone, presenting a significant advancement in the domain of real-time video manipulation and synthesis. The implications for practical applications in various fields point to an exciting horizon of further research and development.