- The paper introduces a data-efficient method for generating personalized 3D talking faces using pose and lighting normalization.
- It employs novel Texture and Vertex Prediction Networks that map audio inputs to dynamic facial animations.
- Experiments on the GRID dataset show significant improvements in landmark distance and SSIM metrics, highlighting its practical potential.
Overview of "3D Photorealistic Talking Faces from Audio (Supplementary)"
The paper presents a methodology for generating photorealistic 3D talking faces directly from audio input. The supplementary outlines the network architectures, experiments, and results that demonstrate the efficacy of the proposed approach relative to existing frameworks for audio-driven facial animation.
Network Architectures
The core innovation lies in the design of the Texture Prediction Network and the Vertex Prediction Network, which use encoder-decoder architectures to map audio signals to visual animations. The Texture Prediction Network generates detailed textures of facial regions, particularly the mouth area, by encoding the input spectrum and lighting conditions and decoding them into a coherent texture. Key network parameters, such as the latent vector length, are tuned to improve performance. Training uses mini-batch stochastic gradient descent with the Adam optimizer, with careful adjustment of the learning rate, the moment coefficients, and the weighting of the auto-encoder loss.
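To make the architecture description concrete, the following is a minimal PyTorch sketch of such a texture prediction encoder-decoder. The layer counts, filter sizes, the 27-dimensional lighting code, the 128x128 output resolution, and the optimizer settings are illustrative assumptions rather than the paper's exact configuration; only the overall structure (audio encoder producing a 256-dimensional latent, decoder conditioned on lighting, Adam training) follows the text above.

```python
import torch
import torch.nn as nn

class TexturePredictionNet(nn.Module):
    """Encoder-decoder mapping an audio spectrogram window plus a lighting
    code to a mouth-region texture map (dimensions are illustrative)."""

    def __init__(self, latent_dim=256, lighting_dim=27):
        super().__init__()
        # Audio encoder: compresses the input spectrogram into a latent vector.
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: upsamples latent + lighting code into a 128x128 RGB texture.
        self.decoder_fc = nn.Linear(latent_dim + lighting_dim, 128 * 8 * 8)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, spectrogram, lighting):
        z = self.audio_encoder(spectrogram)               # (B, latent_dim)
        h = self.decoder_fc(torch.cat([z, lighting], dim=1))
        return self.decoder(h.view(-1, 128, 8, 8))        # (B, 3, 128, 128)

# Mini-batch training with Adam; learning rate and moment coefficients
# are placeholders, not the values reported in the paper.
net = TexturePredictionNet()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4, betas=(0.9, 0.999))
```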
The Vertex Prediction Network shares a similar architecture, encoding the audio spectrogram into latent vectors that the decoder maps to vertex animation, a crucial step in articulating the facial structure changes correlated with speech.
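A corresponding sketch of the vertex prediction branch, under the same assumptions; the vertex count, layer widths, and the choice to predict per-vertex offsets are placeholders used only to illustrate the latent-to-vertex mapping.

```python
import torch
import torch.nn as nn

class VertexPredictionNet(nn.Module):
    """Encodes an audio spectrogram into a latent vector and decodes
    per-vertex 3D displacements for the face mesh (sizes are illustrative)."""

    def __init__(self, latent_dim=256, num_vertices=5000):
        super().__init__()
        self.num_vertices = num_vertices
        # Audio encoder: spectrogram -> latent vector.
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # Vertex decoder: latent vector -> flattened (x, y, z) displacements.
        self.vertex_decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, num_vertices * 3),
        )

    def forward(self, spectrogram):
        z = self.audio_encoder(spectrogram)            # (B, latent_dim)
        offsets = self.vertex_decoder(z)               # (B, V * 3)
        return offsets.view(-1, self.num_vertices, 3)  # (B, V, 3)
```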
Experiments and Results
The paper includes experimental evaluations focused on the Landmark Distance (LMD) metric as the primary performance indicator for latent vector optimization. An ablation study determines the optimal latent vector length, showing a significant reduction in error with a 256-dimensional vector. The model is further evaluated on the GRID dataset, yielding superior results in both LMD and SSIM across multiple subjects. Notably, the framework surpasses recent methodologies detailed in related works, providing evidence of its capability to generate realistic talking head animations.
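For reference, the two evaluation metrics can be computed along the following lines. This is a hedged sketch of LMD as it is commonly defined (mean Euclidean distance between predicted and ground-truth landmarks, averaged over landmarks and frames) and of SSIM via scikit-image; the paper's exact normalization of LMD may differ, and the channel_axis argument requires scikit-image 0.19 or newer.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def landmark_distance(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth 2D landmarks,
    averaged over landmarks and frames.

    pred, gt: arrays of shape (num_frames, num_landmarks, 2).
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def frame_ssim(generated, reference):
    """Structural similarity between a generated and a reference RGB frame
    (uint8 images assumed, hence data_range=255)."""
    return ssim(generated, reference, channel_axis=-1, data_range=255)
```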
Implications and Future Directions
The paper's findings suggest considerable improvements in the synthesis of animated talking faces from audio inputs, with practical implications for fields including virtual reality, telecommunications, and visual effects in media production. The architectural choices demonstrate a potential shift toward systems that accurately and efficiently convert auditory cues into visually perceptible animations.
Theoretical implications arise from directly coupling audio spectra with dynamic visual outputs through neural networks, suggesting that further research could explore enhancing model robustness across diverse languages and dialects or integrating multimodal inputs to enrich animation fidelity.
Future improvements might focus on refining the architectures for real-time applications, scaling the system to support longer sequences, or analyzing the psychosocial impacts of such realistic avatar interactions. Moreover, expanding dataset variety and leveraging unsupervised learning could drive future advancements in this area.
By advancing the capability to synthesize lifelike 3D talking faces from audio inputs, this paper delineates an important step in bridging auditory-visual AI models with tangible real-world applications.