- The paper introduces SadTalker, which generates 3D motion coefficients for accurate and stylized facial and head motion from audio.
- The authors employ novel networks (ExpNet and PoseVAE) to learn realistic expressions and diverse head poses using audio and 3D information.
- Extensive experiments show that SadTalker achieves state-of-the-art motion synchronicity and quality while effectively preserving identity.
Overview of "SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation"
The paper presents SadTalker, a novel approach for generating realistic talking-head videos from a single static image and an audio clip. The work targets the main failure modes of prior methods, namely unnatural head movement, distorted facial expressions, and changes to the subject's identity. The authors argue that these problems stem from learning over coupled 2D motion fields, while existing uses of explicit 3D information tend to produce stiff expressions and incoherent video.
Core Contributions
- Introduction of SadTalker: The system generates the 3D motion coefficients of a 3DMM (head pose and expression) from audio and uses them to modulate a 3D-aware face render, enabling realistic talking-head generation.
- Expression and Pose Networks: ExpNet learns accurate facial expressions from audio by distilling knowledge from both the coefficients and the corresponding 3D-rendered faces, while PoseVAE, a conditional variational autoencoder (VAE), synthesizes diverse, stylized head motion.
- 3D-aware Face Rendering: The generated 3D motion coefficients are mapped to the unsupervised 3D keypoint space of a novel face render, which synthesizes the final video with realistic expressions and head motion (a high-level pipeline sketch follows this list).
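Taken together, these pieces form a two-stage pipeline: coefficient generation from audio, followed by 3D-aware rendering. The sketch below outlines that flow in Python; it mirrors the paper's terminology, but the callables (`recon_3dmm`, `exp_net`, `pose_vae`, `mapping_net`, `face_render`), their signatures, and the tensor shapes are illustrative assumptions, not the authors' actual interface.

```python
# Illustrative sketch of the SadTalker inference flow (assumed interfaces,
# not the authors' API). All callables are passed in as hypothetical modules.
import torch

def generate_talking_head(source_image, audio_features,
                          recon_3dmm, exp_net, pose_vae,
                          mapping_net, face_render):
    """Conceptual single-image, audio-driven talking-head pipeline."""
    # 1. Recover identity/expression/pose 3DMM coefficients from the single
    #    source image (e.g. with an off-the-shelf 3D face reconstructor).
    coeffs_0 = recon_3dmm(source_image)            # dict: identity, exp_0, pose_0

    # 2. Predict a per-frame expression-coefficient sequence from audio (ExpNet).
    exp_seq = exp_net(audio_features, coeffs_0["exp_0"])        # e.g. [T, 64]

    # 3. Sample a stylized head-pose sequence from audio (PoseVAE).
    pose_seq = pose_vae.sample(audio_features, coeffs_0["pose_0"])  # e.g. [T, 6]

    # 4. Map the explicit 3DMM coefficients into the render's unsupervised
    #    3D keypoint space and synthesize each frame (face-vid2vid-style render).
    frames = [face_render(source_image, mapping_net(exp_t, pose_t))
              for exp_t, pose_t in zip(exp_seq, pose_seq)]
    return torch.stack(frames)                                   # [T, 3, H, W]
```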
Methodology
The approach comprises three key components:
- ExpNet: Tailored to capture realistic facial expressions, this network learns expression coefficients from audio, using lip-motion-only coefficients as its target together with additional landmark and perceptual losses on the re-rendered faces to improve synthesis accuracy.
- PoseVAE: This conditional VAE generates stylized, diverse head motion by learning the residual between the first-frame pose and the poses of subsequent frames, conditioned on the audio and a style identity (a conditional-VAE sketch follows this list).
- 3D-aware Rendering: Inspired by face-vid2vid, SadTalker learns a mapping from the explicit 3DMM coefficients to the render's 3D keypoint domain, through which the coefficients drive the final video synthesis.
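To make the PoseVAE idea concrete, the following is a minimal conditional-VAE sketch in PyTorch that learns a pose residual relative to the first frame, conditioned on audio features and a style identity. The class name, layer sizes, and the 6-dimensional pose parameterization (3 rotation + 3 translation) are assumptions chosen for clarity, not the authors' implementation, and the sketch treats each timestep independently for simplicity.

```python
import torch
import torch.nn as nn

class PoseVAESketch(nn.Module):
    """Minimal conditional VAE over head-pose residuals (illustrative only).

    Models delta = pose_t - pose_0, conditioned on audio features and a
    style-identity embedding, so diverse, stylized head motion can be
    sampled at inference time.
    """
    def __init__(self, pose_dim=6, audio_dim=512, num_styles=32,
                 style_dim=32, latent_dim=64, hidden=256):
        super().__init__()
        self.latent_dim = latent_dim
        self.style_emb = nn.Embedding(num_styles, style_dim)
        cond_dim = audio_dim + style_dim + pose_dim    # audio + style + first-frame pose
        self.encoder = nn.Sequential(
            nn.Linear(pose_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))         # -> (mu, logvar)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim))               # -> pose residual

    def _condition(self, audio_feat, style_id, pose_0):
        return torch.cat([audio_feat, self.style_emb(style_id), pose_0], dim=-1)

    def forward(self, pose_t, audio_feat, style_id, pose_0):
        # Encode the residual w.r.t. the first-frame pose, conditioned on audio/style.
        delta = pose_t - pose_0
        cond = self._condition(audio_feat, style_id, pose_0)
        mu, logvar = self.encoder(torch.cat([delta, cond], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        recon = pose_0 + self.decoder(torch.cat([z, cond], dim=-1))
        return recon, mu, logvar

    @torch.no_grad()
    def sample(self, audio_feat, style_id, pose_0):
        # Draw a latent sample and add the decoded residual back to the first-frame pose.
        cond = self._condition(audio_feat, style_id, pose_0)
        z = torch.randn(audio_feat.shape[0], self.latent_dim, device=audio_feat.device)
        return pose_0 + self.decoder(torch.cat([z, cond], dim=-1))
```

Training such a model would pair a reconstruction loss on the decoded pose with the standard KL term on (mu, logvar); at inference, sampling different latents or style identities yields varied head motion for the same audio, which is the behavior the paper attributes to PoseVAE.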
Results and Implications
Extensive experiments on benchmark datasets demonstrate SadTalker's superior video and motion quality compared with existing methods. The system achieves state-of-the-art results on motion-synchronicity and video-quality metrics, and performs notably well at identity preservation, a common challenge in this domain.
Implications for AI and Future Directions
This research underscores the importance of disentangled motion representation in enhancing the realism and quality of synthetic videos. By advancing methods that derive and integrate 3D motion information from audio, it opens new pathways for digital human creation and richer virtual interactions. Future work might extend these methods to more complex dynamic scenarios, such as incorporating emotional expressiveness and further reducing artifacts under challenging visual conditions.
In conclusion, SadTalker presents a robust framework for stylized audio-driven talking head generation. This method enhances our capability to produce high-fidelity, naturalistic animations from minimal visual inputs, marking an important step forward in the field of AI-driven media synthesis.