Audio-Driven Emotional Video Portraits: An Overview
The paper "Audio-Driven Emotional Video Portraits" presents an innovative approach to synthesizing high-quality video portraits that incorporate emotional dynamics extracted from audio signals. While prior work in audio-driven talking head generation primarily focused on synchronizing speech content with lip movements, this research addresses the often-overlooked component of facial emotion, a fundamental aspect of human expressiveness.
Key Techniques and Contributions
This paper introduces Emotional Video Portraits (EVP), a system that synthesizes realistic video portraits driven by audio and supports dynamic manipulation of emotion. Two primary innovations underpin the system:
- Cross-Reconstructed Emotion Disentanglement:
- This technique decouples the emotion and the speech content carried by an audio signal into two distinct representations: the disentangled emotion space is duration-independent, whereas the content space is duration-dependent.
- The method uses Dynamic Time Warping (DTW) to align audio samples of different durations, creating the pseudo training pairs needed for cross-reconstruction. Requiring the network to rebuild each sample from one clip's content code and the other clip's emotion code forces an accurate emotional representation to be extracted from the audio (a sketch of this step follows after this list).
- Target-Adaptive Face Synthesis:
- To ensure the synthesized video portraits match the head poses and movements of the target video, the paper proposes a target-adaptive face-synthesis mechanism. 3D-aware keypoint alignment re-poses the predicted facial landmarks in 3D so that they agree with the target frame's head pose before being projected back to the image plane.
- An Edge-to-Video translation network then renders the aligned emotional landmarks, together with edge maps extracted from the target video, into high-fidelity video portraits (see the alignment sketch after this list).
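The following is a minimal sketch of the disentanglement idea described above, written from the paper's description rather than its released code. The 13-dimensional MFCC-like frame features, the GRU encoders, the feature sizes, and the L1 reconstruction loss are illustrative assumptions, not the paper's actual architecture.

```python
# Sketch only: DTW pseudo-pair construction plus a cross-reconstruction objective.
import numpy as np
import torch
import torch.nn as nn

def dtw_align(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Warp sequence b (Tb, D) onto the time axis of a (Ta, D) with plain DTW,
    so two readings of the same sentence form an equal-length pseudo pair."""
    Ta, Tb = len(a), len(b)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack: pick one frame of b for every frame of a.
    i, j, match = Ta, Tb, {}
    while i > 0 and j > 0:
        match[i - 1] = j - 1
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return b[[match.get(t, 0) for t in range(Ta)]]

class Disentangler(nn.Module):
    """Per-frame content code plus a single utterance-level emotion code."""
    def __init__(self, feat_dim=13, d_content=128, d_emotion=64):
        super().__init__()
        self.content_enc = nn.GRU(feat_dim, d_content, batch_first=True)
        self.emotion_enc = nn.GRU(feat_dim, d_emotion, batch_first=True)
        self.decoder = nn.Linear(d_content + d_emotion, feat_dim)

    def forward(self, x_content, x_emotion):
        c, _ = self.content_enc(x_content)               # (B, T, d_content)
        _, e = self.emotion_enc(x_emotion)               # (1, B, d_emotion)
        e = e[-1].unsqueeze(1).expand(-1, c.size(1), -1) # broadcast over time
        return self.decoder(torch.cat([c, e], dim=-1))   # (B, T, feat_dim)

def cross_reconstruction_loss(model, x_neutral, x_happy):
    """x_neutral / x_happy: (1, T, feat) pseudo pair, same sentence, two emotions.
    Swapping the emotion code must still reconstruct the matching target."""
    return nn.functional.l1_loss(model(x_neutral, x_happy), x_happy) + \
           nn.functional.l1_loss(model(x_happy, x_neutral), x_neutral)
```

dtw_align returns NumPy arrays; a pseudo pair would be wrapped with torch.from_numpy(...).float().unsqueeze(0) before computing the loss. The key property is that the emotion code is a single utterance-level vector, so it cannot carry time-varying content.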
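Below is a similarly hedged sketch of the 3D-aware alignment step: predicted 3D landmarks are rigidly re-posed with the target frame's rotation and translation, then projected back to the image plane. The pinhole camera model, focal length, image center, and the assumption that the pose comes from a fitted 3D face model are placeholders for illustration.

```python
# Sketch only: re-pose canonical 3D landmarks to the target frame and project to 2D.
import numpy as np

def align_landmarks(canon_lm3d: np.ndarray,   # (68, 3) predicted landmarks, canonical pose
                    R: np.ndarray,            # (3, 3) target-frame head rotation
                    t: np.ndarray,            # (3,)   target-frame head translation
                    focal: float = 1000.0,
                    center: tuple = (256.0, 256.0)) -> np.ndarray:
    """Return (68, 2) landmarks aligned with the target frame's head pose."""
    posed = canon_lm3d @ R.T + t                         # rigid transform into target pose
    x = focal * posed[:, 0] / posed[:, 2] + center[0]    # pinhole projection
    y = focal * posed[:, 1] / posed[:, 2] + center[1]
    return np.stack([x, y], axis=1)

# Example: identity pose, landmarks roughly five units in front of the camera.
lm2d = align_landmarks(np.random.randn(68, 3) * 0.1 + np.array([0.0, 0.0, 5.0]),
                       np.eye(3), np.zeros(3))
```

The resulting 2D landmarks are rasterised into an edge map and passed, together with edges from the target frame, to the Edge-to-Video network.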
Experimental Evaluation and Results
The EVP system is validated extensively, both qualitatively and quantitatively, and shows superior performance over leading methods. Cross-Reconstructed Emotion Disentanglement yields clearer control over facial emotion and more accurate audio-visual synchronization. Landmark Distance (LD) and Landmark Velocity Difference (LVD) quantify how closely the synthesized facial landmarks track the ground truth in position and in motion (a sketch of both metrics follows), and user studies further corroborate the emotional accuracy and realism of videos synthesized with EVP.
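For reference, the two landmark metrics can be computed roughly as follows, assuming predicted and ground-truth landmark tracks of shape (T, 68, 2) in pixels; the paper's exact normalization may differ.

```python
# Sketch only: common definitions of the LD and LVD landmark metrics.
import numpy as np

def landmark_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """LD: mean Euclidean distance between corresponding landmarks."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def landmark_velocity_difference(pred: np.ndarray, gt: np.ndarray) -> float:
    """LVD: mean Euclidean distance between frame-to-frame landmark velocities."""
    v_pred = np.diff(pred, axis=0)
    v_gt = np.diff(gt, axis=0)
    return float(np.linalg.norm(v_pred - v_gt, axis=-1).mean())
```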
Implications and Future Directions
The findings from this research have clear implications for multimedia applications, including filmmaking, telepresence, and digital human animation. Emotionally dynamic faces, accurately driven by audio, enrich the realism and expressivity of generated video content, and the paper points toward further integration of emotion into digital avatars and virtual assistants.
Looking forward, the ability to manipulate emotional features in a continuous latent space opens new avenues for personalized content generation and interactive systems with richer emotional depth. Further research could draw on audio emotion data from diverse cultural settings to improve generalizability, and applying the technology to interactive storytelling or virtual reality could make those experiences more engaging and empathetic.
The work in this paper marks a significant advance in talking-head synthesis, moving beyond lip synchronization toward emotionally expressive video portraiture.