- The paper proposes a CRAN framework that fuses image and audio inputs to effectively model temporal dependencies for smooth talking face generation.
- It employs spatial-temporal and lip-reading discriminators to enhance visual realism and lip-synchronization accuracy in generated videos.
- Extensive experiments on datasets such as TCD-TIMIT and VoxCeleb demonstrate superior PSNR, SSIM, and lip-movement accuracy compared to existing methods.
Talking Face Generation by Conditional Recurrent Adversarial Network: A Review
This paper introduces a novel framework for generating talking face videos from arbitrary face images and speech clips, using a Conditional Recurrent Adversarial Network (CRAN). The methodology advances existing approaches by addressing key issues in temporal dependency and generalizability across diverse facial inputs. The proposed approach aims to produce smooth lip synchronization and natural facial movement transitions in generated videos, a capability with significant applications in media and communications.
Methodology Overview
The paper addresses the deficiencies of prior works, which often overlooked temporal dependencies across frames or were constrained to specific individuals. The authors propose a CRAN framework that leverages both audio and image features within a recurrent neural network structure. This framework is equipped with spatial-temporal discriminators to enhance overall video realism, alongside a dedicated lip-reading discriminator to refine lip-synchronization accuracy.
The CRAN structure integrates:
- Recurrent Image-Audio Fusion: By incorporating both audio and facial image features into the recurrent unit, the proposed architecture effectively models temporal dependencies, promoting consistency and realism in the synthesized talking face sequences (a minimal sketch of this fusion step follows the list below).
- Spatial-Temporal and Lip-reading Discriminators: Two spatial-temporal discriminators ensure that the generated frames and their transitions align with natural human visual expectations. Additionally, the lip-reading discriminator specifically enhances lip movement accuracy relative to the input audio, an essential aspect of believable lip-synced video generation.
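For intuition, below is a minimal PyTorch sketch of the recurrent image-audio fusion idea. The module names, feature dimensions, and encoder/decoder layouts are illustrative assumptions rather than the authors' exact CRAN architecture; the point is simply that an identity embedding of the input face and a per-frame audio embedding are concatenated and fed into a recurrent cell whose hidden state drives frame decoding.

```python
# Minimal sketch of recurrent image-audio fusion (assumed layout, not the
# paper's exact CRAN architecture).
import torch
import torch.nn as nn

class RecurrentFusionGenerator(nn.Module):
    def __init__(self, img_dim=256, audio_dim=128, hidden_dim=256):
        super().__init__()
        # Identity encoder: embeds the reference face image once per video.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, img_dim),
        )
        # Audio encoder: embeds one audio window (e.g., MFCC features) per frame.
        self.audio_encoder = nn.Sequential(nn.Linear(13 * 20, audio_dim), nn.ReLU())
        # Recurrent cell fuses identity and per-frame audio features over time.
        self.gru = nn.GRUCell(img_dim + audio_dim, hidden_dim)
        # Decoder maps each hidden state back to a low-resolution face frame.
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, 128 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, face_image, audio_windows):
        # face_image: (B, 3, H, W); audio_windows: (B, T, 13 * 20)
        identity = self.image_encoder(face_image)                    # (B, img_dim)
        h = identity.new_zeros(identity.size(0), self.gru.hidden_size)
        frames = []
        for t in range(audio_windows.size(1)):
            audio_t = self.audio_encoder(audio_windows[:, t])        # (B, audio_dim)
            h = self.gru(torch.cat([identity, audio_t], dim=1), h)   # carry temporal state
            frames.append(self.decoder(h))                           # (B, 3, 32, 32)
        return torch.stack(frames, dim=1)                            # (B, T, 3, 32, 32)
```

In the paper's full setup, the generated sequence would then be scored by the spatial-temporal discriminators (per-frame realism and frame-to-frame transitions) and by the lip-reading discriminator, whose losses would typically be added to the generator objective during adversarial training.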
Experimental Results
The authors extensively tested their framework against existing methodologies on datasets such as TCD-TIMIT, LRW, and VoxCeleb. They report superior performance across key metrics:
- Visual Clarity: The model achieves higher PSNR and SSIM scores than state-of-the-art methods, indicating enhanced visual quality of individual frames (a short sketch of how these metrics can be computed follows this list).
- Lip Movement Accuracy: Evaluations using landmark distance and lip-reading accuracy underscore the model's superior capability in aligning lip movements with the audio input.
- Smoothness in Transitions: The proposed CRAN shows reduced frame-to-frame jitter and improved motion realism, highlighted by subjective evaluations and user studies via platforms such as Amazon Mechanical Turk.
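The sketch below shows one plausible way to compute the reported frame-level metrics (PSNR and SSIM via scikit-image) and a lip-landmark distance. It is an illustrative stand-in rather than the authors' evaluation code, and the landmark arrays are assumed to come from an external facial-landmark detector.

```python
# Illustrative metric computation (not the authors' evaluation code).
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_metrics(generated, reference):
    """PSNR and SSIM for a pair of uint8 RGB frames of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
    ssim = structural_similarity(reference, generated, channel_axis=2, data_range=255)
    return psnr, ssim

def landmark_distance(pred_landmarks, true_landmarks):
    """Mean Euclidean distance between corresponding lip landmarks,
    given as (N, 2) arrays of pixel coordinates."""
    return float(np.linalg.norm(pred_landmarks - true_landmarks, axis=1).mean())

# Example: average the per-frame scores over a generated video.
# psnrs, ssims = zip(*(frame_metrics(g, r) for g, r in zip(gen_frames, ref_frames)))
```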
Implications and Future Directions
The CRAN framework holds substantial implications for fields requiring realistic virtual avatars, such as telepresence, digital training systems, and entertainment. Its architecture adeptly balances individual frame quality with sequence continuity, which allows it to outperform previous methodologies.
However, the paper also notes limitations, such as difficulty in generating natural facial expressions and handling novel head poses, areas that could benefit from further exploration. Future work might encompass:
- End-to-End Frameworks: Developing systems that process raw audio data directly could yield higher fidelity in lip synchronization and audio-visual coherence.
- Improved Resolution Techniques: Integrating super-resolution algorithms could address quality limitations in generated video frames.
Ultimately, the work by Song et al. provides a robust foundation for advancing talking face generation technologies. By focusing on key aspects like temporal consistency and versatile audio-visual input handling, it sets a benchmark for subsequent research in synthetic facial video generation.