Talking Face Generation by Adversarially Disentangled Audio-Visual Representation
The paper by Hang Zhou et al. addresses talking face generation: synthesizing face image sequences whose lip motion is aligned with a given speech clip. The task is difficult because face appearance and speech content are entangled in the facial movements observed when a person speaks, and the two must be separated to animate arbitrary subjects. The authors propose a method that jointly associates and disentangles these factors, enabling talking faces to be generated for arbitrary identities from either audio or video input.
Methodology and Contributions
The authors introduce a framework called the Disentangled Audio-Visual System (DAVS), which learns to animate arbitrary talking subjects by separating person-identity information from speech content. This separation is achieved through an associative-and-adversarial training process: audio-visual association grounds the speech-content space, while adversarial training strips speech-related cues from the identity representation.
- Joint Audio-Visual Embedding: The method first establishes a shared latent space for the speech content carried by audio and by video. This is achieved through an audio-visual synchronization task in which audio and video features corresponding to the same word are embedded close together, using shared classifiers and a contrastive loss to tie both modalities to the same semantic labels.
- Adversarial Learning for Disentanglement: Using adversarial training, the framework factorizes a talking face sequence into two representations, one for identity and one for speech content. The identity representation is trained to carry no speech information: the network is optimized to fool a speech (word) classifier applied to the identity features.
- End-to-End Talking Face Generation: The disentangled representations allow DAVS to generate high-quality, temporally coherent talking face sequences from either audio or video input; the decoder synthesizes each frame from the combination of the identity and speech-content codes. A minimal sketch of these three components follows this list.
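To make these components concrete, the sketch below shows one way the pieces could fit together in PyTorch: a contrastive loss that aligns word-level audio and visual embeddings, a classifier-confusion loss that adversarially removes speech content from identity features, and a decoder that combines the two codes. The module architectures, names (AudioEncoder, VisualEncoder, Decoder), dimensions, margin, and loss forms are illustrative assumptions and simplifications, not the authors' released implementation.

```python
# Minimal sketch of the three DAVS components described above.
# All architectures, dimensions, and loss definitions are illustrative
# assumptions; they are not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB = 256  # assumed size of every latent code


class AudioEncoder(nn.Module):
    """Maps a word-level audio clip (e.g. flattened MFCC frames) to a speech-content code."""
    def __init__(self, in_dim=13 * 20):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, EMB))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)


class VisualEncoder(nn.Module):
    """Maps a mouth-region frame sequence to a speech-content code."""
    def __init__(self, in_dim=3 * 64 * 64):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, EMB))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)


def contrastive_loss(audio_emb, visual_emb, same_word, margin=0.5):
    """Pull matched audio/visual pairs together, push mismatched pairs apart.

    `same_word` is a float tensor of 1s (matching word) and 0s (different word).
    """
    dist = 1.0 - F.cosine_similarity(audio_emb, visual_emb, dim=-1)
    pos = same_word * dist.pow(2)
    neg = (1.0 - same_word) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()


def confusion_loss(word_logits):
    """Adversarial term for the identity branch: push the word classifier's
    predictions on identity features toward a uniform distribution, i.e.
    cross-entropy against the uniform target (one common way to implement
    classifier confusion)."""
    log_probs = F.log_softmax(word_logits, dim=-1)
    return -log_probs.mean()


class Decoder(nn.Module):
    """Synthesizes a frame from the concatenated identity and speech-content codes."""
    def __init__(self, out_dim=3 * 64 * 64):
        super().__init__()
        self.net = nn.Linear(2 * EMB, out_dim)

    def forward(self, identity_code, speech_code):
        z = torch.cat([identity_code, speech_code], dim=-1)
        return torch.sigmoid(self.net(z)).view(-1, 3, 64, 64)
```

Because the speech-content code lives in the shared space, it can come from either encoder at generation time, which is what lets the same decoder be driven by audio or by video.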
Results and Implications
Quantitative assessments indicate that faces generated by the proposed system exhibit clearer lip synchronization and speech-related movements than those of previous methods. The authors also report that a temporal GAN loss improves the perceptual quality of the generated sequences, reflected in higher PSNR and SSIM scores. These improvements have practical implications for applications where synthetic talking faces might be deployed, such as virtual avatars, dubbing, and personalized AI communication agents.
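For reference, the snippet below shows how PSNR and SSIM are typically computed between a generated frame and its ground truth. The direct PSNR formula is standard; the SSIM call relies on scikit-image (the `channel_axis` keyword assumes version 0.19 or newer), and the 8-bit RGB data range is an assumption rather than a detail from the paper.

```python
# Standard per-frame PSNR/SSIM evaluation; frames are assumed to be HxWx3 uint8 arrays.
import numpy as np
from skimage.metrics import structural_similarity  # scikit-image >= 0.19 assumed


def psnr(reference, generated, data_range=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((reference.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)


def evaluate_frame(reference, generated):
    """Return (PSNR, SSIM) for a pair of 8-bit RGB frames."""
    ssim = structural_similarity(reference, generated, data_range=255, channel_axis=-1)
    return psnr(reference, generated), ssim
```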
The paper also provides a detailed analysis of audio-visual speech recognition performance, consistent with the authors' hypothesis that learning a coherent joint space assists lip reading. The marked increase in recognition accuracy attributable to the joint embedding supports the claim that the shared space reliably captures speech content.
Future Directions
The advancements described in this paper pave the way for several future directions in AI-driven visual communication. In particular, further work could address subjects with large variations in expression or head pose. The approach could also be extended to accessibility and security tools, for example applications that aid the hearing impaired by converting audio speech into visible, dynamic lip movements.
In summary, Hang Zhou and colleagues provide a methodologically rigorous and practically significant contribution to talking face generation. By disentangling audio-visual representations via adversarial techniques, they significantly enhance the fidelity and applicability of generated facial sequences, enriching the landscape of interactive and communicative AI technologies.