Speech Driven Talking Face Generation from a Single Image and an Emotion Condition (2008.03592v2)

Published 8 Aug 2020 in eess.AS, cs.CV, cs.LG, and cs.MM

Abstract: Visual emotion expression plays an important role in audiovisual speech communication. In this work, we propose a novel approach to rendering visual emotion expression in speech-driven talking face generation. Specifically, we design an end-to-end talking face generation system that takes a speech utterance, a single face image, and a categorical emotion label as input to render a talking face video synchronized with the speech and expressing the conditioned emotion. Objective evaluation on image quality, audiovisual synchronization, and visual emotion expression shows that the proposed system outperforms a state-of-the-art baseline system. Subjective evaluation of visual emotion expression and video realness also demonstrates the superiority of the proposed system. Furthermore, we conduct a human emotion recognition pilot study using generated videos with mismatched emotions among the audio and visual modalities. Results show that humans respond to the visual modality more significantly than the audio modality on this task.

Authors (3)
  1. Sefik Emre Eskimez (28 papers)
  2. You Zhang (52 papers)
  3. Zhiyao Duan (53 papers)
Citations (79)

Summary

Speech Driven Talking Face Generation from a Single Image and an Emotion Condition

This paper presents a novel approach to generating talking face videos driven by speech, with the added requirement of expressing a specified emotion. The proposed end-to-end system takes a speech utterance, a single face image, and a categorical emotion label as inputs and transforms them into a coherent audiovisual output. The paper addresses the critical role of visual emotion expression in enhancing speech communication, emphasizing its importance in contexts such as entertainment, education, and accessibility for individuals with hearing impairments.

The authors introduce several innovations within their system. A significant contribution is the system's ability to condition generation explicitly on categorical emotions, allowing for more controlled and application-specific emotional rendering. This is contrasted with existing methods that infer emotion from speech, an approach hampered by the inherent difficulty of accurate emotion prediction from audio alone. Notably, this disentanglement allows researchers and practitioners to explore complex multimodal interactions, offering new insights into human responses to audiovisual stimuli.

The system architecture incorporates multiple sub-networks, including speech, image, noise, and emotion encoders, as well as a video decoder. A core distinction of this work is the use of an "emotion discriminator," which encourages realistic and varied emotion expressions in the generated faces. The discriminator provides an additional training signal, pushing the model to produce lifelike facial movements that are both synchronized with the input speech and aligned with the desired emotional cue.
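As a rough illustration of this encoder-decoder layout, the sketch below wires together speech, image, emotion, and noise inputs and pairs the generator with an emotion classifier acting as the adversarial critic. It is a minimal PyTorch-style sketch under assumed layer sizes and module names (`TalkingFaceGenerator`, `EmotionDiscriminator`, a 32x32 output frame), not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TalkingFaceGenerator(nn.Module):
    """Illustrative encoder-decoder sketch; layer sizes are assumptions."""
    def __init__(self, n_emotions=6, feat_dim=128, noise_dim=16):
        super().__init__()
        # Speech encoder: maps a mel-spectrogram chunk to a feature vector.
        self.speech_enc = nn.GRU(input_size=80, hidden_size=feat_dim, batch_first=True)
        # Image encoder: maps the single identity face image to a feature vector.
        self.image_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Emotion encoder: embeds the categorical emotion label.
        self.emotion_enc = nn.Embedding(n_emotions, feat_dim)
        # Video decoder: fuses the conditioning vectors and upsamples to a frame.
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim * 3 + noise_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),
        )
        self.noise_dim = noise_dim

    def forward(self, mel, face, emotion_label):
        # mel: (B, T, 80) audio features; face: (B, 3, H, W); emotion_label: (B,)
        _, h = self.speech_enc(mel)               # h: (1, B, feat_dim)
        s = h.squeeze(0)
        i = self.image_enc(face)
        e = self.emotion_enc(emotion_label)
        z = torch.randn(s.size(0), self.noise_dim, device=s.device)  # noise input
        frame = self.decoder(torch.cat([s, i, e, z], dim=1))  # one 32x32 frame per chunk
        return frame

class EmotionDiscriminator(nn.Module):
    """Classifies the emotion expressed in generated frames (adversarial signal)."""
    def __init__(self, n_emotions=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, n_emotions),
        )

    def forward(self, frames):
        return self.net(frames)  # emotion logits used in the adversarial loss

# Example usage with random tensors (batch of 2, 40 audio frames, 32x32 face image).
gen = TalkingFaceGenerator()
disc = EmotionDiscriminator()
frames = gen(torch.randn(2, 40, 80), torch.randn(2, 3, 32, 32), torch.tensor([0, 3]))
logits = disc(frames)  # (2, n_emotions)
```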

Objective evaluations demonstrate that the proposed system outperforms established baselines in terms of image quality and audiovisual synchronization, as measured by metrics such as PSNR, SSIM, and NLMD. Additionally, a series of subjective evaluations conducted on Amazon Mechanical Turk underscores the system's superior performance in both emotional expressiveness and perceived video realness compared to the baseline systems.
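For reference, PSNR and SSIM between a generated frame and its ground-truth counterpart can be computed with scikit-image; this is a generic illustration with random placeholder images, not the authors' evaluation code.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Toy stand-ins for a ground-truth frame and a generated frame, (H, W, 3) in [0, 255].
reference = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
generated = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)

psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
ssim = structural_similarity(reference, generated, channel_axis=-1, data_range=255)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}")
```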

The implications of this research extend into enriching human-computer interaction capabilities and expanding accessibility solutions for hearing-impaired populations by synthesizing video content from audio-only media. Furthermore, the system's modular design provides ample flexibility for incorporating richer emotional states or extending to more complex animations.

A thought-provoking aspect of the paper is the result of a human emotion recognition pilot study, which used generated videos with mismatched emotions between the audio and visual modalities and revealed a pronounced reliance on the visual modality over the audio modality for emotion perception. This insight into human multimodal processing suggests further human-centric applications of AI, especially in emotionally adaptive interfaces.

Future research directions could involve integrating more nuanced emotional descriptors, such as those from dimensional emotion models, or extending the framework to handle dynamic backgrounds and varying illumination conditions. Moreover, advancing the model's generalizability to unseen or less-structured datasets opens avenues for broader applications across diverse real-world scenarios. The findings also set the stage for exploring more advanced applications, such as augmented and virtual reality, where realistic avatar presentation will play a critical role in immersive user experiences.