- The paper introduces a CNN-based encoder-decoder model that combines MFCC audio features with VGG-M face embeddings to produce realistic lip-synced videos.
- The approach bypasses classical phoneme-to-viseme mappings by directly learning audio-video correspondences, enabling synthesis for unseen voices and faces.
- Experiments show improved video fidelity when multiple still images of the target identity are provided, supporting applications such as visual re-dubbing and immersive VR/AR experiences.
An Overview of "You said that?" by Chung, Jamaludin, and Zisserman
In "You said that?", the authors explore a novel approach to creating a video of a speaking face synchronized with an audio speech input. This method is centered on a convolutional neural network (CNN) based encoder-decoder model that integrates both audio and visual data to generate realistic lip-synced talking face videos. The system accepts two types of input: static images of a target face and an audio segment, delivering an output in real-time that can accommodate previously unencountered faces and audio inputs during training.
Methodology
At the core of this framework is a joint embedding space that fuses audio features with the visual identity of a face. The encoder-decoder CNN configuration is pivotal: the audio encoder operates on MFCC coefficients extracted from the speech segment, while the identity encoder is a VGG-M network pre-trained on a large face dataset. Given these two embeddings, an image decoder generates the talking-face frames so that the mouth shape aligns with the spoken audio; a minimal architectural sketch follows below.
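To make the two-encoder/one-decoder layout concrete, the PyTorch sketch below illustrates the general structure under stated assumptions: the layer sizes, the 256-dimensional embedding, and the 64x64 output frames are made up for illustration, and a generic convolutional stack stands in for the pre-trained VGG-M identity encoder. It is not the exact network from the paper.

```python
import torch
import torch.nn as nn

class Speech2VidSketch(nn.Module):
    """Minimal sketch of a Speech2Vid-style encoder-decoder (illustrative sizes)."""

    def __init__(self, emb_dim=256):
        super().__init__()
        # Audio encoder: consumes an MFCC "image" of shape (1, n_mfcc, T),
        # e.g. 12 coefficients over a short (~0.35 s) window (assumed values).
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, emb_dim),
        )
        # Identity encoder: stand-in for the VGG-M face network; takes a still
        # RGB image of the target identity.
        self.identity_encoder = nn.Sequential(
            nn.Conv2d(3, 96, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(96, 256, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(256, emb_dim),
        )
        # Image decoder: upsamples the fused embedding into a face frame.
        self.decoder = nn.Sequential(
            nn.Linear(2 * emb_dim, 512 * 4 * 4), nn.ReLU(),
            nn.Unflatten(1, (512, 4, 4)),
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, mfcc, face):
        a = self.audio_encoder(mfcc)     # (B, emb_dim) audio embedding
        v = self.identity_encoder(face)  # (B, emb_dim) identity embedding
        z = torch.cat([a, v], dim=1)     # joint audio-identity embedding
        return self.decoder(z)           # (B, 3, 64, 64) generated frame
```

The design choice mirrored here is that the audio and identity streams are encoded separately and fused only in the joint embedding, so the decoder must take mouth shape from the audio stream and appearance from the identity image.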
Significantly, the proposed Speech2Vid model diverges from classical phoneme-to-viseme mappings, instead learning the correspondence between audio and video directly from data. This allows the generation of videos for unseen voices and identities, underscoring the flexibility and generalizability of the system. The model is trained on tens of hours of unlabeled video footage, making it applicable to a diverse range of faces and spoken language patterns; a hedged training-step sketch follows below.
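As a concrete illustration of this direct audio-to-video training, the step below assumes that training pairs are sampled from unlabeled videos (the target is the true frame at the audio timestamp) and that a simple pixel-level L1 reconstruction loss is used; the actual loss and training schedule in the paper may differ.

```python
import torch.nn.functional as F

def train_step(model, optimizer, mfcc, identity_face, target_frame):
    # mfcc:          (B, 1, n_mfcc, T)  audio window around the target frame
    # identity_face: (B, 3, H, W)       still image of the same person
    # target_frame:  (B, 3, 64, 64)     ground-truth frame at the audio timestamp
    optimizer.zero_grad()
    generated = model(mfcc, identity_face)
    loss = F.l1_loss(generated, target_frame)  # assumed pixel-level reconstruction loss
    loss.backward()
    optimizer.step()
    return loss.item()
```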
Results and Implications
Empirically, the results demonstrate that the Speech2Vid model can generate convincing lip-synced videos across a wide variety of faces and audio segments. Notably, the paper reports stronger results when multiple still images of the target identity are fed to the system, which better preserves identity-specific facial features and expressions. Using several image references produces a more faithful and natural appearance in the generated videos and helps with facial details, such as micro-expressions, that are not directly tied to speech; a sketch of this multi-image input is shown below.
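One straightforward way to give a single CNN encoder several reference images is to stack them along the channel axis, so that an encoder with in_channels = 3 * N sees all views at once. The snippet below is an illustrative sketch of that idea; the number of images (five here) and the stacking scheme are assumptions rather than the paper's exact configuration.

```python
import torch

def stack_identity_images(frames):
    # frames: list of N tensors, each of shape (3, H, W), showing the same person.
    # Returns a (3 * N, H, W) tensor for an identity encoder whose first
    # convolution expects in_channels = 3 * N (assumed multi-image variant).
    return torch.cat(frames, dim=0)

# Hypothetical usage with five reference views of the target face.
views = [torch.rand(3, 112, 112) for _ in range(5)]
multi_view_input = stack_identity_images(views)  # shape: (15, 112, 112)
```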
Practical applications of this research are extensive. One demonstration in the paper highlights visual re-dubbing, where a different audio track is dubbed onto an existing video and the mouth motion is regenerated to match it precisely; a sliding-window generation sketch follows this paragraph. This application could be valuable for multimedia localization, improving lip-sync accuracy without extensive manual editing. The technique could also be used to animate faces in virtual reality and augmented reality environments, where realistic lip synchronization enhances the user experience.
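In spirit, re-dubbing amounts to sliding a short audio window over the new soundtrack, generating one face frame per video frame, and compositing the results back into the original footage. The sketch below illustrates that loop under stated assumptions (25 fps output, a 0.35 s window, 12 MFCC coefficients extracted with librosa, and the hypothetical Speech2VidSketch model from the earlier sketch); the final compositing step is omitted.

```python
import librosa
import torch

def generate_dub_frames(model, audio_path, identity_face,
                        fps=25, window_s=0.35, sr=16000, n_mfcc=12):
    # identity_face: (3, H, W) still image of the target person.
    wav, _ = librosa.load(audio_path, sr=sr)      # load and resample the new soundtrack
    n_frames = int(len(wav) / sr * fps)           # one generated frame per 1/fps seconds
    half = int(window_s * sr / 2)
    frames = []
    model.eval()
    with torch.no_grad():
        for i in range(n_frames):
            centre = int(i / fps * sr)                         # audio sample at frame centre
            chunk = wav[max(0, centre - half): centre + half]  # short window around it
            mfcc = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=n_mfcc)
            mfcc = torch.from_numpy(mfcc).float()[None, None]  # (1, 1, n_mfcc, T)
            frames.append(model(mfcc, identity_face[None]))    # (1, 3, 64, 64)
    return torch.cat(frames)                      # (n_frames, 3, 64, 64) generated frames
```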
Theoretical Contributions
The paper contributes significantly to the domain of cross-modal machine learning, connecting advances in audio representation with visual synthesis frameworks. It expands the understanding of generative models that go beyond traditional single-modality input-output mappings, moving toward integration across sensory modalities, and it highlights the utility of CNNs beyond spatial tasks, in temporal domains traditionally dominated by recurrent networks.
Future Directions
Future work on this line of inquiry could introduce quantifiable metrics to assess the performance of such generative models more objectively. As the paper notes, devising a measure analogous to the inception score but targeted at lip-movement accuracy would strengthen the evaluation criteria for models in similar contexts. Developing such benchmarks could drive further research into improved generative techniques and their application to human-computer interaction.
In summary, "You said that?" offers a comprehensive framework for video synthesis from audio cues, marking an advance in the intersection of computer vision and audio processing. Its implications span practical applications and theoretical exploration within the fields of AI, multimedia processing, and beyond.