- The paper demonstrates that a lip sync expert significantly improves synchronization accuracy in diverse and uncontrolled video conditions.
- Utilizing a 2D-CNN generator with two discriminators, Wav2Lip achieves notable improvements on the LSE-D, LSE-C, and FID metrics.
- Real-world evaluations confirm that Wav2Lip excels in aligning audio and visual elements, outperforming conventional methods in both realism and synchronization.
A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild
The paper "A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild" introduces an innovative model named Wav2Lip, focusing on enhancing lip synchronization for talking face videos with arbitrary identities and speech segments. The paper articulates a significant advancement over existing methodologies, which frequently falter in achieving accurate lip movement synchronization in dynamic and uncontrolled video environments.
Problem Domain and Motivation
As multimedia content consumption grows, accurately lip-syncing video to audio for applications such as dubbing and translation becomes crucial. Prior models work well on static images or on specific speakers seen during training, but they often fail in real-world, diverse scenarios involving unseen identities and speech, largely because they were trained on limited data or relied on loss functions (pixel-level reconstruction or weakly trained discriminators) that penalize off-sync lip shapes only weakly.
Methodology
The proposed approach, Wav2Lip, leverages a pre-trained, expert lip-sync discriminator during training, which improves synchronization by strongly penalizing out-of-sync lip movements. The model is built on a generator-discriminator framework (a sketch of the combined training loss follows the list below):
- Generator: A 2D-CNN encoder-decoder architecture inspired by LipGAN, tasked with generating lip-synced face frames one frame at a time, conditioned on the corresponding audio segment.
- Discriminators: Two distinct discriminators are used. The expert lip-sync discriminator is pre-trained and kept frozen while the generator trains, enforcing synchronization accuracy; a second, visual-quality discriminator is trained jointly with the generator to penalize blurry or artifact-ridden frames.
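To make the training objective concrete, here is a minimal PyTorch-style sketch of how the three signals combine: an L1 reconstruction loss, a sync loss from the frozen expert discriminator (a cosine-similarity score penalized with binary cross-entropy), and an adversarial loss from the visual-quality discriminator. The function names, tensor shapes, and the 0.03/0.07 loss weights are illustrative assumptions based on the paper's description, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def expert_sync_loss(video_emb, audio_emb, eps=1e-7):
    # Cosine similarity between the face-window and mel-segment embeddings
    # produced by the frozen expert lip-sync discriminator (assumed shape:
    # [batch, emb_dim]), penalized with BCE against the "in sync" label 1.
    p_sync = F.cosine_similarity(video_emb, audio_emb, dim=1).clamp(eps, 1.0 - eps)
    return F.binary_cross_entropy(p_sync, torch.ones_like(p_sync))

def generator_loss(gen_frames, gt_frames, video_emb, audio_emb, d_fake_logits,
                   sync_w=0.03, gan_w=0.07):
    # L1 reconstruction between generated and ground-truth frames.
    l_recon = F.l1_loss(gen_frames, gt_frames)
    # Sync penalty from the frozen expert discriminator.
    l_sync = expert_sync_loss(video_emb, audio_emb)
    # Adversarial term: the visual-quality discriminator should judge the
    # generated frames as real (its logits on fake frames are passed in).
    l_gan = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
    # Weighted sum; the 0.03 / 0.07 weights are an assumption in the spirit
    # of the paper's reported settings.
    return (1.0 - sync_w - gan_w) * l_recon + sync_w * l_sync + gan_w * l_gan
```

Because the expert discriminator is never updated on the generator's possibly blurry outputs, it cannot learn to be fooled, which is the paper's central argument for its effectiveness.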
Additionally, the paper introduces new evaluation metrics, LSE-D (Lip Sync Error - Distance) and LSE-C (Lip Sync Error - Confidence), computed with a pre-trained SyncNet model, to quantitatively measure audio-visual synchronization. Unlike earlier evaluation protocols, these metrics are applied to unmodified real-world test videos rather than to inputs altered to suit the model.
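As a rough illustration of how such SyncNet-based metrics can be computed, the sketch below takes a matrix of audio-visual embedding distances (one row per candidate temporal offset tried by a SyncNet-style model) and derives a distance and a confidence score. This mirrors the usual SyncNet evaluation procedure; the exact pipeline used to report LSE-D/LSE-C may differ in details, and the input layout is an assumption.

```python
import numpy as np

def syncnet_style_scores(dists_by_offset):
    """dists_by_offset: array of shape [num_offsets, num_windows] holding the
    audio-visual embedding distance for every candidate temporal offset and
    every sliding window of the video (an assumed input layout)."""
    mean_per_offset = dists_by_offset.mean(axis=1)     # average distance per offset
    lse_d = float(mean_per_offset.min())               # distance at the best-aligned offset
    lse_c = float(np.median(mean_per_offset) - lse_d)  # confidence: median minus minimum
    return lse_d, lse_c
```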
Results
The Wav2Lip model is evaluated on standard benchmark test sets, outperforming existing methods by a substantial margin on both LSE-D and LSE-C. The results show that Wav2Lip achieves synchronization accuracy close to that of real, naturally synced videos. Furthermore, Fréchet Inception Distance (FID) scores indicate high visual fidelity of the generated outputs.
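For reference, FID compares the statistics of Inception-network activations for real and generated frames; a lower score means the generated distribution is closer to the real one. The sketch below computes the standard FID formula from precomputed feature means and covariances (the feature extraction itself is assumed to happen elsewhere).

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(mu_r, sigma_r, mu_g, sigma_g):
    # ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 * (S_r S_g)^(1/2))
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```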
In the real-world evaluation on the ReSyncED benchmark, which includes dubbed videos, randomly mismatched audio-video pairs, and text-to-speech synthesized audio, the model yields quantitatively stronger results and is preferred by human evaluators across multiple criteria, including synchronization accuracy and visual quality.
Implications
The Wav2Lip model's ability to generate accurately synchronized lip movements has significant implications for practical applications such as video dubbing, content translation, and animated films. It can make multimedia content consumed in other languages feel more natural and fluent, improving accessibility.
Conclusion and Future Work
The paper demonstrates that a strong, pre-trained lip-sync discriminator, kept frozen during generator training, is a far more effective critic than one trained jointly with the generator. It opens pathways for further research into expression synthesis and head-pose generation alongside lip synchronization, and the proposed benchmarking framework sets the stage for standardized, consistent, and reliable evaluation of similar systems in the future.
The contribution of this work lies in its robust model architecture and its new domain-specific metrics, offering promising tools for researchers seeking to close the remaining gap between audio and visual synchronization in multimedia applications.