- The paper introduces VideoReTalking, which decomposes audio-based lip synchronization into expression normalization, audio-driven synthesis, and face enhancement.
- It leverages 3D Morphable Models, Fast Fourier Convolution, and GAN-based techniques to generate photorealistic talking head videos with improved fidelity.
- Evaluations on LRS2 and HDTF datasets show substantial gains in visual quality and lip-sync accuracy, outperforming LipGAN, Wav2Lip, and PC-AVS.
Overview of VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild
The paper presents VideoReTalking, a system for editing talking head videos so that the lip movements match a given input audio track. The task is divided into three components: expression normalization, audio-driven lip synchronization, and face enhancement, forming a pipeline that produces high-quality, photo-realistic videos faithful to the input audio, even when that audio carries a different emotion from the original footage.
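Concretely, the decomposition can be pictured as three sequential modules. The sketch below is illustrative only; the function names (`retalk_video`, `normalizer`, `lip_syncer`, `enhancer`) and tensor shapes are assumptions for exposition, not the authors' interfaces.

```python
# Hypothetical sketch of the three-stage pipeline described in the paper;
# module names and signatures are illustrative, not the authors' API.
import torch

def retalk_video(frames: torch.Tensor, audio_feats: torch.Tensor,
                 normalizer, lip_syncer, enhancer) -> torch.Tensor:
    """frames: (T, 3, H, W) source video; audio_feats: (T, D) per-frame audio features."""
    # 1) Stabilize expressions: re-render every frame toward a canonical template.
    neutral_frames = normalizer(frames)
    # 2) Synthesize lower-face motion conditioned on the audio features.
    synced_frames = lip_syncer(neutral_frames, audio_feats)
    # 3) Restore detail and identity with the GAN-based enhancement network.
    return enhancer(synced_frames, reference=frames)
```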
Methodology and System Architecture
The system first applies a face reenactment network to stabilize expressions across video frames and to reduce expression-related information leakage during lip generation. Each original frame is edited to match a canonical expression template using 3D Morphable Model (3DMM) coefficients and a semantic-guided reenactment network. The normalized frames then serve as a consistent structural reference for the lip-sync network.
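The coefficient-swapping idea can be illustrated as follows; `estimate_3dmm`, `reenact`, `template_exp`, and the coefficient dictionary layout are hypothetical names used for exposition, not the paper's actual code.

```python
# Illustrative sketch of expression normalization via 3DMM coefficient swapping.
import torch

def normalize_expression(frame: torch.Tensor,
                         template_exp: torch.Tensor,
                         estimate_3dmm, reenact) -> torch.Tensor:
    coeffs = estimate_3dmm(frame)        # assumed: dict of identity/expression/pose coefficients
    coeffs["expression"] = template_exp  # replace with the canonical expression template
    # The semantic-guided reenactment network re-renders the frame so that all
    # frames share a consistent, neutralized expression for the lip-sync stage.
    return reenact(frame, coeffs)
```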
The lip-sync network is built on an encoder-decoder architecture with Fast Fourier Convolution blocks and injects audio cues through Adaptive Instance Normalization (AdaIN). Conditioned on the audio, it synthesizes the lower-half face while preserving the structure of the stabilized reference frames, and a pre-trained lip-sync discriminator enforces accurate synchronization between the visual and auditory streams.
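A minimal sketch of audio-conditioned AdaIN is given below, assuming a per-frame audio embedding that predicts per-channel scale and shift for the visual feature maps; the layer sizes and module name are illustrative rather than taken from the paper.

```python
# Minimal sketch of audio-conditioned Adaptive Instance Normalization (AdaIN).
import torch
import torch.nn as nn

class AudioAdaIN(nn.Module):
    def __init__(self, channels: int, audio_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_scale_shift = nn.Linear(audio_dim, channels * 2)

    def forward(self, feat: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) visual features; audio: (B, audio_dim) audio embedding.
        scale, shift = self.to_scale_shift(audio).chunk(2, dim=1)
        scale = scale.unsqueeze(-1).unsqueeze(-1)
        shift = shift.unsqueeze(-1).unsqueeze(-1)
        # Normalize the visual features, then modulate them with audio-derived statistics.
        return self.norm(feat) * (1 + scale) + shift
```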
To further refine visual quality, the system adds an identity-aware face enhancement network that compensates for the low resolution of the large-scale training datasets. Using GAN-based restoration, this module sharpens the output frames while preserving the speaker's identity.
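One plausible way to make such an enhancement module identity-aware is an embedding-consistency term against the source frame. The sketch below assumes a pretrained face-recognition embedder (`face_embedder`) and is an illustrative formulation, not the paper's exact loss.

```python
# Hedged sketch of an identity-preservation loss: penalize embedding drift
# between the enhanced frame and the source frame.
import torch
import torch.nn.functional as F

def identity_loss(enhanced: torch.Tensor, source: torch.Tensor, face_embedder) -> torch.Tensor:
    # face_embedder (assumed) maps (B, 3, H, W) face crops to identity embeddings.
    e1 = F.normalize(face_embedder(enhanced), dim=-1)
    e2 = F.normalize(face_embedder(source), dim=-1)
    # 1 - cosine similarity: zero when the two identities match exactly.
    return (1.0 - (e1 * e2).sum(dim=-1)).mean()
```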
Evaluation and Results
Extensive evaluation on the LRS2 and HDTF datasets shows that VideoReTalking outperforms existing methods such as LipGAN, Wav2Lip, and PC-AVS. The system achieves stronger visual quality as measured by cumulative probability blur detection (CPBD) and Fréchet inception distance (FID), along with clear improvements in lip-sync accuracy. User studies further corroborate the quality of the resulting edits.
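As a rough illustration of how an FID comparison on sampled frames could be reproduced, the following uses `torchmetrics` (which relies on the `torch-fidelity` backend); the frame sampling and preprocessing are left abstract and are not taken from the paper's evaluation code.

```python
# Sketch of a frame-level FID computation with torchmetrics.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def video_fid(real_frames: torch.Tensor, generated_frames: torch.Tensor) -> float:
    # Both inputs: (N, 3, H, W) uint8 frames sampled from the real and generated videos.
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_frames, real=True)
    fid.update(generated_frames, real=False)
    return fid.compute().item()
```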
Implications and Future Directions
VideoReTalking has clear practical relevance for applications that require video dubbing, such as media localization and film production. Its ability to render different emotional states also opens avenues for richer storytelling and audience engagement in digital content creation. From a theoretical standpoint, the paper contributes to work on cross-modal synthesis and generative modeling, particularly the use of audio cues to drive visual transformations.
Future research could extend the emotional range of the edits and integrate stronger identity-preservation mechanisms to address the limitations observed under extreme poses or identities. Higher-resolution training data could likewise improve training and push visual fidelity further. Overall, VideoReTalking illustrates how AI-driven tools are steering content creation toward more realistic and engaging virtual experiences.