- The paper introduces an innovative generator network using a derivative-based correlation loss for precise audio-visual alignment.
- It employs a comprehensive four-component loss function and a three-stream GAN discriminator to enhance video realism and motion coherence.
- Evaluation on GRID, LDC, and LRW datasets shows superior lip synchronization accuracy and overall video quality compared to state-of-the-art methods.
Lip Movements Generation at a Glance: An Expert Analysis
The paper "Lip Movements Generation at a Glance" explores the relatively new domain of cross-modality generation, specifically focusing on synthesizing lip movements from audio speech and a single image of a target identity. This research addresses a multifaceted challenge: the synchronization of audio speech with accurate lip movements across diverse identities while maintaining photo-realistic video quality, identity consistency, and smooth motion transitions. These objectives are fundamental for practical applications such as enhancing speech comprehension and supporting hearing-impaired devices.
Core Contributions and Methodology
The research presents a comprehensive system grounded in the modeling of audio-visual correlations. The authors propose an innovative generator network augmented by a novel audio-visual correlation loss. The core methodology involves:
- Audio-Visual Feature Fusion: The approach fuses audio features (obtained from log-mel spectrograms) with visual features extracted from the identity image. The audio feature for each output frame is duplicated across the spatial grid of the identity feature map and concatenated along the channel axis, reconciling the differing temporal and spatial dimensions before the fused representation is decoded into video (see the fusion sketch after this list).
- Derivative-Based Correlation Loss: A bespoke correlation model captures the correlation between the audio and visual modalities more effectively. The temporal derivative of the audio features is compared with optical-flow-derived visual features via cosine similarity, which makes the loss robust to offsets between the two streams (a sketch of this loss also follows the list).
- Comprehensive Loss Function: A four-component objective, combining the audio-visual correlation loss, a perceptual loss computed in feature space, a pixel-level reconstruction loss, and a GAN-based adversarial loss, improves the robustness of lip-movement synthesis along several complementary dimensions.
- Three-Stream GAN Discriminator: By incorporating distinct audio, video, and optical flow streams, the discriminator improves the model's capability to produce realistic and temporally coherent video sequences.
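The fusion step can be illustrated with a short sketch. The tensor shapes, function name, and channel sizes below are assumptions for illustration rather than the authors' exact architecture; the idea is simply to tile each per-frame audio feature over the spatial grid of the identity feature map and concatenate along the channel axis before decoding into frames.

```python
import torch

def fuse_audio_identity(audio_feat, identity_feat):
    """Duplicate per-frame audio features over the spatial grid of the
    identity feature map and concatenate along channels (illustrative sketch).

    audio_feat:    (B, T, Ca)      one feature vector per output frame
    identity_feat: (B, Ci, H, W)   encoded from the single identity image
    returns:       (B, T, Ca + Ci, H, W) fused features for the video decoder
    """
    B, T, Ca = audio_feat.shape
    _, Ci, H, W = identity_feat.shape

    # Tile each time step's audio vector over the H x W spatial grid.
    a = audio_feat.view(B, T, Ca, 1, 1).expand(B, T, Ca, H, W)

    # Repeat the identity features for every time step.
    v = identity_feat.unsqueeze(1).expand(B, T, Ci, H, W)

    # Channel-wise concatenation gives one fused map per output frame.
    return torch.cat([a, v], dim=2)

# Example shapes: 16 output frames, 256-dim audio features, 128-channel 8x8 identity map.
fused = fuse_audio_identity(torch.randn(2, 16, 256), torch.randn(2, 128, 8, 8))
print(fused.shape)  # torch.Size([2, 16, 384, 8, 8])
```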
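The derivative-based correlation loss can likewise be sketched in a few lines. Here the temporal derivative is approximated by a first-order difference, and the loss is one minus the cosine similarity between the audio-derivative features and the optical-flow features; the feature dimensions and the assumption of a shared embedding space are placeholders, and the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def correlation_loss(audio_feat, flow_feat, eps=1e-8):
    """Derivative-based audio-visual correlation loss (illustrative sketch).

    audio_feat: (B, T, D)   per-frame audio features
    flow_feat:  (B, T-1, D) features derived from optical flow between frames
    Both are assumed to already live in a shared D-dimensional space.
    """
    # First-order temporal difference approximates the derivative of the
    # audio features, aligning it with frame-to-frame optical flow.
    audio_deriv = audio_feat[:, 1:] - audio_feat[:, :-1]   # (B, T-1, D)

    # Cosine similarity per time step is insensitive to constant offsets and scale.
    cos = F.cosine_similarity(audio_deriv, flow_feat, dim=-1, eps=eps)  # (B, T-1)

    # Maximizing correlation is equivalent to minimizing (1 - cosine similarity).
    return (1.0 - cos).mean()

loss_corr = correlation_loss(torch.randn(2, 16, 64), torch.randn(2, 15, 64))
```

During training this term would be weighted together with the pixel-level, perceptual, and adversarial terms; the relative weights are hyperparameters not specified here.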
Experimental Evaluation
The model is rigorously tested on three datasets: GRID, LDC, and LRW, covering conditions that range from controlled lab recordings to in-the-wild footage. Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM) assess video quality, while the Landmark Distance (LMD) metric evaluates lip synchronization accuracy. A rough sketch of these metrics follows.
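For concreteness, the evaluation metrics can be computed roughly as follows. The lip-landmark inputs are assumed to come from an off-the-shelf facial landmark detector (an assumption here, not specified by the paper), and established implementations such as scikit-image are used for PSNR and SSIM.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def video_quality(pred_frames, gt_frames):
    """Mean PSNR/SSIM over a video; frames are uint8 arrays of shape (T, H, W, 3)."""
    psnr = np.mean([peak_signal_noise_ratio(g, p)
                    for g, p in zip(gt_frames, pred_frames)])
    ssim = np.mean([structural_similarity(g, p, channel_axis=-1)
                    for g, p in zip(gt_frames, pred_frames)])
    return psnr, ssim  # higher is better for both

def landmark_distance(pred_lips, gt_lips):
    """LMD: Euclidean distance between predicted and ground-truth lip landmarks,
    averaged over landmarks and frames (lower is better).

    pred_lips, gt_lips: (T, K, 2) arrays of K lip-landmark coordinates per frame.
    """
    return np.linalg.norm(pred_lips - gt_lips, axis=-1).mean()
```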
The results demonstrate superior performance over state-of-the-art methods, with notable gains in lip movement accuracy (LMD), image sharpness (Cumulative Probability of Blur Detection, CPBD), and overall video consistency. The integration of the correlation loss with a well-formulated discriminator is the key differentiator behind the model's success.
Implications and Future Directions
This research contributes significantly to the understanding and advancement of audio-visual synthesis in AI, offering insights into how cross-modality generation tasks can be handled more effectively. The implications extend to applications in multimedia content creation, augmented reality, and assistive technologies.
Future work could explore synthesizing lip movements for longer videos and for multiple identities from minimal input data. Another promising direction is expanding toward full facial animation, potentially integrating language models to interpret the semantic context of speech.
Overall, the paper demonstrates a meticulous blend of theoretical and practical AI advancements in video generation, setting a solid foundation for subsequent innovation in audiovisual synthesis.