
One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning (2112.02749v1)

Published 6 Dec 2021 in cs.CV

Abstract: Audio-driven one-shot talking face generation methods are usually trained on video resources of various persons. However, their created videos often suffer from unnatural mouth shapes and asynchronous lips because those methods struggle to learn a consistent speech style from different speakers. We observe that it would be much easier to learn a consistent speech style from a specific speaker, which leads to authentic mouth movements. Hence, we propose a novel one-shot talking face generation framework by exploring consistent correlations between audio and visual motions from a specific speaker and then transferring audio-driven motion fields to a reference image. Specifically, we develop an Audio-Visual Correlation Transformer (AVCT) that aims to infer talking motions, represented by keypoint-based dense motion fields, from an input audio. In particular, considering audio may come from different identities in deployment, we incorporate phonemes to represent audio signals. In this manner, our AVCT can inherently generalize to audio spoken by other identities. Moreover, as face keypoints are used to represent speakers, AVCT is agnostic to the appearance of the training speaker, and thus allows us to readily manipulate face images of different identities. Considering that different face shapes lead to different motions, a motion field transfer module is exploited to reduce the gap between the audio-driven dense motion fields of the training identity and the one-shot reference. Once we obtain the dense motion field of the reference image, we employ an image renderer to generate its talking face videos from an audio clip. Thanks to our learned consistent speaking style, our method generates authentic mouth shapes and vivid movements. Extensive experiments demonstrate that our synthesized videos outperform the state-of-the-art in terms of visual quality and lip-sync.

Authors (4)
  1. Suzhen Wang (16 papers)
  2. Lincheng Li (39 papers)
  3. Yu Ding (70 papers)
  4. Xin Yu (192 papers)
Citations (99)

Summary

Review of "One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning"

The paper "One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning" proposes a novel methodology for synthesizing audio-driven talking face videos using one-shot learning. The authors address the common challenges in existing models, such as unnatural mouth shapes and asynchrony between audio and visual outputs due to training on multiple speakers with differing styles. The approach leverages single-speaker data to establish a consistent speaking style that can be generalized to arbitrary speakers when given a reference image and an audio clip.

The core innovation of the paper lies in the Audio-Visual Correlation Transformer (AVCT), which maps audio input to keypoint-based dense motion fields that represent facial motion. Rather than raw acoustic features, the AVCT consumes phoneme representations of the audio, reducing its dependence on identity-specific vocal characteristics and enhancing generalization across diverse speakers. This methodological choice allows the model to apply the learned speaking style effectively to novel speakers and voices.
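To make this concrete, the sketch below shows one way such an audio-to-motion transformer could be wired up in PyTorch. The class name mirrors the paper's AVCT, but the phoneme vocabulary size, keypoint count, model dimensions, and the single keypoint-token conditioning scheme are illustrative assumptions, not the authors' actual architecture.

```python
# Minimal sketch of an audio-to-motion transformer in the spirit of AVCT.
# All dimensions and the conditioning scheme are illustrative assumptions.
import torch
import torch.nn as nn


class AudioVisualCorrelationTransformer(nn.Module):
    def __init__(self, n_phonemes=42, n_keypoints=10, d_model=256,
                 n_heads=4, n_layers=4):
        super().__init__()
        self.phoneme_embed = nn.Embedding(n_phonemes, d_model)
        # Encode the reference keypoints as a single conditioning token.
        self.keypoint_proj = nn.Linear(n_keypoints * 2, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, n_layers)
        # Predict per-frame keypoint displacements that parameterize
        # the dense motion field.
        self.motion_head = nn.Linear(d_model, n_keypoints * 2)

    def forward(self, phoneme_ids, ref_keypoints):
        # phoneme_ids: (batch, frames) integer phoneme labels per frame
        # ref_keypoints: (batch, n_keypoints, 2) keypoints of the reference face
        tokens = self.phoneme_embed(phoneme_ids)
        kp_token = self.keypoint_proj(ref_keypoints.flatten(1)).unsqueeze(1)
        hidden = self.encoder(torch.cat([kp_token, tokens], dim=1))
        # Drop the conditioning token; one displacement per audio frame remains.
        disp = self.motion_head(hidden[:, 1:])
        return disp.view(phoneme_ids.size(0), -1, ref_keypoints.size(1), 2)


# Example: 30 phoneme-labelled frames driving 10 face keypoints.
model = AudioVisualCorrelationTransformer()
phonemes = torch.randint(0, 42, (1, 30))
keypoints = torch.rand(1, 10, 2)
print(model(phonemes, keypoints).shape)  # torch.Size([1, 30, 10, 2])
```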

A significant practical component of the framework is the motion field transfer module, which accounts for differences in facial structure between the training identity and new reference images, ensuring coherent and natural face movements. The pipeline further employs a pretrained keypoint detector to extract baseline keypoints from the reference image and an image renderer to translate the predicted motion fields into realistic talking face video frames.
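The overall inference flow can be summarized as follows; the component callables (keypoint_detector, avct, motion_transfer, renderer) are hypothetical placeholders standing in for the paper's modules, not the authors' released implementation.

```python
# Sketch of the inference pipeline described above, under the assumption
# that each stage is available as a callable with the interfaces shown.
import torch


def generate_talking_face(audio_phonemes, reference_image,
                          keypoint_detector, avct, motion_transfer, renderer):
    """Produce a frame sequence from one reference image and phoneme-labelled audio."""
    # 1. Extract keypoints of the one-shot reference face.
    ref_keypoints = keypoint_detector(reference_image)

    # 2. Infer audio-driven motion (keypoint displacements) with the AVCT.
    motion = avct(audio_phonemes, ref_keypoints)

    # 3. Adapt the motion field from the training identity to the
    #    reference face shape.
    adapted_motion = motion_transfer(motion, ref_keypoints)

    # 4. Render one frame per audio frame by warping the reference image.
    frames = [renderer(reference_image, adapted_motion[:, t])
              for t in range(adapted_motion.size(1))]
    return torch.stack(frames, dim=1)  # (batch, frames, C, H, W)
```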

The paper presents extensive experiments demonstrating that the proposed model surpasses state-of-the-art methods in visual quality and lip-sync accuracy. Quantitative evaluations using metrics such as FID (Fréchet Inception Distance), CPBD (Cumulative Probability of Blur Detection), and LMD (Landmark Distance) show improvements over competing methods on datasets including VoxCeleb2 and HDTF. The authors also conduct qualitative assessments and ablation studies that substantiate the contribution of each architectural component.
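As a point of reference, LMD is commonly computed as the mean Euclidean distance between corresponding mouth landmarks of generated and ground-truth frames. A minimal sketch follows, assuming landmarks have already been extracted with an off-the-shelf face-alignment tool.

```python
# Landmark distance (LMD): mean Euclidean distance between corresponding
# landmarks of generated and ground-truth frames; lower is better.
import numpy as np


def landmark_distance(pred_landmarks, gt_landmarks):
    """Both inputs: arrays of shape (frames, n_points, 2)."""
    pred = np.asarray(pred_landmarks, dtype=np.float64)
    gt = np.asarray(gt_landmarks, dtype=np.float64)
    per_point = np.linalg.norm(pred - gt, axis=-1)  # (frames, n_points)
    return per_point.mean()


# Example with random stand-in landmarks: 75 frames, 20 mouth points.
rng = np.random.default_rng(0)
print(landmark_distance(rng.random((75, 20, 2)), rng.random((75, 20, 2))))
```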

The implications of this research are noteworthy both theoretically and practically. Theoretically, the approach suggests that learning audio-visual correlations from a single speaker can capture a consistent speech style that transfers across identities, offering a refined perspective for future work on generative models for talking faces. Practically, the work has significant applications in domains such as digital avatars, virtual communication interfaces, and entertainment, where generating realistic, high-fidelity talking faces is increasingly in demand.

Looking ahead, integrating more sophisticated audio feature extraction and extending the framework to more complex visual dynamics, including richer facial expressions and head movements, could further improve its applicability and performance. Optimizing the AVCT architecture for computational efficiency, especially in real-time applications, is another promising avenue.

Overall, the paper provides a substantial contribution to the field of audio-driven talking face generation, addressing several of the prevalent limitations in prior works and paving the way for innovative applications of audio-visual synthesis.