Identity-Preserving Talking Face Generation with Landmark and Appearance Priors
This paper presents a novel approach to identity-preserving talking face generation from audio, a task of growing interest in computer vision and artificial intelligence. Unlike person-specific solutions, which require training data of the target individual and therefore break down when such data is unavailable, the proposed method is person-generic: it aims to generate facial videos that preserve the speaker's identity without any tailored training data.
The research introduces a two-stage framework comprising an audio-to-landmark generation phase followed by landmark-to-video rendering. In the first stage, a Transformer-based model infers lip and jaw landmarks from the audio signal. This component exploits the Transformer's ability to combine prior landmark information with temporal dependencies more effectively than traditional sequence models such as long short-term memory (LSTM) networks, which helps resolve the inherent ambiguity in the mapping from audio to facial landmarks.
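To make the first stage concrete, the sketch below shows how such an audio-to-landmark Transformer could be wired up in PyTorch. It is a minimal illustration rather than the authors' implementation: the mel-spectrogram input, the layer sizes, and the single prepended landmark-prior token are assumptions made for brevity.

```python
# Minimal sketch of an audio-to-landmark Transformer (illustrative, not the
# paper's exact architecture). Assumes per-frame mel-spectrogram features and
# 2D lip/jaw landmarks; dimensions are placeholders.
import torch
import torch.nn as nn

class AudioToLandmark(nn.Module):
    def __init__(self, audio_dim=80, lm_points=25, d_model=256,
                 n_heads=4, n_layers=4, max_len=512):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)       # per-frame audio features
        self.lm_proj = nn.Linear(lm_points * 2, d_model)      # flattened reference landmarks
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, lm_points * 2)         # predicted lip/jaw landmarks

    def forward(self, audio_feats, ref_landmarks):
        # audio_feats: (B, T, audio_dim); ref_landmarks: (B, lm_points, 2)
        B, T, _ = audio_feats.shape
        ref_tok = self.lm_proj(ref_landmarks.flatten(1)).unsqueeze(1)  # (B, 1, d_model)
        x = torch.cat([ref_tok, self.audio_proj(audio_feats)], dim=1)  # prepend landmark prior
        x = x + self.pos_emb[:, : x.size(1)]
        x = self.encoder(x)                                   # temporal self-attention
        return self.head(x[:, 1:]).view(B, T, -1, 2)          # (B, T, lm_points, 2)
```

Prepending a reference-landmark token lets self-attention condition every audio frame on the speaker's facial geometry, which is the role the paper assigns to its landmark priors.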
The second stage translates the inferred landmarks into photorealistic face frames. It draws on static reference images of the same person, which an alignment module warps with predicted motion fields so that they match the target facial pose and expression described by the landmarks. The rendering module then fuses multiple sources of information, namely the aligned reference images, the target frame with its lower half masked, and the audio features, to produce high-fidelity face frames that stay consistent with the driving audio.
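A minimal sketch of the warping step follows, assuming the motion field arrives as per-pixel (dx, dy) offsets; the network that predicts the flow and the final rendering decoder are omitted, and the function names (`warp_reference`, `fuse_inputs`) are placeholders rather than the paper's modules.

```python
# Minimal sketch of reference-image alignment via a dense motion field,
# using grid_sample's normalized [-1, 1] coordinate convention.
import torch
import torch.nn.functional as F

def warp_reference(ref_img, flow):
    """Warp a reference image toward the target pose with a dense motion field.

    ref_img: (B, 3, H, W) reference frame
    flow:    (B, 2, H, W) per-pixel offsets in pixels (dx, dy)
    """
    B, _, H, W = ref_img.shape
    # Base sampling grid in normalized coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=ref_img.device),
        torch.linspace(-1, 1, W, device=ref_img.device),
        indexing="ij")
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
    # Convert pixel offsets to normalized offsets and displace the grid.
    norm_flow = torch.stack(
        (flow[:, 0] * 2 / (W - 1), flow[:, 1] * 2 / (H - 1)), dim=-1)
    return F.grid_sample(ref_img, base + norm_flow, align_corners=True)

def fuse_inputs(warped_refs, masked_target):
    # Concatenate aligned references with the lower-half-masked target frame
    # along the channel axis before passing them to a rendering decoder.
    # warped_refs: (B, N, 3, H, W); masked_target: (B, 3, H, W)
    return torch.cat([warped_refs.flatten(1, 2), masked_target], dim=1)
```

Warping rather than directly encoding the references keeps fine appearance details spatially registered with the target pose, which is why the aligned images can be fused channel-wise with the masked target frame.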
The empirical evidence presented in the paper supports the approach. Evaluated against leading techniques such as Wav2Lip and MakeItTalk on the LRS2 and LRS3 datasets, the proposed model scores better on Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Fréchet Inception Distance (FID). It also achieves notably higher cosine similarity between identity vectors than competing methods, confirming that it preserves person-specific details even though it uses no person-specific training data.
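For reference, the identity-preservation score can be computed roughly as sketched below, assuming a pretrained face-recognition embedder (e.g., an ArcFace-style network) that maps a face crop to a feature vector; `embedder` is a hypothetical handle to such a network, not an API defined by the paper.

```python
# Sketch of an identity-similarity metric: mean cosine similarity between
# identity embeddings of generated frames and a ground-truth reference frame.
import torch
import torch.nn.functional as F

def identity_similarity(embedder, generated_frames, reference_frame):
    """Higher values indicate better identity preservation.

    generated_frames: (T, 3, H, W) synthesized face crops
    reference_frame:  (3, H, W) ground-truth face crop of the speaker
    """
    with torch.no_grad():
        gen_emb = embedder(generated_frames)               # (T, D)
        ref_emb = embedder(reference_frame.unsqueeze(0))   # (1, D)
    return F.cosine_similarity(gen_emb, ref_emb, dim=-1).mean().item()
```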
The model's value extends beyond producing visually realistic and temporally coherent talking face videos. Its ability to retain speaker identity opens applications in animation for virtual assistants, digital avatars, and personalized media content creation, where identity integrity is crucial. The model also shows promise for video dubbing, supported by its audio-visual synchronization performance as measured by the SyncScore metric.
In conclusion, this paper contributes a powerful framework that advances person-generic talking face video generation by effectively integrating landmark and appearance priors. It sets a precedent for future work on leveraging multi-source inputs to enhance identity preservation while keeping content creation realistic and ethically responsible at the intersection of audio and visual synthesis. Future work might exploit audio-visual correlations further or incorporate more dynamic reference imagery to improve facial animation fidelity and personalization.