Identity-Preserving Talking Face Generation with Landmark and Appearance Priors
This paper presents a novel approach to identity-preserving talking face generation from audio, a task of growing interest in computer vision and artificial intelligence. Unlike person-specific solutions, which require training data of the target individual and therefore break down when such data is unavailable, the proposed method is person-generic: it aims to generate facial videos that preserve the speaker's identity without any tailored training data.
The research introduces a two-stage framework comprising an audio-to-landmark generation phase followed by landmark-to-video rendering. In the first stage, a Transformer-based model infers lip and jaw landmarks from the audio signal. This component exploits the Transformer's ability to combine prior landmark information with temporal dependencies more effectively than traditional sequence models such as long short-term memory (LSTM) networks, which helps resolve the inherent ambiguity in the mapping from audio to facial landmarks.
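To make the first stage concrete, the sketch below shows how such an audio-to-landmark Transformer could be wired up in PyTorch. It is a minimal illustration rather than the authors' implementation: the mel-spectrogram input, the layer sizes, and the single prepended landmark-prior token are assumptions made for brevity.

```python
# Minimal sketch of an audio-to-landmark Transformer (illustrative, not the
# paper's exact architecture). Assumes per-frame mel-spectrogram features and
# 2D lip/jaw landmarks; dimensions are placeholders.
import torch
import torch.nn as nn

class AudioToLandmark(nn.Module):
    def __init__(self, audio_dim=80, lm_points=25, d_model=256,
                 n_heads=4, n_layers=4, max_len=512):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)       # per-frame audio features
        self.lm_proj = nn.Linear(lm_points * 2, d_model)      # flattened reference landmarks
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, lm_points * 2)         # predicted lip/jaw landmarks

    def forward(self, audio_feats, ref_landmarks):
        # audio_feats: (B, T, audio_dim); ref_landmarks: (B, lm_points, 2)
        B, T, _ = audio_feats.shape
        ref_tok = self.lm_proj(ref_landmarks.flatten(1)).unsqueeze(1)  # (B, 1, d_model)
        x = torch.cat([ref_tok, self.audio_proj(audio_feats)], dim=1)  # prepend landmark prior
        x = x + self.pos_emb[:, : x.size(1)]
        x = self.encoder(x)                                   # temporal self-attention
        return self.head(x[:, 1:]).view(B, T, -1, 2)          # (B, T, lm_points, 2)
```

Prepending a reference-landmark token lets self-attention condition every audio frame on the speaker's facial geometry, which is the role the paper assigns to its landmark priors.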
The second stage translates the inferred landmarks into photorealistic face frames. It draws on static reference images of the same person, which an alignment module warps with predicted motion fields so that they match the target facial pose and expression described by the landmarks. The rendering module then fuses multiple sources of information, namely the aligned reference images, the target frame with its lower half masked, and the audio features, to produce high-fidelity face frames that stay consistent with the driving audio.
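A minimal sketch of the warping step follows, assuming the motion field arrives as per-pixel (dx, dy) offsets; the network that predicts the flow and the final rendering decoder are omitted, and the function names (`warp_reference`, `fuse_inputs`) are placeholders rather than the paper's modules.

```python
# Minimal sketch of reference-image alignment via a dense motion field,
# using grid_sample's normalized [-1, 1] coordinate convention.
import torch
import torch.nn.functional as F

def warp_reference(ref_img, flow):
    """Warp a reference image toward the target pose with a dense motion field.

    ref_img: (B, 3, H, W) reference frame
    flow:    (B, 2, H, W) per-pixel offsets in pixels (dx, dy)
    """
    B, _, H, W = ref_img.shape
    # Base sampling grid in normalized coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=ref_img.device),
        torch.linspace(-1, 1, W, device=ref_img.device),
        indexing="ij")
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
    # Convert pixel offsets to normalized offsets and displace the grid.
    norm_flow = torch.stack(
        (flow[:, 0] * 2 / (W - 1), flow[:, 1] * 2 / (H - 1)), dim=-1)
    return F.grid_sample(ref_img, base + norm_flow, align_corners=True)

def fuse_inputs(warped_refs, masked_target):
    # Concatenate aligned references with the lower-half-masked target frame
    # along the channel axis before passing them to a rendering decoder.
    # warped_refs: (B, N, 3, H, W); masked_target: (B, 3, H, W)
    return torch.cat([warped_refs.flatten(1, 2), masked_target], dim=1)
```

Warping rather than directly encoding the references keeps fine appearance details spatially registered with the target pose, which is why the aligned images can be fused channel-wise with the masked target frame.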
The empirical evidence presented in the paper supports the approach. Evaluated against leading techniques such as Wav2Lip and MakeItTalk on the LRS2 and LRS3 datasets, the proposed model scores better on Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Fréchet Inception Distance (FID). It also achieves notably higher cosine similarity between identity vectors than competing methods, confirming that it preserves person-specific details even though it uses no person-specific training data.
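For reference, the identity-preservation score can be computed roughly as sketched below, assuming a pretrained face-recognition embedder (e.g., an ArcFace-style network) that maps a face crop to a feature vector; `embedder` is a hypothetical handle to such a network, not an API defined by the paper.

```python
# Sketch of an identity-similarity metric: mean cosine similarity between
# identity embeddings of generated frames and a ground-truth reference frame.
import torch
import torch.nn.functional as F

def identity_similarity(embedder, generated_frames, reference_frame):
    """Higher values indicate better identity preservation.

    generated_frames: (T, 3, H, W) synthesized face crops
    reference_frame:  (3, H, W) ground-truth face crop of the speaker
    """
    with torch.no_grad():
        gen_emb = embedder(generated_frames)               # (T, D)
        ref_emb = embedder(reference_frame.unsqueeze(0))   # (1, D)
    return F.cosine_similarity(gen_emb, ref_emb, dim=-1).mean().item()
```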
The model's value extends beyond producing visually realistic and temporally coherent talking face videos. Its ability to retain speaker identity opens applications in animation for virtual assistants, digital avatars, and personalized media content creation, where identity integrity is crucial. The model also shows promise for video dubbing, supported by its audio-visual synchronization performance as measured by the SyncScore metric.
In conclusion, this paper contributes a powerful framework that advances person-generic talking face video generation by effectively integrating landmark and appearance priors. It sets a precedent for future work on leveraging multi-source inputs to enhance identity preservation while keeping content creation realistic and ethically responsible at the intersection of audio and visual synthesis. Future work might exploit audio-visual correlations further or incorporate more dynamic reference imagery to improve facial animation fidelity and personalization.