- The paper introduces a two-stage framework with a 3D hybrid prior and dual representation that improves lip synchronization, head motion, and video synthesis quality over state-of-the-art models.
- The methodology integrates blendshape coefficients and vertex offsets via a multi-branch transformer to model fine-grained facial expressions and natural head movements.
- Experimental results, including improved SyncNet confidence and FID scores, validate VividTalk's superior performance in generating photorealistic talking head videos.
Overview of VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior
The paper "VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior" presents a comprehensive framework for generating high-quality talking head videos driven by audio input. This framework introduces innovative solutions to overcome limitations in audio-driven talking head generation, specifically addressing the challenges posed by non-rigid facial expression motion and rigid head motion.
Methodology and Technical Contributions
VividTalk is structured as a two-stage framework: Audio-To-Mesh Generation and Mesh-To-Video Generation. The Audio-To-Mesh Generation stage employs a dual representation, mapping audio to both blendshape coefficients and vertex offsets through a multi-branch transformer-based network. Using both representations provides broader modeling capacity than previous methods that rely on blendshapes or vertices alone, capturing fine-grained lip movements while preserving expressive facial characteristics.
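To make the dual representation concrete, the following is a minimal PyTorch sketch of a multi-branch transformer that maps a sequence of audio features to blendshape coefficients and per-vertex offsets. The shared encoder, layer sizes, and the 52-blendshape / 5023-vertex dimensions are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of a multi-branch transformer mapping audio features to a
# dual motion representation: blendshape coefficients and per-vertex offsets.
# Dimensions, layer counts, and the shared-encoder design are illustrative
# assumptions, not taken from the VividTalk implementation.
import torch
import torch.nn as nn


class AudioToMeshSketch(nn.Module):
    def __init__(self, audio_dim=256, model_dim=512,
                 num_blendshapes=52, num_vertices=5023):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, model_dim)

        # Shared temporal encoder over the audio feature sequence.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=8, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

        # Branch 1: coarse, expression-level motion as blendshape coefficients.
        self.blendshape_head = nn.Sequential(
            nn.Linear(model_dim, model_dim), nn.GELU(),
            nn.Linear(model_dim, num_blendshapes))

        # Branch 2: fine-grained motion as per-vertex 3D offsets (e.g. lips).
        self.vertex_head = nn.Sequential(
            nn.Linear(model_dim, model_dim), nn.GELU(),
            nn.Linear(model_dim, num_vertices * 3))

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim) pre-extracted audio features.
        h = self.shared_encoder(self.audio_proj(audio_feats))
        blendshapes = self.blendshape_head(h)                 # (B, T, 52)
        offsets = self.vertex_head(h)                         # (B, T, V*3)
        offsets = offsets.view(*offsets.shape[:2], -1, 3)     # (B, T, V, 3)
        return blendshapes, offsets


if __name__ == "__main__":
    model = AudioToMeshSketch()
    bs, off = model(torch.randn(2, 100, 256))  # 100 audio frames
    print(bs.shape, off.shape)  # (2, 100, 52) and (2, 100, 5023, 3)
```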
In addressing head motion, which has been a persistent challenge due to its weak correlation with audio, VividTalk introduces a novel learnable head pose codebook. Through a two-phase training mechanism, the framework builds a discrete and finite head pose space, subsequently mapping audio inputs to this codebook to derive natural and continuous head motion. This approach is a significant advancement compared to prior methods that often resulted in discontinuous and unrealistic head poses.
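The discrete pose space can be illustrated with a VQ-style codebook lookup. The sketch below quantizes a continuous pose sequence to its nearest codebook entries; the code size, 6-DoF pose parameterization, and straight-through gradient trick are assumptions used for illustration rather than details taken from the paper.

```python
# Minimal sketch of a discrete head-pose codebook with VQ-style nearest-code
# lookup. The codebook size, pose dimensionality, and straight-through
# estimator are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HeadPoseCodebook(nn.Module):
    def __init__(self, num_codes=256, pose_dim=6):
        super().__init__()
        # Assumed two-phase training: phase 1 learns the codebook by
        # reconstructing real head pose sequences; phase 2 freezes it and
        # trains an audio-to-code predictor that selects entries per frame.
        self.codes = nn.Embedding(num_codes, pose_dim)

    def quantize(self, pose):
        # pose: (batch, frames, pose_dim) continuous pose estimates.
        dists = (pose.unsqueeze(-2) - self.codes.weight).pow(2).sum(dim=-1)
        idx = dists.argmin(dim=-1)                          # nearest code id
        quantized = self.codes(idx)                         # (B, T, pose_dim)
        # Straight-through estimator so gradients still reach the encoder.
        quantized = pose + (quantized - pose).detach()
        commit_loss = F.mse_loss(pose, quantized.detach())
        return quantized, idx, commit_loss


if __name__ == "__main__":
    cb = HeadPoseCodebook()
    q, idx, loss = cb.quantize(torch.randn(2, 100, 6))
    print(q.shape, idx.shape, loss.item())
```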
The transition from 3D meshes to video frames, handled in the Mesh-To-Video Generation stage, involves a dual-branch motion variational autoencoder (motion-VAE). This component models dense 2D motion from the driven meshes, significantly improving the synthesis quality of the video frames. A key design choice is the rendering of projected textures combined with attention to both the inner and outer facial regions, which contributes to photorealistic video output.
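A minimal sketch of the idea behind the motion-VAE follows: a convolutional VAE that encodes rendered inputs and decodes a dense 2D motion field, with separate inner- and outer-face branches blended by a mask. The single-encoder / two-decoder layout, channel counts, and mask-based blending are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of a motion VAE mapping rendered inputs (e.g. reference frame
# plus projected texture of the driven mesh) to a dense 2D motion field.
# The layout, channel counts, and inner/outer-face blending are illustrative
# assumptions rather than the paper's exact design.
import torch
import torch.nn as nn


class MotionVAESketch(nn.Module):
    def __init__(self, in_ch=6, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 2 * latent_dim, 4, stride=2, padding=1))

        def branch():
            # Upsamples the latent map back to a 2-channel (dx, dy) flow field.
            return nn.Sequential(
                nn.ConvTranspose2d(latent_dim, 128, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 2, 4, stride=2, padding=1))

        self.inner_branch = branch()   # fine motion inside the face region
        self.outer_branch = branch()   # coarser motion for head/torso/background

    def forward(self, x, inner_mask):
        mu, logvar = self.encoder(x).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        flow = (inner_mask * self.inner_branch(z)
                + (1 - inner_mask) * self.outer_branch(z))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return flow, kl


if __name__ == "__main__":
    vae = MotionVAESketch()
    x = torch.randn(1, 6, 256, 256)     # e.g. reference frame + projected texture
    mask = torch.rand(1, 1, 256, 256)   # soft inner-face mask
    flow, kl = vae(x, mask)
    print(flow.shape, kl.item())        # (1, 2, 256, 256)
```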
Performance and Evaluation
The paper's experimental results underline VividTalk's effectiveness, showing improvements in lip synchronization, video quality, identity preservation, and head pose diversity over state-of-the-art methods such as SadTalker, TalkLip, and Wav2Lip. Quantitative metrics, including SyncNet confidence for lip-sync quality and Fréchet Inception Distance (FID) for image quality, indicate significant gains. The approach also recognizes and addresses the importance of replicating realistic head poses, a feature handled less comprehensively by existing models.
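For reference, the sketch below computes the Fréchet Inception Distance from two sets of feature vectors using its closed-form expression. In a real evaluation the features would come from an Inception network's activations over real and generated frames; the random placeholder features and their dimensionality here are purely illustrative.

```python
# Minimal sketch of the Fréchet Inception Distance (FID) formula used to
# compare generated and real frames. Feature vectors normally come from an
# Inception-v3 layer; random placeholders are used here for illustration.
import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(feats_real, feats_fake):
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov1 = np.cov(feats_real, rowvar=False)
    cov2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):          # discard tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 64))     # placeholder "Inception" features
    fake = rng.normal(size=(500, 64))
    print(f"FID (toy features): {frechet_distance(real, fake):.3f}")
```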
The methodological design is further validated through diverse qualitative evaluations and user studies, suggesting that VividTalk generates more natural and expressive talking head videos. The blend of blendshape and vertex representations, coupled with a meticulously designed framework for head pose generation, collectively set a new standard in audio-driven video synthesis.
Implications and Future Directions
VividTalk's contributions have noteworthy implications for areas requiring high-fidelity digital human models, such as virtual avatars, visual dubbing, and digital communication technologies. The applied techniques and methodologies contribute crucial insights to the domain of audio-driven video synthesis by addressing previously unmet challenges in natural motion modeling.
Looking toward future developments, extending the framework's adaptability across a broader range of identities and scenarios remains a compelling direction. The framework could also be leveraged to explore richer interactions between multi-modal inputs, further pushing the boundaries of realism and personalization in AI-driven avatar generation. Integrating real-time capabilities could transform applications where live or dynamic interaction is paramount, broadening the operational scope of the proposed model.
In conclusion, "VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior" provides a solid methodological foundation and demonstrates significant advancements in talking head video generation. The thoughtful integration of blendshape and vertex representations, alongside innovative pose generation techniques, positions this framework as a key resource for future research and development in the field.