- The paper presents a novel three-stage framework that enhances generalization and fidelity in audio-driven 3D talking face synthesis.
- It incorporates a variational autoencoder with a flow-based prior for robust audio-to-motion mapping and a semi-supervised adversarial domain-adaptive post-net that refines the predicted 3D facial landmarks.
- The method achieves state-of-the-art performance with low FID scores and improved lip synchronization, setting a new benchmark for realistic digital humans.
GeneFace: Enhancing Audio-Driven 3D Talking Face Synthesis with Generalized NeRF-Based Methods
Introduction
The synthesis of photo-realistic talking head videos driven by arbitrary speech audio has broad applications, ranging from virtual reality to film-making. Recently, Neural Radiance Fields (NeRF) have shown promise for this task thanks to their ability to preserve 3D consistency and high image fidelity. Despite these advantages, the generalizability of existing NeRF-based methods to out-of-domain audio remains limited, largely because of the small scale of the available training data. This paper introduces GeneFace, a method designed to address these generalization and fidelity challenges by leveraging a large-scale lip-reading corpus together with novel architectural enhancements.
Methodology
GeneFace comprises a three-stage framework aimed at enhancing both the generalizability and fidelity of synthesized talking heads:
- Audio-to-Motion: A variational motion generator, trained on a large lip-reading corpus, predicts 3D facial landmarks from the input audio. It employs a variational autoencoder (VAE) with a flow-based prior for robust audio-to-motion mapping (a sketch of this mapping appears after this list).
- Motion Domain Adaptation: To mitigate the domain shift between the landmarks predicted on the corpus and those of the target person, a semi-supervised adversarial domain-adaptive post-net refines the predicted 3D landmarks so that they match the target person's distribution (see the adversarial training sketch below).
- Motion-to-Image: A conditioned NeRF-based renderer synthesizes high-fidelity frames from the adapted facial landmarks (a conditioning sketch follows the list). In addition, a head-aware torso-NeRF integrates head and torso rendering, reducing head-torso separation artifacts.
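To make the first stage concrete, the following is a minimal sketch in PyTorch of a conditional VAE that maps audio features to landmark motion. All layer sizes, feature dimensions, and loss weights are illustrative assumptions, not the authors' code, and a single affine coupling layer stands in for the full flow-based prior described in the paper.

```python
import math
import torch
import torch.nn as nn

AUDIO_DIM, LMK_DIM, Z_DIM = 80, 68 * 3, 16  # hypothetical feature sizes


class CouplingPrior(nn.Module):
    """One affine coupling layer: a stand-in for the flow-based prior."""
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * (dim - self.half)))

    def log_prob(self, z):
        z1, z2 = z[:, :self.half], z[:, self.half:]
        scale, shift = self.net(z1).chunk(2, dim=-1)
        scale = torch.tanh(scale)                 # keep the coupling stable
        u2 = (z2 - shift) * torch.exp(-scale)     # invert the coupling
        u = torch.cat([z1, u2], dim=-1)
        base = -0.5 * (u ** 2).sum(-1) - 0.5 * u.shape[-1] * math.log(2 * math.pi)
        return base - scale.sum(-1)               # change-of-variables term


class AudioToMotionVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(AUDIO_DIM + LMK_DIM, 256),
                                     nn.ReLU(), nn.Linear(256, 2 * Z_DIM))
        self.decoder = nn.Sequential(nn.Linear(AUDIO_DIM + Z_DIM, 256),
                                     nn.ReLU(), nn.Linear(256, LMK_DIM))
        self.prior = CouplingPrior(Z_DIM)

    def forward(self, audio, landmarks):
        mu, logvar = self.encoder(torch.cat([audio, landmarks], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        recon = self.decoder(torch.cat([audio, z], -1))
        recon_loss = ((recon - landmarks) ** 2).mean()
        # KL proxy: -(prior log-prob + Gaussian posterior entropy), constants dropped
        kl_proxy = -(self.prior.log_prob(z) + 0.5 * logvar.sum(-1)).mean()
        return recon, recon_loss + 1e-3 * kl_proxy  # KL weight is an assumption


# Usage: predict landmarks for a batch of audio windows (random stand-in data)
model = AudioToMotionVAE()
audio, lmk = torch.randn(4, AUDIO_DIM), torch.randn(4, LMK_DIM)
recon, loss = model(audio, lmk)
loss.backward()
```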
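The second stage can be sketched as a standard adversarial training loop: a post-net refines landmarks predicted on the corpus so that a discriminator cannot distinguish them from landmarks extracted from the target person's video, while a small supervised term on the few aligned frames provides the semi-supervised signal. Network sizes, the residual formulation, and the loss weight below are assumptions.

```python
import torch
import torch.nn as nn

LMK_DIM = 68 * 3  # hypothetical landmark feature size (68 keypoints x 3)

# Post-net refines predicted landmarks; the discriminator judges whether a
# landmark frame comes from the target person's video or from the post-net.
postnet = nn.Sequential(nn.Linear(LMK_DIM, 256), nn.ReLU(), nn.Linear(256, LMK_DIM))
disc = nn.Sequential(nn.Linear(LMK_DIM, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(postnet.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()


def train_step(pred_lmk, target_lmk, paired_pred, paired_gt):
    """pred_lmk: landmarks predicted from corpus audio (source domain).
    target_lmk: landmarks extracted from the target-person video.
    paired_pred / paired_gt: aligned frames supplying the supervised term."""
    # Discriminator step: real target landmarks vs. refined predictions.
    refined = pred_lmk + postnet(pred_lmk)  # residual refinement
    real = torch.ones(target_lmk.size(0), 1)
    fake = torch.zeros(refined.size(0), 1)
    d_loss = bce(disc(target_lmk), real) + bce(disc(refined.detach()), fake)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Post-net step: fool the discriminator + supervised loss on the pairs.
    adv = bce(disc(refined), torch.ones(refined.size(0), 1))
    sup = ((paired_pred + postnet(paired_pred) - paired_gt) ** 2).mean()
    g_loss = adv + 10.0 * sup  # the supervised weight is an assumption
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()


# Usage with random stand-in batches
d_l, g_l = train_step(torch.randn(32, LMK_DIM), torch.randn(32, LMK_DIM),
                      torch.randn(8, LMK_DIM), torch.randn(8, LMK_DIM))
```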
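For the third stage, the sketch below shows one plausible way to condition a NeRF field on facial landmarks: each 3D sample point is positionally encoded and concatenated with a per-frame landmark code, and the MLP returns a density and a view-dependent colour that a standard volume-rendering loop would integrate along camera rays. The network sizes and the form of the landmark code are assumptions, and the torso-NeRF and rendering loop are omitted.

```python
import torch
import torch.nn as nn


def positional_encoding(x, n_freqs=10):
    """Map coordinates to [x, sin(2^k x), cos(2^k x)] features, as in NeRF."""
    feats = [x]
    for k in range(n_freqs):
        feats += [torch.sin((2 ** k) * x), torch.cos((2 ** k) * x)]
    return torch.cat(feats, dim=-1)


class LandmarkConditionedNeRF(nn.Module):
    def __init__(self, lmk_dim=68 * 3, n_freqs=10):
        super().__init__()
        in_dim = 3 * (1 + 2 * n_freqs) + lmk_dim
        self.trunk = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 256), nn.ReLU())
        self.sigma = nn.Linear(256, 1)                    # volume density head
        self.rgb = nn.Sequential(nn.Linear(256 + 3, 128), nn.ReLU(),
                                 nn.Linear(128, 3), nn.Sigmoid())

    def forward(self, pts, view_dir, lmk_code):
        h = self.trunk(torch.cat([positional_encoding(pts), lmk_code], -1))
        sigma = torch.relu(self.sigma(h))                 # non-negative density
        rgb = self.rgb(torch.cat([h, view_dir], -1))      # view-dependent colour
        return sigma, rgb


# Usage: query densities/colours for sample points of one frame
field = LandmarkConditionedNeRF()
pts = torch.randn(1024, 3)                       # 3D sample points
dirs = torch.randn(1024, 3)                      # viewing directions
code = torch.randn(1, 68 * 3).expand(1024, -1)   # landmark code for the frame
sigma, rgb = field(pts, dirs, code)
```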
Experimental Results
GeneFace was evaluated extensively against state-of-the-art methods on metrics including Fréchet Inception Distance (FID), landmark distance (LMD), and SyncNet confidence score. It demonstrated superior performance on both objective and subjective measures for in-domain and out-of-domain audio inputs. In particular, GeneFace achieved the lowest FID scores (22.88 in-domain and 27.38 out-of-domain) and showed clear improvements in lip synchronization and perceived video realness.
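As a point of reference for the LMD metric, the following is a minimal sketch of how a landmark-distance score can be computed: the mean Euclidean distance between corresponding mouth landmarks in generated and ground-truth frames, averaged over a clip. The mouth index range follows the common 68-point convention and is an assumption; FID and SyncNet confidence come from their respective pretrained networks and are not shown.

```python
import numpy as np


def landmark_distance(pred, gt, mouth_idx=range(48, 68)):
    """pred, gt: (num_frames, 68, 2) arrays of 2D facial landmarks."""
    mouth_pred = pred[:, list(mouth_idx)]
    mouth_gt = gt[:, list(mouth_idx)]
    per_point = np.linalg.norm(mouth_pred - mouth_gt, axis=-1)  # (frames, points)
    return per_point.mean()


# Usage with random stand-in data
pred = np.random.rand(100, 68, 2)
gt = np.random.rand(100, 68, 2)
print(f"LMD: {landmark_distance(pred, gt):.3f}")
```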
Implications and Future Directions
The advancements presented in GeneFace underscore the potential of combining large-scale lip-reading corpora with NeRF-based methods for talking face generation tasks. By addressing the "mean face" problem and enhancing the renderer's generalizability, GeneFace sets a new benchmark for producing natural and high-fidelity talking head videos.
Looking forward, integrating temporal information directly into the network architecture could further stabilize the generated landmark sequences and minimize jitter artifacts. Moreover, adopting recent developments in accelerated and lightweight NeRF techniques could significantly reduce training and inference times, broadening the practical applicability of GeneFace in real-world scenarios.
By pushing the frontiers of audio-driven 3D talking face synthesis, GeneFace offers exciting prospects for creating more immersive and realistic digital human representations in virtual environments.