- The paper presents a novel three-stage framework that enhances generalization and fidelity in audio-driven 3D talking face synthesis.
- It incorporates a variational autoencoder with a flow-based prior for robust audio-to-motion mapping and a semi-supervised adversarial domain-adaptive post-net that refines the predicted 3D facial landmarks.
- The method achieves state-of-the-art performance with low FID scores and improved lip synchronization, setting a new benchmark for realistic digital humans.
GeneFace: Enhancing Audio-Driven 3D Talking Face Synthesis with Generalized NeRF-Based Methods
Introduction
The synthesis of photo-realistic talking head videos driven by arbitrary speech audio has broad applications, ranging from virtual reality to film-making. Recently, Neural Radiance Fields (NeRF) have shown promise for this task thanks to their ability to preserve 3D consistency and high image fidelity. Despite these advantages, the generalizability of existing NeRF-based methods to out-of-domain audio remains limited, largely because of the small scale of the available training data. This paper introduces GeneFace, a method designed to address these generalization and fidelity challenges by leveraging a large-scale lip-reading corpus together with novel architectural enhancements.
Methodology
GeneFace comprises a three-stage framework aimed at enhancing both the generalizability and fidelity of synthesized talking heads:
- Audio-to-Motion: A variational motion generator, trained on a large lip-reading corpus, predicts 3D facial landmarks from the input audio. It employs a variational autoencoder (VAE) with a flow-based prior for robust audio-to-motion mapping (a sketch of this mapping appears after this list).
- Motion Domain Adaptation: To mitigate the domain shift between the landmarks predicted on the corpus and those of the target person, a semi-supervised adversarial domain-adaptive post-net refines the predicted 3D landmarks so that they match the target person's distribution (see the adversarial training sketch below).
- Motion-to-Image: A conditioned NeRF-based renderer synthesizes high-fidelity frames from the adapted facial landmarks (a conditioning sketch follows the list). In addition, a head-aware torso-NeRF integrates head and torso rendering, reducing head-torso separation artifacts.
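To make the first stage concrete, the following is a minimal sketch in PyTorch of a conditional VAE that maps audio features to landmark motion. All layer sizes, feature dimensions, and loss weights are illustrative assumptions, not the authors' code, and a single affine coupling layer stands in for the full flow-based prior described in the paper.

```python
import math
import torch
import torch.nn as nn

AUDIO_DIM, LMK_DIM, Z_DIM = 80, 68 * 3, 16  # hypothetical feature sizes


class CouplingPrior(nn.Module):
    """One affine coupling layer: a stand-in for the flow-based prior."""
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * (dim - self.half)))

    def log_prob(self, z):
        z1, z2 = z[:, :self.half], z[:, self.half:]
        scale, shift = self.net(z1).chunk(2, dim=-1)
        scale = torch.tanh(scale)                 # keep the coupling stable
        u2 = (z2 - shift) * torch.exp(-scale)     # invert the coupling
        u = torch.cat([z1, u2], dim=-1)
        base = -0.5 * (u ** 2).sum(-1) - 0.5 * u.shape[-1] * math.log(2 * math.pi)
        return base - scale.sum(-1)               # change-of-variables term


class AudioToMotionVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(AUDIO_DIM + LMK_DIM, 256),
                                     nn.ReLU(), nn.Linear(256, 2 * Z_DIM))
        self.decoder = nn.Sequential(nn.Linear(AUDIO_DIM + Z_DIM, 256),
                                     nn.ReLU(), nn.Linear(256, LMK_DIM))
        self.prior = CouplingPrior(Z_DIM)

    def forward(self, audio, landmarks):
        mu, logvar = self.encoder(torch.cat([audio, landmarks], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        recon = self.decoder(torch.cat([audio, z], -1))
        recon_loss = ((recon - landmarks) ** 2).mean()
        # KL proxy: -(prior log-prob + Gaussian posterior entropy), constants dropped
        kl_proxy = -(self.prior.log_prob(z) + 0.5 * logvar.sum(-1)).mean()
        return recon, recon_loss + 1e-3 * kl_proxy  # KL weight is an assumption


# Usage: predict landmarks for a batch of audio windows (random stand-in data)
model = AudioToMotionVAE()
audio, lmk = torch.randn(4, AUDIO_DIM), torch.randn(4, LMK_DIM)
recon, loss = model(audio, lmk)
loss.backward()
```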
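The second stage can be sketched as a standard adversarial training loop: a post-net refines landmarks predicted on the corpus so that a discriminator cannot distinguish them from landmarks extracted from the target person's video, while a small supervised term on the few aligned frames provides the semi-supervised signal. Network sizes, the residual formulation, and the loss weight below are assumptions.

```python
import torch
import torch.nn as nn

LMK_DIM = 68 * 3  # hypothetical landmark feature size (68 keypoints x 3)

# Post-net refines predicted landmarks; the discriminator judges whether a
# landmark frame comes from the target person's video or from the post-net.
postnet = nn.Sequential(nn.Linear(LMK_DIM, 256), nn.ReLU(), nn.Linear(256, LMK_DIM))
disc = nn.Sequential(nn.Linear(LMK_DIM, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(postnet.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()


def train_step(pred_lmk, target_lmk, paired_pred, paired_gt):
    """pred_lmk: landmarks predicted from corpus audio (source domain).
    target_lmk: landmarks extracted from the target-person video.
    paired_pred / paired_gt: aligned frames supplying the supervised term."""
    # Discriminator step: real target landmarks vs. refined predictions.
    refined = pred_lmk + postnet(pred_lmk)  # residual refinement
    real = torch.ones(target_lmk.size(0), 1)
    fake = torch.zeros(refined.size(0), 1)
    d_loss = bce(disc(target_lmk), real) + bce(disc(refined.detach()), fake)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Post-net step: fool the discriminator + supervised loss on the pairs.
    adv = bce(disc(refined), torch.ones(refined.size(0), 1))
    sup = ((paired_pred + postnet(paired_pred) - paired_gt) ** 2).mean()
    g_loss = adv + 10.0 * sup  # the supervised weight is an assumption
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()


# Usage with random stand-in batches
d_l, g_l = train_step(torch.randn(32, LMK_DIM), torch.randn(32, LMK_DIM),
                      torch.randn(8, LMK_DIM), torch.randn(8, LMK_DIM))
```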
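For the third stage, the sketch below shows one plausible way to condition a NeRF field on facial landmarks: each 3D sample point is positionally encoded and concatenated with a per-frame landmark code, and the MLP returns a density and a view-dependent colour that a standard volume-rendering loop would integrate along camera rays. The network sizes and the form of the landmark code are assumptions, and the torso-NeRF and rendering loop are omitted.

```python
import torch
import torch.nn as nn


def positional_encoding(x, n_freqs=10):
    """Map coordinates to [x, sin(2^k x), cos(2^k x)] features, as in NeRF."""
    feats = [x]
    for k in range(n_freqs):
        feats += [torch.sin((2 ** k) * x), torch.cos((2 ** k) * x)]
    return torch.cat(feats, dim=-1)


class LandmarkConditionedNeRF(nn.Module):
    def __init__(self, lmk_dim=68 * 3, n_freqs=10):
        super().__init__()
        in_dim = 3 * (1 + 2 * n_freqs) + lmk_dim
        self.trunk = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 256), nn.ReLU())
        self.sigma = nn.Linear(256, 1)                    # volume density head
        self.rgb = nn.Sequential(nn.Linear(256 + 3, 128), nn.ReLU(),
                                 nn.Linear(128, 3), nn.Sigmoid())

    def forward(self, pts, view_dir, lmk_code):
        h = self.trunk(torch.cat([positional_encoding(pts), lmk_code], -1))
        sigma = torch.relu(self.sigma(h))                 # non-negative density
        rgb = self.rgb(torch.cat([h, view_dir], -1))      # view-dependent colour
        return sigma, rgb


# Usage: query densities/colours for sample points of one frame
field = LandmarkConditionedNeRF()
pts = torch.randn(1024, 3)                       # 3D sample points
dirs = torch.randn(1024, 3)                      # viewing directions
code = torch.randn(1, 68 * 3).expand(1024, -1)   # landmark code for the frame
sigma, rgb = field(pts, dirs, code)
```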
Experimental Results
GeneFace was evaluated extensively against state-of-the-art methods on metrics including Fréchet Inception Distance (FID), landmark distance (LMD), and SyncNet confidence score. It demonstrated superior performance on both objective and subjective measures for in-domain and out-of-domain audio inputs. In particular, GeneFace achieved the lowest FID scores (22.88 in-domain and 27.38 out-of-domain) and showed clear improvements in lip synchronization and perceived video realness.
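As a point of reference for the LMD metric, the following is a minimal sketch of how a landmark-distance score can be computed: the mean Euclidean distance between corresponding mouth landmarks in generated and ground-truth frames, averaged over a clip. The mouth index range follows the common 68-point convention and is an assumption; FID and SyncNet confidence come from their respective pretrained networks and are not shown.

```python
import numpy as np


def landmark_distance(pred, gt, mouth_idx=range(48, 68)):
    """pred, gt: (num_frames, 68, 2) arrays of 2D facial landmarks."""
    mouth_pred = pred[:, list(mouth_idx)]
    mouth_gt = gt[:, list(mouth_idx)]
    per_point = np.linalg.norm(mouth_pred - mouth_gt, axis=-1)  # (frames, points)
    return per_point.mean()


# Usage with random stand-in data
pred = np.random.rand(100, 68, 2)
gt = np.random.rand(100, 68, 2)
print(f"LMD: {landmark_distance(pred, gt):.3f}")
```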
Implications and Future Directions
The advancements presented in GeneFace underscore the potential of combining large-scale lip-reading corpora with NeRF-based methods for talking face generation tasks. By addressing the "mean face" problem and enhancing the renderer's generalizability, GeneFace sets a new benchmark for producing natural and high-fidelity talking head videos.
Looking forward, integrating temporal information directly into the network architecture could further stabilize the generated landmark sequences and minimize jitter artifacts. Moreover, adopting recent developments in accelerated and lightweight NeRF techniques could significantly reduce training and inference times, broadening the practical applicability of GeneFace in real-world scenarios.
By pushing the frontiers of audio-driven 3D talking face synthesis, GeneFace offers exciting prospects for creating more immersive and realistic digital human representations in virtual environments.