- The paper introduces an innovative end-to-end framework that bypasses intermediate representations by mapping audio features directly to dynamic neural radiance fields.
- It employs dual radiance fields to separately render the head and torso, ensuring precise synchronization of facial expressions and body movements.
- Experimental results demonstrate competitive naturalness and fidelity in video synthesis while significantly reducing training data requirements.
Overview of AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis
The paper "AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis" presents an innovative approach to generating high-fidelity talking head videos. The methodology leverages neural radiance fields (NeRFs) without relying on intermediate representations like 2D landmarks or 3D face models. The authors propose a system that directly maps audio features to a dynamic neural radiance field, rendering a synthesized video that includes both the head and upper body.
Methodology
Traditional methods for talking head synthesis often depend on intermediate representations (such as 2D landmarks or 3D face model parameters) to translate audio signals into visual output, which can introduce information loss and semantic mismatch between the audio and the rendered face. In contrast, AD-NeRF uses an end-to-end pipeline that feeds audio features directly into a conditional implicit function: a neural scene representation network whose dynamic radiance field is rendered into high-quality video frames.
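To make the idea of a conditional implicit function concrete, the following is a minimal PyTorch-style sketch: an MLP that maps an encoded sample position, an encoded view direction, and a per-frame audio feature to color and density. The class name, layer sizes, and feature dimensions (pos_dim, dir_dim, audio_dim) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AudioConditionedField(nn.Module):
    """Illustrative audio-conditioned implicit function (hypothetical sizes)."""
    def __init__(self, pos_dim=63, dir_dim=27, audio_dim=64, hidden=256):
        super().__init__()
        # Backbone conditioned on the encoded sample location and the audio feature.
        self.backbone = nn.Sequential(
            nn.Linear(pos_dim + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)        # volume density
        self.color_head = nn.Sequential(              # view-dependent color
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, x_enc, d_enc, audio_feat):
        # x_enc: encoded sample positions, d_enc: encoded view directions,
        # audio_feat: per-frame audio condition broadcast to every sample.
        h = self.backbone(torch.cat([x_enc, audio_feat], dim=-1))
        sigma = self.sigma_head(h)
        rgb = self.color_head(torch.cat([h, d_enc], dim=-1))
        return rgb, sigma
```

The audio feature replaces the expression coefficients or landmark inputs that earlier pipelines would compute as an intermediate step.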
Key aspects of the method include:
- Audio Feature Integration: The system extracts semantic features from audio using DeepSpeech and uses these as input conditions for the neural radiance fields, eliminating the need for expression coefficients or facial landmarks.
- Dual Neural Radiance Fields: The framework decomposes the scene into two components, one radiance field for the head and one for the torso. This separation accounts for the fact that the torso does not move rigidly with the head pose, improving the realism of the synthesized output.
- Volume Rendering Technique: The final frames are synthesized with volume rendering, which preserves fine-scale facial details such as teeth and hair; a rendering sketch follows this list.
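The sketch below shows standard NeRF-style volume rendering, which composites per-sample colors and densities along each camera ray into a pixel color. The function name and tensor shapes are assumptions for illustration; the accumulated opacity can be used to blend the rendered foreground over a chosen background image, in line with the paper's flexible-background claim.

```python
import torch

def volume_render(rgb, sigma, z_vals):
    """rgb: (rays, samples, 3), sigma: (rays, samples), z_vals: (rays, samples)."""
    # Distances between adjacent samples along each ray (last segment padded).
    dists = z_vals[..., 1:] - z_vals[..., :-1]
    dists = torch.cat([dists, torch.full_like(dists[..., :1], 1e10)], dim=-1)
    # Per-segment opacity and transmittance accumulated along the ray.
    alpha = 1.0 - torch.exp(-torch.relu(sigma) * dists)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[..., :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[..., :-1]
    weights = alpha * trans
    # Weighted sum of sample colors gives the rendered pixel color;
    # the accumulated weight is the foreground opacity for background blending.
    pixel_rgb = (weights[..., None] * rgb).sum(dim=-2)
    acc = weights.sum(dim=-1)
    return pixel_rgb, acc
```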
Results and Implications
Experimental results validate the method's ability to produce natural-looking videos that synchronize well with the audio input, while allowing flexible adjustment of the viewing direction and background imagery. The proposed system is competitive with traditional GAN-based approaches and benefits from mapping audio directly to the visual output without intermediate representations.
The paper demonstrates robust performance across various testing scenarios, achieving comparable levels of naturalness and fidelity with significantly less training data. This reduction in training requirements presents promising opportunities for applications in digital humans, virtual conferences, and interactive robotics.
Future Directions
AD-NeRF's framework facilitates several avenues for future research:
- Cross-Language and Identity Synthesis: Although the model achieves impressive results, its ability to handle diverse languages and speaker identities could be explored further to enhance versatility.
- Dynamic Backgrounds: Extending the NeRF framework to accommodate more dynamic or complex backgrounds may widen the applicability in virtual and augmented reality contexts.
- Fine-tuning of Motion Dynamics: Additional work could refine the system's handling of nuanced expressions and motion dynamics, particularly for non-rigid body parts.
In conclusion, the paper provides a thorough treatment of audio-driven talking head synthesis with neural radiance fields, setting the stage for further improvements in virtual representation technologies. Using audio as a direct condition for neural rendering, rather than routing it through intermediate face models, is a noteworthy advance in the field.