- The paper introduces DaGAN, which recovers dense facial depth maps through self-supervised learning and uses this 3D geometry to enhance realism in talking head videos.
- It employs depth-guided facial keypoint estimation to accurately model motion and capture fine-grained expressions.
- Extensive evaluations on VoxCeleb1 and CelebV show that DaGAN outperforms state-of-the-art methods in structural preservation and expression accuracy.
Depth-Aware Generative Adversarial Network for Talking Head Video Generation
The paper "Depth-Aware Generative Adversarial Network for Talking Head Video Generation" introduces a novel approach to synthesizing talking head videos by leveraging 3D geometry. This research addresses the limitations of prior methods that primarily rely on 2D representations by integrating a self-supervised learning framework to recover dense 3D facial geometry information, specifically depth maps, for more realistic animation of human faces given a single source image and a driving video.
Methodology
The core contribution of this work is the Depth-aware Generative Adversarial Network (DaGAN), which generates high-fidelity face videos with improved structural accuracy and motion realism. The framework is built around three key components:
- Self-Supervised Depth Learning: This component recovers pixel-wise depth maps from face videos without requiring ground-truth 3D annotations. It trains on pairs of video frames, enforcing geometric warping and photometric consistency between them, which lets the model learn dense 3D structure from monocular input (a minimal sketch of this objective appears after this list).
- Depth-Guided Facial Keypoint Estimation: The dense depth maps guide the estimation of sparse facial keypoints that capture the essential movements of the face. These keypoints define motion fields for warping the source image to match the driving video's expressions and poses (see the second sketch below).
- Cross-Modal Attention Mechanism: A novel attention mechanism fuses appearance features with depth information to improve the accuracy of facial structure and expression generation. The resulting attention maps focus the model on geometrically important facial regions, yielding more detailed and expressive face synthesis (third sketch below).
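To make the depth-learning idea concrete, here is a minimal PyTorch sketch of a photometric-consistency objective in the style of self-supervised monocular depth estimation. The `depth_net` and `pose_net` modules and the intrinsics `K`/`K_inv` are assumed inputs; the paper's actual networks and loss terms differ in detail, so this illustrates the general recipe rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def photometric_loss(depth_net, pose_net, frame_t, frame_s, K, K_inv):
    """Photometric-consistency objective for self-supervised depth learning.
    depth_net, pose_net, K, and K_inv are assumed inputs: a depth network,
    a relative-pose network, and 3x3 camera intrinsics with their inverse."""
    B, _, H, W = frame_t.shape
    depth = depth_net(frame_t)                          # (B, 1, H, W) predicted depth
    pose = pose_net(torch.cat([frame_t, frame_s], 1))   # (B, 3, 4) relative camera motion

    # Homogeneous pixel grid of shape (3, H*W).
    ys, xs = torch.meshgrid(
        torch.arange(H, device=frame_t.device),
        torch.arange(W, device=frame_t.device),
        indexing="ij",
    )
    grid = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(3, -1)

    # Back-project pixels to 3D with the predicted depth, apply the predicted
    # rigid motion, and re-project into the source frame.
    cam = (K_inv @ grid) * depth.view(B, 1, -1)         # (B, 3, H*W) 3D points
    cam = pose[:, :, :3] @ cam + pose[:, :, 3:]         # rotate + translate
    pix = K @ cam
    pix = pix[:, :2] / pix[:, 2:].clamp(min=1e-6)       # perspective divide

    # Normalize coordinates to [-1, 1] and warp the source frame.
    gx = 2 * pix[:, 0] / (W - 1) - 1
    gy = 2 * pix[:, 1] / (H - 1) - 1
    warp_grid = torch.stack([gx, gy], -1).view(B, H, W, 2)
    warped = F.grid_sample(frame_s, warp_grid, align_corners=True)

    # If depth and pose are right, the warped source matches the target.
    return (warped - frame_t).abs().mean()
```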
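The second sketch illustrates one way depth could guide keypoint estimation: the recovered depth map is concatenated with the RGB frame, and a small encoder regresses keypoints via soft-argmax over heatmaps. The architecture, layer sizes, and keypoint count here are assumptions for illustration, not the paper's network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthGuidedKeypointDetector(nn.Module):
    """Illustrative sketch: fuses RGB appearance with the recovered depth map
    and regresses sparse keypoints via soft-argmax over per-keypoint heatmaps."""

    def __init__(self, num_kp=15):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 64, 3, stride=2, padding=1), nn.ReLU(),  # RGB + depth = 4 channels
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, num_kp, 3, padding=1),                 # one heatmap per keypoint
        )

    def forward(self, rgb, depth):
        x = torch.cat([rgb, depth], dim=1)            # fuse appearance and geometry
        heatmaps = self.encoder(x)                    # (B, K, h, w)
        B, K, h, w = heatmaps.shape
        probs = F.softmax(heatmaps.view(B, K, -1), dim=-1).view(B, K, h, w)

        # Soft-argmax: expected x/y coordinate under each heatmap, in [-1, 1].
        ys = torch.linspace(-1, 1, h, device=rgb.device)
        xs = torch.linspace(-1, 1, w, device=rgb.device)
        kp_x = (probs.sum(dim=2) * xs).sum(dim=-1)    # marginalize rows, expect over cols
        kp_y = (probs.sum(dim=3) * ys).sum(dim=-1)    # marginalize cols, expect over rows
        return torch.stack([kp_x, kp_y], dim=-1)      # (B, K, 2) keypoints
```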
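Finally, a sketch of a cross-modal attention block. Using depth features as queries against appearance keys and values is one plausible arrangement; the projection layers and residual fusion shown are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Sketch of cross-modal attention fusing depth and appearance feature
    maps: depth geometry decides which appearance locations to emphasize."""

    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)   # query from depth features
        self.k = nn.Conv2d(channels, channels, 1)   # key from appearance features
        self.v = nn.Conv2d(channels, channels, 1)   # value from appearance features
        self.scale = channels ** -0.5

    def forward(self, feat_app, feat_depth):
        B, C, H, W = feat_app.shape
        q = self.q(feat_depth).view(B, C, -1)       # (B, C, HW)
        k = self.k(feat_app).view(B, C, -1)
        v = self.v(feat_app).view(B, C, -1)

        # Spatial attention map; note the HW x HW matrix is memory-heavy and
        # only practical at coarse feature resolutions.
        attn = torch.softmax(q.transpose(1, 2) @ k * self.scale, dim=-1)  # (B, HW, HW)
        out = (v @ attn.transpose(1, 2)).view(B, C, H, W)
        return out + feat_app                       # residual fusion with appearance
```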
Experimental Evaluation
The authors extensively evaluate DaGAN on VoxCeleb1 and CelebV, two datasets commonly used for face reenactment. The evaluation uses SSIM and PSNR for reconstruction quality, CSIM for identity preservation, PRMSE for head-pose accuracy, and AUCON for expression (action-unit) accuracy. DaGAN demonstrates superior structure preservation and motion realism compared to existing state-of-the-art methods, and particularly excels at capturing fine-grained facial details and expressions, as evidenced by higher AUCON scores (a sketch of the reconstruction metrics follows).
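For reference, the two reconstruction metrics can be computed as below (a minimal sketch using NumPy and scikit-image; CSIM, PRMSE, and AUCON additionally require pretrained identity, head-pose, and action-unit estimators, so they are omitted here).

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(reference, generated, data_range=255.0):
    """Peak signal-to-noise ratio between a ground-truth frame and a
    generated frame, given as arrays of the same shape."""
    mse = np.mean((reference.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

def ssim(reference, generated):
    """Structural similarity via scikit-image; channel_axis marks RGB."""
    return structural_similarity(reference, generated, channel_axis=-1, data_range=255)
```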
Conclusion and Implications
This research highlights the significant benefits of incorporating depth information into talking head generation, enabling more realistic and structurally accurate human face animation. DaGAN's framework can be applied to virtual avatars, video conferencing, and content creation in digital media. Furthermore, it sets the stage for future work on deeper 3D understanding and multi-modal data fusion in generative models.
While DaGAN clearly outperforms several contemporary methods, future research could explore its scalability to more diverse and complex scenarios, potentially incorporating audio cues for more dynamic, context-aware talking head generation. Extending the approach with richer geometric models, or to full-body animation, is another promising direction.