Overview of PV3D: A 3D Generative Model for Portrait Video Generation
The paper presents PV3D, a pioneering framework for generating high-quality, multi-view-consistent portrait videos. The work extends Generative Adversarial Networks (GANs), specifically 3D-aware GANs built on implicit neural representations. Prior methods predominantly generate either static images or 2D videos, and thus lack coherence across time and 3D viewpoints simultaneously. PV3D addresses these limitations with an integrated approach that models the spatio-temporal dynamics essential for realistic portrait video synthesis.
Methodology
Generative Framework: The authors build on existing 3D-aware GAN methodology with a temporal extension of the tri-plane representation. PV3D decouples the latent space into appearance and motion components, giving finer control over video dynamics without entangling motion with the underlying 3D geometry.
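To make the decoupling concrete, below is a minimal, self-contained sketch of how an appearance code could produce a static tri-plane while a motion code and timestep add per-frame offsets. The module and layer names (TriPlaneGenerator, app_net, motion_net) and the tiny dimensions are illustrative assumptions, not the paper's actual architecture:

```python
# Hypothetical sketch (not the authors' code): an appearance code yields one static
# set of tri-plane features, and a motion code plus timestep yields per-frame offsets,
# so motion never has to rewrite the underlying 3D geometry from scratch.
import torch
import torch.nn as nn

class TriPlaneGenerator(nn.Module):
    def __init__(self, z_dim=64, plane_ch=32, plane_res=16):
        super().__init__()
        self.plane_res = plane_res
        self.plane_ch = plane_ch
        # Appearance branch: three axis-aligned feature planes, shared by all frames.
        self.app_net = nn.Linear(z_dim, 3 * plane_ch * plane_res * plane_res)
        # Motion branch: per-frame residual planes conditioned on (z_motion, t).
        self.motion_net = nn.Linear(z_dim + 1, 3 * plane_ch * plane_res * plane_res)

    def forward(self, z_app, z_motion, timesteps):
        # z_app, z_motion: (B, z_dim); timesteps: (B, T), normalized to [0, 1]
        B, T = timesteps.shape
        static = self.app_net(z_app).view(B, 1, 3, self.plane_ch, self.plane_res, self.plane_res)
        zt = torch.cat([z_motion.unsqueeze(1).expand(B, T, -1), timesteps.unsqueeze(-1)], dim=-1)
        offsets = self.motion_net(zt).view(B, T, 3, self.plane_ch, self.plane_res, self.plane_res)
        # Appearance stays fixed across frames; motion only perturbs the planes.
        return static + offsets

planes = TriPlaneGenerator()(torch.randn(2, 64), torch.randn(2, 64), torch.rand(2, 4))
print(planes.shape)  # (2, 4, 3, 32, 16, 16): per-frame tri-planes for each sample
```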
Motion Generator: A motion generator handles the temporal dimension of the video. The motion code and timestep are first encoded into intermediate motion codes; these in turn drive modulated convolutions that construct motion features, allowing synthesized content to vary adaptively from frame to frame.
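A hedged sketch of this idea follows: a motion code and timestep are mapped to an intermediate motion code by a small MLP, which then modulates a StyleGAN2-style convolution so content can vary per frame. The layer sizes and names are assumptions for illustration, not the paper's exact layers:

```python
# Illustrative sketch: intermediate motion code w_m modulates convolution weights
# (StyleGAN2-style modulation/demodulation), letting features change frame by frame.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, style_dim, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.1)
        self.affine = nn.Linear(style_dim, in_ch)  # style -> per-channel scales

    def forward(self, x, style):
        B, C, H, W = x.shape
        s = self.affine(style).view(B, 1, C, 1, 1) + 1.0        # modulation scales
        w = self.weight.unsqueeze(0) * s                         # (B, O, C, k, k)
        demod = torch.rsqrt(w.pow(2).sum(dim=(2, 3, 4)) + 1e-8)  # demodulation
        w = w * demod.view(B, -1, 1, 1, 1)
        x = x.view(1, B * C, H, W)                               # grouped conv trick
        out = F.conv2d(x, w.view(-1, C, *w.shape[-2:]), padding=1, groups=B)
        return out.view(B, -1, H, W)

# Motion code + timestep -> intermediate motion code w_m (a plain MLP here).
motion_mlp = nn.Sequential(nn.Linear(64 + 1, 128), nn.LeakyReLU(0.2), nn.Linear(128, 128))
z_motion, t = torch.randn(2, 64), torch.rand(2, 1)
w_m = motion_mlp(torch.cat([z_motion, t], dim=-1))
frame_feat = ModulatedConv2d(32, 32, style_dim=128)(torch.randn(2, 32, 16, 16), w_m)
```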
Camera Conditioning: An innovative camera conditioning strategy is used to resolve ambiguities between human motion and camera movements. This involves conditioning the generator and discriminators on camera pose sequences, thereby promoting temporal coherence and maintaining consistency across views.
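One plausible way to implement such conditioning, sketched under assumptions about the pose format (flattened 4x4 extrinsics plus 3x3 intrinsics, as in EG3D-style models), is to embed the camera pose sequence and concatenate it with the per-frame features fed to the generator and discriminators:

```python
# Minimal sketch of one possible conditioning scheme (an assumption on our part):
# the camera trajectory is embedded and concatenated with per-frame codes, so both
# generation and discrimination are aware of which motion comes from the camera.
import torch
import torch.nn as nn

POSE_DIM = 25  # 16 extrinsic + 9 intrinsic values per frame (assumed layout)

class CameraConditioner(nn.Module):
    def __init__(self, pose_dim=POSE_DIM, embed_dim=64):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(pose_dim, embed_dim), nn.LeakyReLU(0.2))

    def forward(self, features, poses):
        # features: (B, T, F) per-frame codes; poses: (B, T, pose_dim) camera sequence
        return torch.cat([features, self.embed(poses)], dim=-1)

cond = CameraConditioner()
out = cond(torch.randn(2, 4, 128), torch.randn(2, 4, POSE_DIM))
print(out.shape)  # (2, 4, 192): pose-aware per-frame conditioning
```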
Dual-Discriminator Setup: PV3D employs two discriminators to regulate spatial and temporal consistency: an image discriminator that judges individual frames, and a video discriminator that evaluates sequences of frames to enforce temporal coherence.
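The sketch below illustrates the dual-discriminator idea with deliberately small architectures (the paper's discriminators are considerably more elaborate): an image discriminator scores individual frames, a video discriminator scores short clips, and the generator sums a non-saturating loss from both:

```python
# Hedged sketch of a dual-discriminator setup: D_img enforces per-frame realism,
# D_vid enforces temporal coherence over clips; the generator optimizes both.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageD(nn.Module):
    def __init__(self, ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, 32, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

    def forward(self, x):  # x: (B*T, C, H, W) individual frames
        return self.net(x)

class VideoD(nn.Module):
    def __init__(self, ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(ch, 32, (3, 4, 4), (1, 2, 2), (1, 1, 1)), nn.LeakyReLU(0.2),
            nn.Conv3d(32, 64, (3, 4, 4), (1, 2, 2), (1, 1, 1)), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, 1))

    def forward(self, x):  # x: (B, C, T, H, W) short clips
        return self.net(x)

def generator_loss(fake_video, d_img, d_vid):
    B, C, T, H, W = fake_video.shape
    frames = fake_video.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)
    # Non-saturating GAN loss on both per-frame realism and temporal coherence.
    return F.softplus(-d_img(frames)).mean() + F.softplus(-d_vid(fake_video)).mean()

loss = generator_loss(torch.randn(2, 3, 4, 64, 64), ImageD(), VideoD())
```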
Results
The paper reports substantial improvements over state-of-the-art methods in generating 3D portrait videos, backed by both qualitative and quantitative analyses. PV3D achieves a Fréchet Video Distance (FVD) of 29.1 on the VoxCeleb dataset, indicating more realistic and diverse motion than the concurrent 3DVidGen, which scores 65.5. Further evaluations demonstrate superior geometry quality and multi-view consistency, with metrics such as Multi-view Identity Consistency (ID) and warping error supporting the robustness of the approach.
Implications and Future Directions
PV3D's ability to generate high-fidelity, consistent portrait videos paves the way for downstream applications such as static portrait animation and video reconstruction. The disentanglement of motion and appearance codes not only improves video quality but also opens the door to fine-grained editing of generated content, which could be particularly useful in AR/VR settings where consistent multi-view synthesis is required.
From a theoretical standpoint, the paper lays the groundwork for extending neural rendering pipelines to video data, challenging future research to address longer-duration video synthesis and richer modalities. Practical progress in efficiently integrating high-resolution video data with robust 3D representations could accelerate advances in AI-driven media generation tools.
Conclusion
PV3D is a notable advance in 3D generative modeling, enabling the synthesis of high-quality, 3D-aware portrait videos from monocular inputs alone. Its innovations in motion generation, camera conditioning, and the dual-discriminator architecture underscore the potential of GAN frameworks to tackle complex spatio-temporal generation tasks. The work is likely to catalyze new research directions, enhancing both the practical utility and the theoretical understanding of video generation with generative models.