Efficient Geometry-aware 3D Generative Adversarial Networks
Overview
The paper addresses the challenge of unsupervised generation of high-quality, multi-view-consistent images and 3D shapes from collections of single-view 2D photographs alone. Existing 3D GANs are either too compute-intensive to train at high resolution or rely on approximations that compromise 3D consistency. This work proposes a novel architecture that improves computational efficiency and image quality without relying heavily on such approximations.
Key Contributions
- Hybrid Explicit-Implicit Network Architecture: The proposed method employs a tri-plane 3D representation that enables efficient and expressive high-resolution, geometry-aware image synthesis. It combines the strengths of explicit voxel grids and neural implicit representations while mitigating their individual limitations (see the feature-query sketch after this list).
- Dual Discrimination Strategy: A dual-discrimination approach ensures that the raw neural renderings remain consistent with the final super-resolved outputs, regularizing the view inconsistencies that image-space convolutional upsampling would otherwise introduce.
- Pose-conditioned Generator: The generator is conditioned on the camera pose, which lets it model pose-correlated attributes present in the training data (e.g., facial expressions correlated with camera angle); at inference the conditioning pose can be held fixed, so these attributes stay consistent across rendered views.
- Integration with State-of-the-art 2D GANs: The framework incorporates StyleGAN2 as the 2D CNN backbone that generates the tri-plane features, leveraging the efficiency and expressiveness of state-of-the-art 2D GANs to produce highly detailed and consistent multi-view renderings.
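As a minimal sketch of how a tri-plane feature query might look in PyTorch: each 3D point is projected onto the three axis-aligned planes, features are bilinearly sampled from each plane, summed, and decoded by a small MLP. The function and argument names (`query_triplane`, `planes`, `decoder_mlp`, `points`) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def query_triplane(planes, decoder_mlp, points):
    """Decode features for 3D points from a tri-plane representation.

    planes:      (3, C, H, W) feature planes for the XY, XZ, and YZ planes
    decoder_mlp: small MLP mapping (N, C) aggregated features to outputs
                 such as color and density
    points:      (N, 3) coordinates, assumed normalized to [-1, 1]
    """
    # Project each 3D point onto the three axis-aligned planes.
    projections = (points[:, [0, 1]],   # onto the XY plane
                   points[:, [0, 2]],   # onto the XZ plane
                   points[:, [1, 2]])   # onto the YZ plane
    sampled = []
    for plane, coords in zip(planes, projections):
        # Bilinearly interpolate plane features at the projected coordinates;
        # grid_sample expects a grid of shape (1, N, 1, 2) in [-1, 1].
        grid = coords.view(1, -1, 1, 2)
        feats = F.grid_sample(plane.unsqueeze(0), grid,
                              mode='bilinear', align_corners=False)
        sampled.append(feats.squeeze(0).squeeze(-1).t())  # (N, C)
    # Aggregate the three plane features by summation, then decode.
    return decoder_mlp(torch.stack(sampled, dim=0).sum(dim=0))
```

In this sketch, `decoder_mlp` could be a small `torch.nn.Sequential` that maps the summed C-dimensional feature to a color plus a scalar density, which is then composited along camera rays by standard volume rendering.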
Methodology
- Representation and Rendering: The tri-plane representation aligns explicit features along three axis-aligned orthogonal feature planes. A queried 3D point is projected onto each plane, the sampled features are aggregated via summation and interpreted by a small MLP, and the result is rendered into 2D feature images via neural volume rendering.
- Super-resolution Module: To keep neural rendering computationally tractable, volume rendering is performed at a moderate resolution; the resulting feature images are then upsampled to the final output resolution with image-space convolutions.
- Conditional Discrimination: Dual discrimination ensures that the raw neural renderings and the super-resolved outputs remain mutually consistent and of high fidelity (a sketch follows this list). Additionally, conditioning the discriminator on the rendering camera pose encourages the generator to learn correct 3D priors.
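The sketch below illustrates one way dual discrimination can be assembled: the raw neural rendering is upsampled to the final resolution and concatenated with the super-resolved image into a six-channel discriminator input, while real images are paired with a blurred copy of themselves so both branches share the same format. Function names, shapes, and the simple Gaussian blur are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dual_discriminator_input(rgb_final, rgb_raw):
    """Assemble the six-channel image fed to the discriminator for fakes.

    rgb_final: (B, 3, H, W)         super-resolved RGB output
    rgb_raw:   (B, 3, H_low, W_low) RGB portion of the raw neural rendering
    """
    # Upsample the raw rendering to the final resolution and concatenate it
    # with the super-resolved image along the channel dimension.
    raw_up = F.interpolate(rgb_raw, size=rgb_final.shape[-2:],
                           mode='bilinear', align_corners=False)
    return torch.cat([rgb_final, raw_up], dim=1)  # (B, 6, H, W)

def real_image_input(rgb_real, blur_sigma=1.0, kernel_size=9):
    """Pair a real image with a blurred copy of itself so real and generated
    samples share the same six-channel format (assumption: a Gaussian blur
    stands in for the smoothing effect of the upsampled rendering)."""
    coords = torch.arange(kernel_size, dtype=torch.float32) - kernel_size // 2
    g = torch.exp(-coords ** 2 / (2 * blur_sigma ** 2))
    g = (g / g.sum()).to(rgb_real)
    kernel = (g[:, None] * g[None, :]).expand(3, 1, kernel_size, kernel_size)
    blurred = F.conv2d(rgb_real, kernel, padding=kernel_size // 2, groups=3)
    return torch.cat([rgb_real, blurred], dim=1)  # (B, 6, H, W)
```

Because both the raw and super-resolved channels are judged jointly, the generator is penalized whenever the upsampled output drifts away from the underlying 3D rendering, which is what discourages view-inconsistent "cheating" by the 2D super-resolution layers.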
Results
The proposed method demonstrates state-of-the-art results in unconditional 3D-aware image synthesis. Comparisons against leading techniques such as GIRAFFE, π-GAN, and Lifting StyleGAN highlight substantial improvements in image quality, multi-view consistency, and the quality of the recovered 3D geometry.
- Qualitative Results: The generated images exhibit higher fidelity and detailed structures compared to the baselines.
- Quantitative Evaluations: The model achieves superior scores in FID, identity consistency, depth accuracy, and pose accuracy across datasets like FFHQ and AFHQ Cats.
Implications
This work represents a significant advancement in 3D-aware GANs, enabling efficient, high-resolution, and consistent multi-view image synthesis that extends the capabilities of StyleGAN2. Practically, it opens new avenues in applications such as 3D character design, virtual reality, and content creation by allowing better manipulation of 3D models and scenes derived from 2D images. Theoretically, the insights and techniques developed here can inspire further research into hybrid representations and pose-conditioned generative models.
Future Directions
The method provides a rich basis for future explorations in the following areas:
- Enhanced Geometry Quality: Future research could incorporate stronger geometry priors and regularization techniques to further enhance the quality of the 3D shapes generated.
- Pose Estimation Integration: Integration of more robust pose estimation techniques or learning the pose distributions dynamically could enhance the flexibility and robustness of the model.
- Alternative Backbone Models: Experimenting with different kinds of backbone models, like transformer-based architectures, could reveal new opportunities in conditional and controlled image synthesis.
The strong numerical results and qualitative improvements shown in this paper underscore the potential of the proposed hybrid architecture and dual-discrimination strategy. This could serve as a cornerstone for subsequent advancements in the domain of 3D-aware generative adversarial networks.