- The paper introduces a model that leverages spatially disentangled radiance fields with 3D bounding boxes for efficient, controllable scene synthesis.
- It utilizes a global-local discrimination strategy to separate objects from the background, achieving competitive FID and KID scores on diverse datasets.
- The approach supports intuitive 3D scene editing, enhancing applications in virtual reality, 3D content creation, and simulation.
DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-aware Scene Synthesis
The paper, "DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-aware Scene Synthesis," addresses a gap in 3D-aware image synthesis by focusing on generating complex scenes composed of multiple objects, rather than just individual canonical objects. The authors propose a novel generative model, DisCoScene, which leverages a simple yet powerful representation based on 3D bounding boxes to facilitate controllable scene synthesis. This approach not only enhances scene generation fidelity but also supports intuitive and flexible user interactions for scene editing.
Methodology
DisCoScene introduces a spatially disentangled radiance field that divides the scene into object-centric components. This division enables each object to be represented independently within a unified framework, facilitating efficient scene synthesis and editing from single-view 2D data. The core of the method relies on an abstract object-level representation that uses 3D bounding boxes as a foundational layout prior. This prior serves as an input to the model, providing a way to organize and spatially disentangle objects and background, which are later synthesized into harmonized scenes.
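To make the layout prior concrete, here is a minimal sketch (not the authors' code) of the kind of representation involved: each object is a 3D bounding box with a pose, and world-space sample points are mapped into each box's canonical frame so that every object radiance field can be queried in its own local coordinates. The names `BoxLayout`, `world_to_canonical`, and `inside_box` are illustrative assumptions.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class BoxLayout:
    center: np.ndarray    # (3,) box center in world space
    scale: np.ndarray     # (3,) half-extent of the box along each axis
    rotation: np.ndarray  # (3, 3) rotation mapping box-frame axes to world axes

def world_to_canonical(points: np.ndarray, box: BoxLayout) -> np.ndarray:
    """Map world-space points (N, 3) into the box's canonical [-1, 1]^3 frame."""
    local = (points - box.center) @ box.rotation   # undo translation, then rotation
    return local / box.scale                        # normalize by half-extents

def inside_box(canonical_points: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Boolean mask of points that fall inside the canonical unit cube."""
    return np.all(np.abs(canonical_points) <= 1.0 + eps, axis=-1)
```

Keeping each object in its own canonical frame is what lets a single shared object generator serve every box in the layout, regardless of where the box sits in the scene.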
Key to the model's effectiveness is its ability to spatially condition object generation on their respective locations and scales within scenes. This spatial conditioning helps retain logical object semantics and enhances the model's capability to learn from complex scenes. Additionally, the model incorporates a global-local discrimination mechanism during training. This mechanism includes scene and object-level discriminators, improving the granularity of object generation and enabling better disentanglement from the background.
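A hedged sketch of the global-local discrimination idea follows: a scene discriminator scores full renderings, while an object discriminator scores patches cropped around each object's projected bounding box. The crop extraction and the exact adversarial objective are simplified assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def global_local_d_loss(d_scene, d_obj, real_img, fake_img, real_crops, fake_crops):
    """Non-saturating GAN discriminator loss combining scene- and object-level terms."""
    loss_scene = (F.softplus(-d_scene(real_img)).mean() +
                  F.softplus(d_scene(fake_img)).mean())
    loss_obj = (F.softplus(-d_obj(real_crops)).mean() +
                F.softplus(d_obj(fake_crops)).mean())
    return loss_scene + loss_obj

def global_local_g_loss(d_scene, d_obj, fake_img, fake_crops):
    """Generator loss: fool both the global (scene) and local (object) critics."""
    return (F.softplus(-d_scene(fake_img)).mean() +
            F.softplus(-d_obj(fake_crops)).mean())
```

The local term gives the generator gradient signal at the granularity of individual objects, which is what drives the sharper object-background disentanglement described above.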
Numerical Results and Performance
The method demonstrates strong numerical results across various datasets, such as CLEVR, 3D-FRONT, and Waymo. These datasets vary in complexity, encompassing indoor and outdoor scenes, each containing multiple objects. DisCoScene achieves state-of-the-art performance in terms of both FID and KID scores, rivalling contemporary 3D-aware models and even matching some 2D image synthesis methods like StyleGAN2 in image quality, but with the added ability for 3D manipulation.
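For reference, KID is typically computed as the squared MMD between real and generated Inception features under a cubic polynomial kernel. The sketch below assumes `feats_real` and `feats_fake` are (N, d) arrays of Inception-v3 features; feature extraction itself is omitted.

```python
import numpy as np

def polynomial_kernel(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Cubic polynomial kernel k(x, y) = (x.y / d + 1)^3 used by KID."""
    d = x.shape[1]
    return (x @ y.T / d + 1.0) ** 3

def kid(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Unbiased MMD^2 estimate between real and generated feature sets."""
    k_rr = polynomial_kernel(feats_real, feats_real)
    k_ff = polynomial_kernel(feats_fake, feats_fake)
    k_rf = polynomial_kernel(feats_real, feats_fake)
    m, n = len(feats_real), len(feats_fake)
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
    term_rf = 2.0 * k_rf.mean()
    return float(term_rr + term_ff - term_rf)
```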
Moreover, DisCoScene maintains computational efficiency through a rendering pipeline that restricts volumetric rendering to sample points lying inside the object bounding boxes, reducing computational overhead. Reported training times and per-image inference speeds show that DisCoScene is competitive in resource usage and time efficiency.
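The efficiency idea can be sketched as a standard ray-box (slab) intersection: each camera ray is clipped against an object's axis-aligned bounding box, and samples are drawn only within the hit interval, so empty space outside the boxes never reaches volumetric rendering. This is an illustrative implementation under that assumption, not the paper's exact pipeline.

```python
import numpy as np

def ray_aabb_interval(origin, direction, box_min, box_max, eps=1e-9):
    """Return (t_near, t_far, hit) for a ray against an axis-aligned box (slab method)."""
    inv_d = 1.0 / (direction + eps)          # eps avoids division by zero on axis-parallel rays
    t0 = (box_min - origin) * inv_d
    t1 = (box_max - origin) * inv_d
    t_near = np.max(np.minimum(t0, t1))
    t_far = np.min(np.maximum(t0, t1))
    return t_near, t_far, (t_far >= max(t_near, 0.0))

def sample_in_box(origin, direction, box_min, box_max, n_samples=32):
    """Stratified samples along the ray, restricted to the box intersection interval."""
    t_near, t_far, hit = ray_aabb_interval(origin, direction, box_min, box_max)
    if not hit:
        return np.empty((0, 3))               # ray misses the box: nothing to shade
    ts = np.linspace(max(t_near, 0.0), t_far, n_samples)
    return origin + ts[:, None] * direction
```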
Practical and Theoretical Implications
Practically, DisCoScene expands the capabilities of 3D-aware models in generating complex scenes with multiple objects, while offering versatile editing capabilities. Users can manipulate the layout of the scene in 3D space, modifying object positions, orientations, and appearances interactively. This is particularly valuable for applications in virtual reality, 3D content creation, and simulations where scene flexibility and control are crucial.
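Because every object is tied to a bounding box, layout-based editing amounts to transforming a box and re-rendering with the same latents. The sketch below illustrates that workflow; `render_scene` and the layout fields are hypothetical names standing in for the generator's actual interface.

```python
import numpy as np

def translate_object(layout, index, offset):
    """Shift one object's box center; the rest of the layout is untouched."""
    layout[index]["center"] = layout[index]["center"] + np.asarray(offset)
    return layout

def rotate_object(layout, index, yaw_radians):
    """Rotate one object's box about the vertical (y) axis."""
    c, s = np.cos(yaw_radians), np.sin(yaw_radians)
    r_yaw = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    layout[index]["rotation"] = r_yaw @ layout[index]["rotation"]
    return layout

# Usage (hypothetical interface): edit the layout, then re-render with the same latents.
# layout = translate_object(layout, index=0, offset=[0.5, 0.0, 0.0])
# image = render_scene(generator, latents, layout, camera)
```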
Theoretically, the approach advances the understanding of disentangled representation learning in 3D space. It demonstrates how simple priors like bounding boxes can provide robust guidance for scene synthesis tasks. This work opens new avenues for exploring how other abstract representations might similarly benefit generative models, particularly in scenarios with limited or noisy data typical in real-world applications.
Future Directions
Future developments could focus on integrating more advanced scene priors or semantic annotations into the layout representation, potentially enhancing the semantic depth and realism of the generated scenes. Additionally, research may explore end-to-end learning of layout estimation directly from input data, which could streamline the synthesis process and widen the applicability of the model. Moreover, expanding the model's capability to handle even more diverse and dynamic scenes, potentially incorporating elements like lighting and texture variations, could further increase its utility and flexibility.
In conclusion, DisCoScene offers a compelling advancement in the 3D scene synthesis landscape, combining innovative representation techniques with practical flexibility and strong numerical results. Its ability to disentangle and compose complex scenes from single-view data contributes significantly to the field and provides a robust framework for future research and application developments in AI and computer vision.