- The paper introduces SceneDreamer, a generative model that builds unbounded 3D scenes from 2D images without 3D annotations.
- It utilizes a bird's-eye-view representation and a semantic-aware neural hash grid to efficiently encode complex scene features.
- Experimental results demonstrate that the approach yields multi-view-consistent, photorealistic 3D landscapes with free camera movement.
SceneDreamer: Unbounded 3D Scene Generation from 2D Image Collections
The paper presents SceneDreamer, an unconditional generative model that synthesizes unbounded 3D scenes from 2D image collections without relying on any 3D annotations. This is significant because it addresses the challenge of generating large-scale 3D content efficiently while leveraging only in-the-wild 2D images, which commonly lack semantic labels or camera pose information.
Core Methodology
SceneDreamer introduces a structured framework consisting of three primary components: a bird's-eye-view (BEV) representation, a semantic-aware generative parameterization, and a volumetric renderer. The BEV representation is central to the model's efficiency: it encodes a 3D scene with a 2D height field and a 2D semantic field, so representation cost grows quadratically with scene extent rather than cubically as with a dense voxel grid, keeping even expansive landscapes tractable.
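To make the complexity argument concrete, here is a minimal sketch (not the authors' implementation) of a BEV-style scene: two 2D maps, a height field and a semantic field, from which any 3D query can be answered. All class, attribute, and parameter names below are illustrative assumptions.

```python
import numpy as np

class BEVScene:
    """Toy BEV scene: a 3D landscape stored as two 2D maps over the ground plane,
    so memory grows with the square of scene extent rather than its cube."""

    def __init__(self, height_field: np.ndarray, semantic_field: np.ndarray, cell_size: float = 1.0):
        assert height_field.shape == semantic_field.shape
        self.height = height_field      # (H, W) terrain elevation
        self.semantic = semantic_field  # (H, W) integer class labels (e.g. water, sand, grass)
        self.cell_size = cell_size

    def query(self, x: float, y: float, z: float):
        """Return (occupied, semantic_label) for a 3D point by reading the 2D maps."""
        i = int(np.clip(x / self.cell_size, 0, self.height.shape[0] - 1))
        j = int(np.clip(y / self.cell_size, 0, self.height.shape[1] - 1))
        occupied = z <= self.height[i, j]   # solid below the terrain surface
        return occupied, int(self.semantic[i, j])

# Usage: a toy 4x4 scene queried at an arbitrary 3D point
heights = np.random.rand(4, 4) * 10.0
labels = np.random.randint(0, 3, size=(4, 4))
scene = BEVScene(heights, labels)
print(scene.query(1.2, 2.7, 3.0))
```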
A key novelty is the semantic-aware neural hash grid, which parameterizes latent features as a function of both 3D position and scene semantics. This parameterization lets a single set of learned features generalize across many scenes while remaining sensitive to local scene content.
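The sketch below shows one way such a parameterization could look: a single-resolution hash table indexed jointly by integer cell coordinates and a per-point semantic label, so the same location yields different features under different semantics. The paper's generative hash grid differs in its details (multi-resolution, scene-code conditioning); every name and size here is an assumption for illustration.

```python
import torch
import torch.nn as nn

class SemanticHashGrid(nn.Module):
    """Hedged sketch of a semantic-aware hash grid (single resolution for brevity)."""

    PRIMES = (1, 2654435761, 805459861, 3674653429)  # common spatial-hash primes

    def __init__(self, table_size: int = 2 ** 16, feat_dim: int = 8, resolution: int = 64):
        super().__init__()
        self.table = nn.Parameter(torch.randn(table_size, feat_dim) * 1e-2)  # learnable features
        self.table_size = table_size
        self.resolution = resolution

    def forward(self, xyz: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) in [0, 1], semantic: (N,) integer class labels
        cell = (xyz * self.resolution).long()          # (N, 3) integer grid cells
        key = torch.zeros_like(semantic)
        for d in range(3):
            key ^= cell[:, d] * self.PRIMES[d]         # hash the spatial coordinate
        key ^= semantic * self.PRIMES[3]               # fold semantics into the hash
        idx = key % self.table_size
        return self.table[idx]                         # (N, feat_dim) latent features

# Usage: per-point latent features conditioned on position and semantics
grid = SemanticHashGrid()
feats = grid(torch.rand(5, 3), torch.randint(0, 4, (5,)))
print(feats.shape)
```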
Finally, a neural volumetric renderer maps these latent features to images and is trained adversarially against the 2D image collection, so no 3D supervision is needed to obtain photorealistic outputs. Together, the three components couple 3D geometry with scene semantics to generate vivid, diverse, unbounded 3D worlds.
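The rendering step itself follows the standard volume-rendering formulation: densities and colors sampled along each camera ray are alpha-composited into a pixel. The sketch below shows a single ray, with the adversarial discriminator omitted; function and variable names are illustrative, not the paper's API.

```python
import torch

def render_ray(densities: torch.Tensor, colors: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    """Alpha-composite samples along one ray.
    densities: (S,) non-negative, colors: (S, 3), deltas: (S,) distances between samples."""
    alpha = 1.0 - torch.exp(-densities * deltas)                       # per-sample opacity
    survive = torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1]     # prob. light passes earlier samples
    weights = alpha * torch.cumprod(survive, dim=0)                    # contribution of each sample
    return (weights.unsqueeze(-1) * colors).sum(dim=0)                 # (3,) composited pixel color

# Usage with random samples along a single ray
S = 32
pixel = render_ray(torch.rand(S), torch.rand(S, 3), torch.full((S,), 0.1))
print(pixel)
```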
Experimental Validation
Experiments show that SceneDreamer outperforms prior state-of-the-art methods at synthesizing multi-view-consistent, photorealistic 3D landscapes. Its ability to produce realistic scenes under free camera movement is evidenced by the image quality and depth consistency of the rendered outputs.
Implications and Future Perspectives
SceneDreamer's approach to unbounded scene generation points to applications wherever realistic, scalable 3D content is needed, such as gaming, virtual reality, and the metaverse. The paper lays the groundwork for future explorations in 3D generative models, suggesting enhancements to procedural scene generation and extensions to other forms of 3D content.
Future developments could explore integration with more advanced procedural generation techniques or machine learning methods for terrain modeling. Additionally, refining the camera pose sampling strategy and further reducing computational resource requirements could render SceneDreamer even more versatile and efficient.
In conclusion, SceneDreamer presents a methodologically robust and computationally efficient solution for generating large-scale, diverse 3D scenes from simple 2D image inputs, representing a compelling advancement in the domain of 3D scene generation.