DreamScene360: Elevating Text-to-3D Scene Generation with Panoramic Gaussian Splatting
Introduction
In virtual and mixed reality, generating immersive 3D environments directly from textual descriptions is a long-sought goal, bridging the gap between human language and computer-generated virtual worlds. DreamScene360 introduces a pipeline for generating 360° 3D scenes directly from text prompts, combining text-to-panorama generation with 2D-to-3D lifting. By producing detailed, globally consistent 3D scenes, the work marks a notable advance with applications spanning VR/MR, gaming, and design.
Background and Motivation
Generating 3D content from textual descriptions is challenging, in part because annotated 3D data is scarce and rendering fully immersive scenes is complex. Traditional methods often fall short, either limiting scene coverage or sacrificing detail and global consistency. DreamScene360 addresses these issues by using panoramic images as an intermediate representation, enabling full scene coverage and high-detail generation with minimal manual effort.
Technical Approach
Text to 360° Panoramas with Self-Refinement
The framework begins by generating a high-quality 360° panoramic image from a text prompt, using a diffusion model that produces panoramas with seamless transitions across the image borders. A stitching step enforces continuity between the panorama's left and right edges, which is crucial for the subsequent 3D lifting phase. A self-refinement loop driven by GPT-4V iteratively revises the prompt based on the visual quality and semantic alignment of draft panoramas, substantially improving the input to the panorama generator without manual tuning.
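The self-refinement loop can be summarized with a short sketch. The function below is a minimal illustration, assuming three caller-supplied callables: `generate_panorama` (the text-to-panorama diffusion model), `score_draft` (a GPT-4V-style rating of visual quality and prompt alignment), and `revise_prompt` (a GPT-4V rewrite of the prompt given the current best draft). These names are hypothetical placeholders, not the authors' API.

```python
# Minimal sketch of an iterative prompt self-refinement loop (assumed structure).
from typing import Any, Callable, Tuple

def self_refine(
    initial_prompt: str,
    generate_panorama: Callable[[str], Any],
    score_draft: Callable[[Any, str], float],
    revise_prompt: Callable[[str, Any], str],
    num_drafts: int = 4,
    num_rounds: int = 3,
) -> Tuple[str, Any]:
    """Iteratively revise the prompt and keep the highest-scoring panorama draft."""
    prompt = initial_prompt
    best_prompt, best_draft, best_score = prompt, None, float("-inf")

    for _ in range(num_rounds):
        # Draft several candidate panoramas from the current prompt.
        drafts = [generate_panorama(prompt) for _ in range(num_drafts)]
        scores = [score_draft(d, prompt) for d in drafts]

        # Track the best draft seen so far across all rounds.
        top = max(range(num_drafts), key=scores.__getitem__)
        if scores[top] > best_score:
            best_prompt, best_draft, best_score = prompt, drafts[top], scores[top]

        # Ask the vision-language model to rewrite the prompt, conditioned on
        # the current best draft, so the next round improves on it.
        prompt = revise_prompt(prompt, drafts[top])

    return best_prompt, best_draft
```

In practice the loop terminates after a few rounds, since the score of the best draft tends to plateau once the prompt captures the intended content and style.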
From Panorama to 3D Scene
After panorama generation, DreamScene360 lifts this 2D representation into a 3D scene. A geometric field is initialized and optimized against a monocular depth estimate, providing a scaffold from which 3D geometry can be derived. The geometry is then corrected and refined to compensate for the limitations of single-view depth estimation and to improve the scene's spatial coherence and depth accuracy.
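To make the lifting step concrete, the following sketch shows one common way to register a relative monocular depth map against a reference and unproject an equirectangular panorama into a 3D point cloud. It is an illustrative assumption of how such a scaffold can be built, not the authors' exact implementation.

```python
# A minimal sketch of depth alignment and equirectangular unprojection (assumed).
import numpy as np

def align_depth(depth_rel: np.ndarray, depth_ref: np.ndarray) -> np.ndarray:
    """Least-squares scale/shift that maps a relative depth map onto a reference."""
    a = np.stack([depth_rel.ravel(), np.ones(depth_rel.size)], axis=1)
    scale, shift = np.linalg.lstsq(a, depth_ref.ravel(), rcond=None)[0]
    return scale * depth_rel + shift

def unproject_equirect(depth: np.ndarray) -> np.ndarray:
    """Turn an (H, W) equirectangular depth map into an (H*W, 3) point cloud."""
    h, w = depth.shape
    # Longitude spans [-pi, pi), latitude spans [pi/2, -pi/2] from top to bottom.
    lon = (np.arange(w) + 0.5) / w * 2 * np.pi - np.pi
    lat = np.pi / 2 - (np.arange(h) + 0.5) / h * np.pi
    lon, lat = np.meshgrid(lon, lat)
    # Unit ray directions on the sphere, scaled by the per-pixel depth.
    dirs = np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)
    return (dirs * depth[..., None]).reshape(-1, 3)
```

The resulting point cloud gives every panorama pixel a 3D position, which can then serve as the initialization for the scene representation optimized in the next stage.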
Optimizing 3D Gaussian Representations
At the core of DreamScene360's 3D scene representation is the optimization of 3D Gaussian splatting, a technique that models the scene as a set of 3D Gaussian primitives and renders them efficiently and flexibly. The representation is refined using synthetic views that emulate parallax and improve depth perception. Semantic and geometric regularization keeps the generated scene faithful to the panorama across different viewpoints, addressing the incomplete coverage inherent to single-view inputs.
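The optimization can be pictured as a training loop that alternates between panorama-covered viewpoints and synthetic parallax views. The sketch below assumes a differentiable splatting renderer (`render`), a feature extractor used for semantic consistency (`extract_features`), and a Gaussian model exposing `parameters()` and `scales()`; all names and loss weights are illustrative assumptions rather than the paper's exact losses.

```python
# High-level sketch of the Gaussian optimization loop (assumed interfaces).
import torch
import torch.nn.functional as F

def optimize_gaussians(gaussians, render, extract_features,
                       pano_views, synthetic_views,
                       steps=3000, lr=1e-2, w_sem=0.1, w_geo=0.05):
    """Alternate panorama-derived views (photometric loss) and synthetic views (regularized)."""
    opt = torch.optim.Adam(gaussians.parameters(), lr=lr)
    _, ref_img = pano_views[0]  # reference image rendered from the panorama

    for step in range(steps):
        use_synth = (step % 2 == 1) and len(synthetic_views) > 0
        if use_synth:
            cam, target = synthetic_views[step % len(synthetic_views)], None
        else:
            cam, target = pano_views[step % len(pano_views)]

        rendered = render(gaussians, cam)  # differentiable splatting, (3, H, W)

        if target is not None:
            # Photometric fidelity at viewpoints covered by the panorama.
            loss = F.l1_loss(rendered, target)
        else:
            # Semantic regularizer: features of the unseen synthetic view should
            # remain consistent with the panorama's reference view.
            loss = w_sem * F.mse_loss(extract_features(rendered),
                                      extract_features(ref_img).detach())

        # Geometric regularizer: penalize overly large Gaussians that would
        # produce floaters when the camera moves away from the panorama center.
        loss = loss + w_geo * gaussians.scales().abs().mean()

        opt.zero_grad()
        loss.backward()
        opt.step()

    return gaussians
```

Alternating the two view types lets the photometric term anchor the scene to the panorama while the regularizers constrain regions that the single input view never observes.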
Contributions and Findings
DreamScene360 presents several key contributions to the field of 3D content generation:
- A novel pipeline for generating immersive 360° scenes from text inputs, utilizing panoramic images to ensure global scene consistency.
- Integration of a self-refinement process that enhances text prompts through iterative feedback, optimizing panorama quality without extensive manual effort.
- A robust technique for transforming panoramas into detailed 3D scenes, incorporating advanced Gaussian splatting to maintain visual and geometric fidelity.
- Validation of the proposed method against state-of-the-art alternatives, demonstrating superior capability in rendering detailed, consistent 3D environments with wide-ranging applicability.
Implications and Future Work
DreamScene360's methodology significantly lowers the barriers to high-quality 3D scene generation, enabling more intuitive creation processes for VR, gaming, and simulation applications. The use of panoramas as an intermediary format presents a promising direction for future research, potentially unlocking more efficient workflows and higher fidelity in 3D content generation.
Despite these advances, DreamScene360 faces limitations, most notably its dependence on the resolution of the underlying text-to-image models. Future work may further improve resolution and detail, and extend the method's adaptability to a wider range of scene types and complexity levels.
Conclusion
DreamScene360 represents a significant step forward in the text-to-3D domain, offering an effective solution for generating intricate, visually coherent 3D scenes from textual descriptions. Through its use of panoramic imaging and Gaussian splatting, together with a self-refinement process, it sets a new standard for creating virtual environments and for integrating language-driven creativity with digital visualization.