ShowRoom3D: Text to High-Quality 3D Room Generation Using 3D Priors

Published 20 Dec 2023 in cs.CV | (2312.13324v1)

Abstract: We introduce ShowRoom3D, a three-stage approach for generating high-quality 3D room-scale scenes from text. Previous methods that use 2D diffusion priors to optimize neural radiance fields for room-scale scenes have shown unsatisfactory quality. This is primarily attributed to the 2D priors' lack of 3D awareness and to constraints in the training methodology. In this paper, we utilize a 3D diffusion prior, MVDiffusion, to optimize the 3D room-scale scene. Our contributions are twofold. First, we propose a progressive view selection process to optimize NeRF. This involves dividing the training process into three stages, gradually expanding the camera sampling scope. Second, we propose a pose transformation method in the second stage, which ensures that MVDiffusion provides accurate view guidance. As a result, ShowRoom3D enables the generation of rooms with improved structural integrity, enhanced clarity from any view, reduced content repetition, and higher consistency across different perspectives. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches by a large margin in a user study.

Summary

The paper presents a novel multi-stage training pipeline for text-to-3D room generation using 3D diffusion priors.
It employs pose transformation and MVDiffusion's Correspondence-Aware Attention (CAA) module to ensure multi-view consistency and robust scene geometry.
Experimental results show improved image clarity and structural integrity with superior CLIP scores over current methods.

Introduction

The paper "ShowRoom3D: Text to High-Quality 3D Room Generation Using 3D Priors" presents a method for generating high-quality 3D room-scale scenes from textual descriptions using 3D diffusion priors. Earlier methods built on 2D diffusion models often struggle with quality and consistency because they lack 3D awareness. ShowRoom3D instead uses MVDiffusion, a model designed for multi-view consistency, to guide the generation of 3D scenes. Its key contributions are a three-stage training pipeline and a pose transformation technique that ensures accurate view guidance during NeRF optimization, resulting in robust room structures and improved image clarity.

Methodology

Three-Stage Training Pipeline

The proposed method features a three-stage training pipeline that gradually refines the NeRF model by expanding the camera sampling scope:

First Stage: The camera is positioned at the center of the room and rotated to generate panoramic views, establishing the initial room geometry and structure (Figure 1). Because the camera only rotates, this stage covers the full panorama from a single viewpoint.
Second Stage: Cameras are sampled from various positions and oriented outward, optimizing NeRF for better spatial rendering across diverse viewpoints (Figure 2). This stage refines the geometry and extends rendering to off-center viewpoints.
Third Stage: Camera positions and rotations are both sampled randomly at each iteration, so the NeRF model learns to render the room-scale scene consistently from arbitrary viewpoints.

Figure 1: The illustration of every stage's camera sampling method. In the initial stage, the camera is fixed at the origin with free rotational capabilities.
Figure 2: Method overview: showcasing the three-stage training pipeline and the pose transformation module in the second stage.

To address inaccuracies in MVDiffusion guidance when the camera is not at the origin, pose transformation is employed in the second stage. This ensures an equivalent camera perspective, facilitating accurate multi-view guidance.
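The staged camera sampling described in this subsection can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the stage boundaries, the square room extent, the pitch range, and the reading of "oriented outward" as a yaw pointing away from the room center are all assumptions made for illustration. The pose transformation step is not modeled here.

```python
import math
import random

TWO_PI = 2.0 * math.pi


def stage_for_iteration(step, stage_ends=(1000, 2000)):
    """Map a training step to stage 1, 2, or 3.

    The boundaries in `stage_ends` are hypothetical placeholders,
    not values from the paper.
    """
    if step < stage_ends[0]:
        return 1
    if step < stage_ends[1]:
        return 2
    return 3


def sample_camera(stage, half_extent=1.0, rng=random):
    """Return (position, yaw, pitch) for the given stage.

    Stage 1: camera fixed at the origin, free rotation (yaw only here).
    Stage 2: random position, yaw pointing away from the room center
             (an assumed interpretation of "oriented outward").
    Stage 3: random position and random rotation.
    """
    if stage == 1:
        return (0.0, 0.0, 0.0), rng.uniform(0.0, TWO_PI), 0.0

    # Stages 2 and 3 move the camera off-center inside an assumed square room.
    x = rng.uniform(-half_extent, half_extent)
    y = rng.uniform(-half_extent, half_extent)
    position = (x, y, 0.0)

    if stage == 2:
        yaw = math.atan2(y, x) % TWO_PI  # direction from the center to the camera
        return position, yaw, 0.0
    if stage == 3:
        yaw = rng.uniform(0.0, TWO_PI)
        pitch = rng.uniform(-math.pi / 6, math.pi / 6)  # assumed pitch range
        return position, yaw, pitch
    raise ValueError("stage must be 1, 2, or 3")
```

A training loop would call `sample_camera(stage_for_iteration(step))` once per iteration, so the sampling scope widens as training progresses.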

MVDiffusion and CAA Module

MVDiffusion provides the multi-view-consistent guidance through its Correspondence-Aware Attention (CAA) module. This attention mechanism relates features at corresponding locations in different camera views, using positional encoding to represent their spatial relationship. ShowRoom3D uses this guidance to optimize NeRF models that produce detailed and coherent room-scale scenes.
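To make the idea of attending over corresponding features concrete, here is a small sketch of generic scaled dot-product attention for a single query feature. It is not MVDiffusion's actual CAA module: the learned projections are omitted, and `offset_encoding`, `add_offset`, and the 4-dimensional features are hypothetical stand-ins chosen only to show where a positional encoding of the pixel offset could enter.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def offset_encoding(dx, dy):
    """Toy 4-dimensional encoding of a pixel offset (an assumption)."""
    return [math.sin(dx), math.cos(dx), math.sin(dy), math.cos(dy)]


def add_offset(feature, dx, dy):
    """Add the offset encoding to a 4-dimensional feature, element-wise."""
    return [f + e for f, e in zip(feature, offset_encoding(dx, dy))]


def attend(query, keys, values):
    """Scaled dot-product attention for one query.

    `keys` and `values` stand for features sampled around the
    corresponding location in another view. Returns the weighted
    sum of values and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights
```

In use, each neighbor feature from the other view would first pass through `add_offset` with its displacement from the corresponding point, and the resulting vectors would serve as keys, so that the attention weights depend on both content and relative position.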

Experimental Results

Qualitative and Quantitative Comparisons

Extensive experiments compare ShowRoom3D against state-of-the-art techniques such as DreamFusion and ProlificDreamer. The qualitative comparisons show that ShowRoom3D reduces content repetition and improves structural integrity, producing clear and consistent images without the Janus problem (Figure 3).

Figure 3: Qualitative comparisons of ShowRoom3D and state-of-the-art approaches.

Quantitative metrics, including CLIP scores and aesthetic evaluations, support these observations. ShowRoom3D achieves the highest average scores for text alignment and aesthetic quality, and a user study shows that participants prefer its results across the evaluated attributes.
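As background on the text-alignment metric, a CLIP score is commonly computed as the cosine similarity between a CLIP image embedding and a CLIP text embedding, averaged over rendered views. The sketch below shows only that arithmetic on plain lists of floats. The function names are hypothetical, the embeddings would come from a CLIP model that is not included here, and the summary does not specify the exact protocol the paper uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_clip_style_score(image_embeddings, text_embedding):
    """Average image-text similarity over a set of rendered views.

    `image_embeddings` is a list of vectors, one per rendered view;
    `text_embedding` is the vector for the prompt.
    """
    scores = [cosine_similarity(e, text_embedding) for e in image_embeddings]
    return sum(scores) / len(scores)
```

Averaging over many rendered views matters for room-scale scenes, since a single favorable viewpoint can hide poor alignment elsewhere in the room.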

Ablation Studies

Ablation studies on individual components confirm the importance of multi-stage training and pose transformation. Quality drops when stages are omitted or when a single-stage pipeline is used. The CAA module's impact on style consistency and geometric accuracy is also analyzed, showing substantial improvements in content diversity without style inconsistencies (Figure 4).

Figure 4: Ablation study on each proposed component and their impact on rendering quality.

Conclusion

ShowRoom3D offers a robust framework for generating 3D room-scale scenes from text, using 3D priors and a staged training regimen to refine the neural scene representation. Its modular approach can inform future work on synthesizing virtual environments, with applications in VR, AR, and other immersive technologies. It currently faces challenges such as oversaturation and time-intensive training, which remain open for further optimization.

The implications of ShowRoom3D extend to more coherent virtual reality experiences and enhanced architectural visualization, with greater realism and detail in digital environments. As research progresses, methods like ShowRoom3D are likely to support richer interactive interfaces across various domains.