ShowRoom3D: Text to High-Quality 3D Room Generation Using 3D Priors

Published 20 Dec 2023 in cs.CV | (2312.13324v1)

Abstract: We introduce ShowRoom3D, a three-stage approach for generating high-quality 3D room-scale scenes from text. Previous methods that use 2D diffusion priors to optimize neural radiance fields for room-scale scenes have shown unsatisfactory quality. This is primarily attributed to the limitations of 2D priors, which lack 3D awareness, and to constraints in the training methodology. In this paper, we utilize a 3D diffusion prior, MVDiffusion, to optimize the 3D room-scale scene. Our contributions are in two aspects. Firstly, we propose a progressive view selection process to optimize NeRF. This involves dividing the training process into three stages, gradually expanding the camera sampling scope. Secondly, we propose a pose transformation method in the second stage, which ensures that MVDiffusion provides accurate view guidance. As a result, ShowRoom3D enables the generation of rooms with improved structural integrity, enhanced clarity from any view, reduced content repetition, and higher consistency across different perspectives. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches by a large margin in a user study.

Summary

  • The paper presents a novel multi-stage training pipeline for text-to-3D room generation using 3D diffusion priors.
  • It employs pose transformation and MVDiffusion's Correspondence-Aware Attention (CAA) module to ensure multi-view consistency and robust scene geometry.
  • Experimental results show improved image clarity and structural integrity with superior CLIP scores over current methods.

Introduction

The paper "ShowRoom3D: Text to High-Quality 3D Room Generation Using 3D Priors" presents a method for generating high-quality 3D room-scale scenes from text descriptions using a 3D diffusion prior. Earlier methods built on 2D diffusion models often struggle with quality and consistency because those models lack 3D awareness. ShowRoom3D instead uses MVDiffusion, a model designed for multi-view consistency, to guide the generation of 3D scenes. Key contributions include a three-stage training pipeline and a pose transformation technique that keeps view guidance accurate during NeRF optimization, resulting in robust room structures and improved image clarity.

Methodology

Three-Stage Training Pipeline

The proposed method features a three-stage training pipeline that gradually refines the NeRF model by expanding the camera sampling scope:

  • First Stage: The camera is positioned at the center of the room and rotated to generate panoramic views, establishing the initial room geometry and structure (Figure 1). Because the views cover the full panorama, this stage captures the whole room.

    Figure 1: The illustration of every stage's camera sampling method. In the initial stage, the camera is fixed at the origin with free rotational capabilities.

  • Second Stage: Cameras are sampled from various positions and oriented outward, optimizing NeRF for rendering from diverse viewpoints (Figure 2). This stage refines the geometry and widens the range of viewpoints the model can render.

    Figure 2: Method overview: showcasing the three-stage training pipeline and the pose transformation module in the second stage.

  • Third Stage: Random sampling of camera positions and rotations at different iterations allows the NeRF model to achieve versatile rendering capabilities for room-scale scenes, providing consistency across arbitrary viewpoints.
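The three sampling regimes above can be sketched in a few lines. This is a minimal 2D illustration under stated assumptions, not the paper's implementation: the function names, the square room of half-width `half_extent`, and the iteration boundaries in `stage_for_iteration` are all hypothetical, and only yaw is modeled.

```python
import math
import random


def sample_camera(stage, half_extent=1.0, rng=random):
    """Return ((x, y), yaw) for one training view in a top-down 2D room.

    stage 1: camera fixed at the origin, free rotation.
    stage 2: random position, looking outward (away from the origin).
    stage 3: random position and random rotation.
    The square room of half-width `half_extent` is an assumption.
    """
    if stage not in (1, 2, 3):
        raise ValueError("stage must be 1, 2, or 3")
    if stage == 1:
        return (0.0, 0.0), rng.uniform(-math.pi, math.pi)
    x = rng.uniform(-half_extent, half_extent)
    y = rng.uniform(-half_extent, half_extent)
    if stage == 2:
        # Outward: the direction from the origin through the camera position.
        yaw = math.atan2(y, x)
    else:
        yaw = rng.uniform(-math.pi, math.pi)
    return (x, y), yaw


def stage_for_iteration(iteration, boundaries=(1000, 2000)):
    """Map a training iteration to a stage; the boundaries are made-up values."""
    if iteration < boundaries[0]:
        return 1
    if iteration < boundaries[1]:
        return 2
    return 3
```

A training loop would call `sample_camera(stage_for_iteration(it))` once per iteration, so the sampling scope widens as training proceeds.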

To address inaccuracies in MVDiffusion guidance when the camera is not at the origin, pose transformation is employed in the second stage. This ensures an equivalent camera perspective, facilitating accurate multi-view guidance.
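The summary does not spell out the transformation itself. One hypothetical reading, shown here only as an illustration and not as the paper's formula, is to find the rotation of an origin-centered camera that looks at the same target point as the off-center camera. The function name and the `depth` parameter (an assumed distance from the camera to the observed surface) are inventions for this sketch.

```python
import math


def equivalent_origin_yaw(position, yaw, depth):
    """Yaw of an origin-centered camera aimed at the same target point.

    position: (x, y) of the off-center camera in a top-down 2D room
    yaw:      its viewing direction in radians
    depth:    assumed distance along the viewing ray to the observed surface
    """
    # Point the off-center camera is looking at.
    target_x = position[0] + depth * math.cos(yaw)
    target_y = position[1] + depth * math.sin(yaw)
    # Direction from the origin to that same point.
    return math.atan2(target_y, target_x)
```

For an outward-looking camera, the viewing ray passes through the origin, so under this reading the equivalent origin camera keeps the same yaw regardless of `depth`.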

MVDiffusion and CAA Module

ShowRoom3D relies on MVDiffusion for multi-view-consistent guidance. MVDiffusion's Correspondence-Aware Attention (CAA) module is an attention mechanism that relates pixels across different camera views, using a positional encoding of the spatial relationship between corresponding locations. ShowRoom3D uses this guidance to optimize NeRF models that produce detailed and coherent room-scale scenes.
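As a rough illustration of attention across views with a positional encoding, the sketch below attends from one query pixel to a handful of corresponding pixels in neighboring views. It is a simplification under stated assumptions: the learned query, key, and value projections are replaced by the identity, the sinusoidal encoding and all function names are hypothetical, and it is not MVDiffusion's actual code.

```python
import math


def freq_encoding(dx, dy, dim):
    """Sinusoidal encoding of a 2D displacement; `dim` must be a multiple of 4."""
    if dim % 4 != 0:
        raise ValueError("dim must be a multiple of 4")
    out = []
    for k in range(dim // 4):
        f = 2.0 ** k
        out += [math.sin(f * dx), math.cos(f * dx),
                math.sin(f * dy), math.cos(f * dy)]
    return out


def correspondence_attention(query_feat, neighbor_feats, displacements):
    """Attend from one pixel to its corresponding pixels in other views.

    query_feat:     feature vector of the query pixel
    neighbor_feats: feature vectors of candidate corresponding pixels
    displacements:  (dx, dy) offset associated with each candidate
    Returns (aggregated feature, attention weights).
    """
    dim = len(query_feat)
    # The query gets the encoding of a zero displacement.
    q = [a + b for a, b in zip(query_feat, freq_encoding(0.0, 0.0, dim))]
    keys = [
        [a + b for a, b in zip(feat, freq_encoding(dx, dy, dim))]
        for feat, (dx, dy) in zip(neighbor_feats, displacements)
    ]
    # Scaled dot-product scores followed by a numerically stable softmax.
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Identity value projection: aggregate the encoded neighbor features.
    out = [sum(w * k[j] for w, k in zip(weights, keys)) for j in range(dim)]
    return out, weights
```

The displacement encoding lets the attention weights depend on where a candidate sits relative to the query, not only on feature similarity.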

Experimental Results

Qualitative and Quantitative Comparisons

Extensive experiments demonstrate ShowRoom3D's superiority over state-of-the-art techniques such as DreamFusion and ProlificDreamer. The comparisons show that ShowRoom3D reduces content repetition and improves structural integrity, producing clear and consistent images without the Janus problem (Figure 3).

Figure 3: Qualitative comparisons of ShowRoom3D and state-of-the-art approaches.

Quantitative metrics, including CLIP scores and aesthetic scores, support the qualitative findings. ShowRoom3D achieves the highest averages in text alignment and aesthetic quality, and user studies report that participants preferred its results across the evaluated attributes.
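A text-alignment score of this kind is commonly computed as the cosine similarity between a text embedding and the embeddings of rendered views, averaged over views. The sketch below shows only that aggregation step; it assumes the embeddings were already produced by a CLIP model, and the function names are hypothetical. The exact evaluation protocol used in the paper is not given in this summary.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_clip_similarity(text_embedding, view_embeddings):
    """Average text-to-image similarity over all rendered views.

    Both arguments are assumed to be precomputed embeddings from the
    same CLIP model; this function does not run the model itself.
    """
    if not view_embeddings:
        raise ValueError("need at least one view embedding")
    sims = [cosine_similarity(text_embedding, v) for v in view_embeddings]
    return sum(sims) / len(sims)
```

Averaging over many rendered views matters for room-scale scenes, since a single viewpoint shows only part of the room.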

Ablation Studies

Ablation studies on individual components confirm the importance of multi-stage training and pose transformation. Quality drops when stages are omitted or when a single-stage pipeline is used. The CAA module's impact on style consistency and geometric accuracy is also analyzed, showing substantial improvements in content diversity without style inconsistencies (Figure 4).

Figure 4: Ablation study on each proposed component and their impact on rendering quality.

Conclusion

ShowRoom3D offers a framework for generating 3D room-scale scenes from text, using a 3D prior and a staged training schedule to refine the NeRF scene representation. Potential applications include VR, AR, and other immersive technologies. The method still suffers from oversaturation and long training times, which are left as directions for further work.

The approach points toward more coherent virtual reality experiences and more detailed architectural visualization. As research progresses, methods like ShowRoom3D may support richer interactive interfaces in these domains.
