FastScene: Text-Driven Fast 3D Indoor Scene Generation via Panoramic Gaussian Splatting (2405.05768v1)

Published 9 May 2024 in cs.CV

Abstract: Text-driven 3D indoor scene generation holds broad applications, ranging from gaming and smart homes to AR/VR. Fast and high-fidelity scene generation is paramount for ensuring user-friendly experiences. However, existing methods are characterized by lengthy generation processes or necessitate the intricate manual specification of motion parameters, which introduces inconvenience for users. Furthermore, these methods often rely on narrow-field viewpoint iterative generations, compromising global consistency and overall scene quality. To address these issues, we propose FastScene, a framework for fast, higher-quality 3D scene generation that maintains scene consistency. Specifically, given a text prompt, we generate a panorama and estimate its depth, since the panorama encompasses information about the entire scene and exhibits explicit geometric constraints. To obtain high-quality novel views, we introduce the Coarse View Synthesis (CVS) and Progressive Novel View Inpainting (PNVI) strategies, ensuring both scene consistency and view quality. Subsequently, we utilize Multi-View Projection (MVP) to form perspective views, and apply 3D Gaussian Splatting (3DGS) for scene reconstruction. Comprehensive experiments demonstrate that FastScene surpasses other methods in both generation speed and quality with better scene consistency. Notably, guided only by a text prompt, FastScene can generate a 3D scene within a mere 15 minutes, which is at least one hour faster than state-of-the-art methods, making it a paradigm for user-friendly scene generation.

Authors (3)
  1. Yikun Ma (3 papers)
  2. Dandan Zhan (1 paper)
  3. Zhi Jin (160 papers)
Citations (4)

Summary

Fast and Consistent 3D Scene Generation from Text Descriptions

Introduction

Generating 3D indoor scenes from text descriptions has a multitude of applications across fields such as gaming, AR/VR, and smart home design. While the transformation of text into 3D objects has seen substantial improvements, creating entire 3D scenes remains challenging because realism and consistency must be maintained over large spatial compositions. Existing methods often sacrifice speed, user convenience, or scene fidelity. FastScene is a new framework designed to address these limitations, providing a faster and more cohesive way to generate high-quality 3D scenes from textual input.

Key Challenges in Scene Generation

Generating complex 3D scenes from text prompts necessitates overcoming several challenges:

  • Speed and Efficiency: Traditional methods, while possibly robust, require long processing times, making them impractical for real-time applications.
  • Scene Consistency: Ensuring that the generated scenes do not just look realistic from single viewpoints, but maintain consistency when observed from varying perspectives.
  • User Convenience: Simplifying the generation process so that end-users do not need to manually tweak intricate parameters.

FastScene: A New Approach to Text-driven 3D Scene Generation

Overview of FastScene

FastScene introduces an efficient, structured process for indoor scene generation that comprises three primary phases (a code sketch of the overall flow follows this list):

  1. Panorama Generation: Creates a panoramic view of the scene from the text prompt and estimates its depth. The panorama offers a 360-degree overview, capturing comprehensive spatial information and providing explicit geometric constraints that help keep the scene consistent.
  2. View Synthesis and Inpainting: Applies Coarse View Synthesis (CVS) and Progressive Novel View Inpainting (PNVI) to generate and refine views from different perspectives, filling in visual gaps without noticeable distortions.
  3. 3D Reconstruction: Utilizes Multi-View Projection (MVP) and 3D Gaussian Splatting (3DGS) for reconstructing the scene in three dimensions from the generated panoramic views.
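
The phases above compose into a single linear pipeline. The following is a minimal, hypothetical sketch of that flow; every stage name and signature is a placeholder chosen for illustration and does not correspond to the authors' released code.

```python
from typing import Callable, List, Sequence

import numpy as np


def fastscene_pipeline(
    prompt: str,
    text_to_panorama: Callable[[str], np.ndarray],            # Phase 1: text -> 360-degree panorama
    estimate_depth: Callable[[np.ndarray], np.ndarray],       # Phase 1: panoramic depth map
    coarse_view_synthesis: Callable[..., np.ndarray],         # Phase 2: CVS, warp to a new pose
    progressive_inpaint: Callable[[np.ndarray], np.ndarray],  # Phase 2: PNVI, fill disocclusions
    multi_view_projection: Callable[[np.ndarray], List[np.ndarray]],  # Phase 3: MVP
    fit_gaussians: Callable[[List[np.ndarray]], object],      # Phase 3: 3DGS reconstruction
    novel_poses: Sequence[np.ndarray] = (),
):
    # Phase 1: generate a panorama from the prompt and estimate its depth.
    pano = text_to_panorama(prompt)
    depth = estimate_depth(pano)

    # Phase 2: synthesize coarse novel panoramas and progressively inpaint the holes.
    panoramas = [pano]
    for pose in novel_poses:
        coarse = coarse_view_synthesis(pano, depth, pose)  # geometry-guided warp, leaves gaps
        panoramas.append(progressive_inpaint(coarse))      # fill the gaps step by step

    # Phase 3: project each panorama to perspective images and fit a 3DGS scene.
    perspective_views = [img for p in panoramas for img in multi_view_projection(p)]
    return fit_gaussians(perspective_views)
```

In practice, each callable would wrap the corresponding component: a text-to-panorama generator, a panoramic depth estimator, the CVS and PNVI modules, a panorama-to-perspective projector, and a 3DGS trainer.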

Detailed Innovations

  • CVS and PNVI Methods: These strategies generate new views whose missing regions are filled by progressive inpainting, which handles large viewpoint changes more gracefully and prevents distortions from accumulating.
  • Panorama to Multi-View Processing: By transforming panoramic images into multiple perspective views (see the projection sketch below), FastScene can apply standard 3D reconstruction tools such as 3DGS directly, without the complex adaptation that panoramic inputs would otherwise require.
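
To make the panorama-to-perspective step concrete, here is a small NumPy sketch that samples a pinhole-camera view from an equirectangular panorama. The function name, camera conventions (camera looking along +z, y pointing down), and nearest-neighbour sampling are assumptions made for illustration; the paper's MVP step may use a different projection layout and interpolation.

```python
import numpy as np


def pano_to_perspective(pano, fov_deg=90.0, yaw_deg=0.0, pitch_deg=0.0, out_hw=(512, 512)):
    """Sample a perspective view from an equirectangular panorama of shape (H_p, W_p, 3)."""
    h_p, w_p = pano.shape[:2]
    h, w = out_hw
    f = 0.5 * w / np.tan(np.radians(fov_deg) / 2)  # pinhole focal length in pixels

    # Ray directions through each output pixel; camera looks along +z, y points down.
    xs, ys = np.meshgrid(np.arange(w) - w / 2 + 0.5, np.arange(h) - h / 2 + 0.5)
    dirs = np.stack([xs, ys, np.full_like(xs, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate the rays by the requested yaw (around y) and pitch (around x).
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    r_yaw = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
                      [0.0, 1.0, 0.0],
                      [-np.sin(yaw), 0.0, np.cos(yaw)]])
    r_pitch = np.array([[1.0, 0.0, 0.0],
                        [0.0, np.cos(pitch), -np.sin(pitch)],
                        [0.0, np.sin(pitch), np.cos(pitch)]])
    dirs = dirs @ (r_yaw @ r_pitch).T

    # Map each ray direction to (longitude, latitude), then to panorama pixel coordinates.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])        # in [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))   # in [-pi/2, pi/2]
    u = ((lon / (2 * np.pi) + 0.5) * w_p).astype(int) % w_p
    v = np.clip(((lat / np.pi + 0.5) * h_p).astype(int), 0, h_p - 1)
    return pano[v, u]  # nearest-neighbour lookup
```

Calling this with yaw values of 0, 90, 180, and 270 degrees and a 90-degree field of view, for instance, yields four perspective views that together cover the panorama horizontally and could feed a standard multi-view reconstruction pipeline.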

Implications and Future Horizons

Practical Applications

The ability to rapidly generate 3D models from simple text inputs can significantly transform industries such as interior design, gaming, and virtual reality, offering a quick way to prototype environments without deep technical expertise in 3D modeling.

Theoretical Contributions

FastScene represents a significant advance in handling panoramic data and text-to-3D transformation, showing how the integration of different AI techniques can solve complex spatial and perceptual challenges efficiently.

Future Developments

Continued advances in AI and machine learning could lead to even faster processing times and more detailed, dynamically interactive 3D environments generated from even more succinct descriptions. Exploring the integration of FastScene's capabilities with real-time user interactions in VR could also be a potential area for further research.

Conclusion

FastScene sets itself apart by not only focusing on the speed and quality of the generated 3D scenes but also ensuring that these virtual constructions remain consistent across different viewpoints and user interactions. Its application can make the generation of digital environments more accessible and significantly quicker, pushing the boundaries of what can be automatically created from minimal input.