Open-Universe Indoor Scene Generation using LLM Program Synthesis and Uncurated Object Databases (2403.09675v1)

Published 14 Mar 2024 in cs.CV and cs.GR

Abstract: We present a system for generating indoor scenes in response to text prompts. The prompts are not limited to a fixed vocabulary of scene descriptions, and the objects in generated scenes are not restricted to a fixed set of object categories -- we call this setting open-universe indoor scene generation. Unlike most prior work on indoor scene generation, our system does not require a large training dataset of existing 3D scenes. Instead, it leverages the world knowledge encoded in pre-trained LLMs to synthesize programs in a domain-specific layout language that describe objects and spatial relations between them. Executing such a program produces a specification of a constraint satisfaction problem, which the system solves using a gradient-based optimization scheme to produce object positions and orientations. To produce object geometry, the system retrieves 3D meshes from a database. Unlike prior work which uses databases of category-annotated, mutually-aligned meshes, we develop a pipeline using vision-language models (VLMs) to retrieve meshes from massive databases of un-annotated, inconsistently-aligned meshes. Experimental evaluations show that our system outperforms generative models trained on 3D data for traditional, closed-universe scene generation tasks; it also outperforms a recent LLM-based layout generation method on open-universe scene generation.

The paper presents a system for "open-universe" indoor scene generation from natural language text prompts. Unlike prior work that trains on fixed datasets of 3D scenes with curated object categories, this system can generate a wide variety of room types and incorporate objects beyond a predefined vocabulary. It achieves this by leveraging the world knowledge embedded in LLMs and vision-language models (VLMs), combined with a 3D object database that requires neither category annotations nor consistent mesh alignment.

The core idea is to decompose the complex task of scene generation into several manageable steps:

  1. Scene Program Synthesis: An LLM translates the natural language prompt into a declarative program written in a custom domain-specific language (DSL). This DSL describes the objects in the scene and their spatial relationships using constraints, rather than precise numerical coordinates. This design is motivated by the observation that LLMs reason more reliably about relative spatial relationships than about precise metric coordinates.
  2. Layout Optimization: The scene program is converted into a geometric constraint satisfaction problem. A gradient-based optimization scheme solves this problem to determine the positions and orientations of all objects in the scene, producing a structured object layout (a minimal sketch of this idea follows this list). The optimizer includes mechanisms to handle potential errors and contradictions in the LLM-generated program and uses "repel forces" to produce plausible layouts that avoid unnecessary clutter or overlap.
  3. Object Retrieval: For each object specified in the generated layout, the system retrieves a suitable 3D mesh from a large, unannotated database such as Objaverse (Deitke et al., 2022; Deitke et al., 2023). This is done using VLMs (specifically, SigLIP [Zhai_2023_ICCV]) to embed both the object's text description and renderings of candidate 3D meshes; see the retrieval sketch after this list. A category-aware re-ranking step and a multi-object filtering step using a multimodal LLM (GPT-4V) improve retrieval accuracy, ensuring that the retrieved mesh matches the desired category and contains only the specified object. The system also filters candidates based on how well their bounding-box aspect ratio matches the specified object dimensions.
  4. Object Orientation: The retrieved 3D meshes, which often lack consistent orientation, must be aligned with the scene layout. The system first aligns the mesh's upright direction with the scene's vertical axis based on bounding-box aspect-ratio distortion. It then combines VLM similarity to the text "the front of a [category]" with a multimodal LLM (GPT-4V) to determine which of the four horizontal directions corresponds to the object's front face.
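
To make steps 1 and 2 concrete, below is a minimal, hypothetical sketch of executing a declarative scene program as a differentiable constraint-satisfaction problem. The object specs, relation names (`adjacent`, `away_from`), loss weights, and repel term are illustrative assumptions rather than the paper's actual DSL or executor, which also handles orientations, walls, and contradictory constraints.

```python
# Hypothetical sketch: executing a declarative layout "program" as a
# differentiable constraint-satisfaction problem. Positions only;
# orientations and room walls are omitted for brevity.
import torch

# A toy scene program: object footprints (width, depth in metres) plus
# pairwise spatial relations. Names and relations are illustrative.
objects = {"bed": (2.0, 1.6), "nightstand": (0.5, 0.4), "desk": (1.2, 0.6)}
relations = [("nightstand", "adjacent", "bed"),
             ("desk", "away_from", "bed")]

names = list(objects)
idx = {n: i for i, n in enumerate(names)}
pos = torch.randn(len(names), 2, requires_grad=True)  # XY positions
radius = torch.tensor([max(w, d) / 2 for w, d in objects.values()])

opt = torch.optim.Adam([pos], lr=0.05)
for step in range(500):
    opt.zero_grad()
    loss = torch.tensor(0.0)
    # Relation terms: pull "adjacent" pairs to touching distance,
    # push "away_from" pairs at least 2 m apart (quadratic penalties).
    for a, rel, b in relations:
        d = torch.norm(pos[idx[a]] - pos[idx[b]])
        touch = radius[idx[a]] + radius[idx[b]]
        if rel == "adjacent":
            loss = loss + (d - touch) ** 2
        elif rel == "away_from":
            loss = loss + torch.relu(2.0 - d) ** 2
    # Repel term: penalize overlap between every pair of objects,
    # analogous in spirit to the paper's anti-clutter "repel forces".
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            d = torch.norm(pos[i] - pos[j])
            loss = loss + torch.relu(radius[i] + radius[j] - d) ** 2
    loss.backward()
    opt.step()

for n in names:
    print(n, [round(v, 2) for v in pos[idx[n]].tolist()])
```

The design choice this illustrates is that every declarative relation becomes a differentiable penalty, so a generic gradient optimizer can satisfy all constraints jointly, and random initialization yields layout variations from the same program.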

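Steps 3 and 4 can be sketched in a similar spirit with off-the-shelf VLM embeddings. The following is a hypothetical illustration using a SigLIP checkpoint from Hugging Face transformers; the checkpoint name and prompt template are assumptions, the candidate renders are assumed to be PIL images produced by an external renderer, and the paper's full pipeline adds category-aware re-ranking, aspect-ratio filtering, and GPT-4V verification on top of such raw similarity scores.

```python
# Hypothetical sketch: VLM-based mesh retrieval and front-face selection.
# Checkpoint name and prompt template are assumptions for illustration;
# all image arguments are PIL.Image renders produced elsewhere.
import torch
import torch.nn.functional as F
from transformers import AutoProcessor, SiglipModel

CKPT = "google/siglip-base-patch16-224"
model = SiglipModel.from_pretrained(CKPT)
processor = AutoProcessor.from_pretrained(CKPT)

def embed_texts(texts):
    # SigLIP was trained with max-length padding, hence padding="max_length".
    inputs = processor(text=texts, padding="max_length", return_tensors="pt")
    with torch.no_grad():
        return F.normalize(model.get_text_features(**inputs), dim=-1)

def embed_images(images):
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        return F.normalize(model.get_image_features(**inputs), dim=-1)

def rank_candidates(description, candidate_renders):
    """Rank candidate meshes by cosine similarity between the object's
    text description and one render per mesh (best match first)."""
    t = embed_texts([description])           # (1, D)
    v = embed_images(candidate_renders)      # (N, D)
    return (v @ t.T).squeeze(-1).argsort(descending=True)

def pick_front_face(category, four_side_renders):
    """Choose which of four horizontal views is the object's front by
    scoring each view against the text 'the front of a {category}'."""
    t = embed_texts([f"the front of a {category}"])
    v = embed_images(four_side_renders)      # (4, D)
    return int((v @ t.T).argmax())           # 0-3 -> yaw of 0/90/180/270 deg
```

In the paper's pipeline such raw similarity scores are only a first pass; candidates are then re-ranked by category and filtered with GPT-4V before a mesh is accepted.
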
The paper evaluates the approach both qualitatively and quantitatively. Qualitative results demonstrate the system's ability to generate diverse indoor scenes, including common rooms, rooms for specific activities, stylish rooms, and fantastical spaces, based on open-ended text prompts. Optional inputs like room size and object density can also be controlled, and the stochastic nature of the layout optimizer allows for generating variations of the same scene program.

In quantitative evaluations, the system is compared against prior closed-universe methods (ATISS [Paschalidou2021NEURIPS] and DiffuScene [tang2023diffuscene]) on generating standard room types (bedroom, living room, dining room). A perceptual study shows that layouts generated by this system are significantly preferred by human participants (79-81% preference) over those produced by the baseline methods, which often suffer from object overlaps and less plausible arrangements.

For open-universe generation, the system is compared against a modified LayoutGPT [feng2023layoutgpt] baseline across various prompt types (basic, completion, style, activity, fantastical, emotion). A perceptual study shows the proposed system's output is preferred overall (65% preference), particularly for style and emotion prompts. The comparison also highlights LayoutGPT's tendency toward object interpenetration, which the proposed constraint-based optimizer avoids. Ablation studies validate the effectiveness of the multi-stage program synthesis pipeline, the category-aware re-ranking and filtering in object retrieval, and the multi-step approach for object orientation.

The authors acknowledge limitations, including restricting rooms to four walls and objects to cardinal orientations. While the system demonstrates promising results, a small qualitative study indicated that generated scenes, while plausible in basic object grouping, sometimes lacked adherence to professional interior design principles (e.g., circulation space), suggesting avenues for future work, possibly by incorporating such principles into the DSL. The computational cost is also higher than that of traditional closed-universe methods, with a median scene generation time of around 25 minutes, primarily due to repeated LLM API calls for program synthesis, retrieval, and orientation; caching and model advancements could reduce this.

The key contributions are summarized as:

  • A declarative DSL for indoor scene layouts and a gradient-based executor.
  • A prompting workflow using LLMs for synthesizing DSL programs from text.
  • A pipeline using pretrained VLMs for retrieving and orienting 3D meshes from large, unannotated databases.
  • Evaluation protocols and benchmarks for open-universe indoor scene synthesis.

The paper highlights a practical implementation strategy for open-universe 3D scene generation by effectively combining the strengths of LLMs for high-level reasoning and knowledge, VLMs for visual understanding and retrieval, and traditional optimization techniques for satisfying geometric constraints.

References (76)
  1. Zero-Shot 3D Shape Correspondence. In SIGGRAPH Asia.
  2. SATR: Zero-Shot Semantic Segmentation of 3D Shapes. In Proceedings of the International Conference on Computer Vision (ICCV).
  3. Google DeepMind AlphaCode Team. 2023. AlphaCode 2 Technical Report. (2023).
  4. CompoNeRF: Text-guided Multi-object Compositional NeRF with Editable 3D Scene Layout. arXiv:2303.13843 [cs.CV]
  5. Graph Drawing: Algorithms for the Visualization of Graphs (1st ed.). Prentice Hall PTR, USA.
  6. Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
  7. Visual Programming for Text-to-Image Generation and Evaluation. In NeurIPS.
  8. Bob Coyne and Richard Sproat. 2001. WordsEye: an automatic text-to-scene conversion system. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’01). Association for Computing Machinery, New York, NY, USA, 487–496. https://doi.org/10.1145/383259.383316
  9. 3D Highlighter: Localizing Regions on 3D Shapes via Text Descriptions. CVPR.
  10. Objaverse-XL: A Universe of 10M+ 3D Objects. arXiv preprint arXiv:2307.05663 (2023).
  11. Objaverse: A Universe of Annotated 3D Objects. arXiv preprint arXiv:2212.08051 (2022).
  12. ProcTHOR: Large-Scale Embodied AI Using Procedural Generation. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 5982–5994. https://proceedings.neurips.cc/paper_files/paper/2022/file/27c546ab1e4f1d7d638e6a8dfbad9a07-Paper-Conference.pdf
  13. The Faiss library. (2024). arXiv:2401.08281 [cs.LG]
  14. Ctrl-Room: Controllable Text-to-3D Room Meshes Generation with Layout Constraints. arXiv preprint arXiv:2310.03602 (2023).
  15. LayoutGPT: Compositional Visual Planning and Generation with Large Language Models. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=Xu8aG5Q8M3
  16. Example-based synthesis of 3D object arrangements. ACM Transactions on Graphics (TOG) 31, 6 (2012), 135:1–11.
  17. 3D-FRONT: 3D Furnished Rooms with Layouts and Semantics. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10933–10942.
  18. Upright orientation of man-made objects. In ACM SIGGRAPH 2008 Papers (Los Angeles, California) (SIGGRAPH ’08). Association for Computing Machinery, New York, NY, USA, Article 42, 7 pages. https://doi.org/10.1145/1399504.1360641
  19. GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs. arXiv 2312.00093 (2023).
  20. SceneHGN: Hierarchical Graph Networks for 3D Indoor Scene Generation with Fine-Grained Geometry. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023), 1–18. https://doi.org/10.1109/TPAMI.2023.3237577
  21. OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
  22. Learning Interpretable Libraries by Compressing and Documenting Code. In Intrinsically-Motivated and Open-Ended Learning Workshop @NeurIPS2023. https://openreview.net/forum?id=4gYLottfsf
  23. Tanmay Gupta and Aniruddha Kembhavi. 2023. Visual Programming: Compositional Visual Reasoning Without Training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 14953–14962.
  24. CADTalk: An Algorithm and Benchmark for Semantic Commenting of CAD Programs. arXiv preprint arXiv:2311.16703 (2023).
  25. Jordan Hobbs. 2024. Why IKEA Uses 3D Renders vs. Photography for Their Furniture Catalog. https://www.cadcrowd.com/blog/why-ikea-uses-3d-renders-vs-photography-for-their-furniture-catalog/. Accessed: 2024-01-19.
  26. Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 7909–7920.
  27. Aladdin: Zero-Shot Hallucination of Stylized 3D Assets from Abstract Scene Descriptions. arXiv preprint arXiv:2306.06212 (2023).
  28. Large Language Models Cannot Self-Correct Reasoning Yet. arXiv:2310.01798 [cs.CL]
  29. Zero-Shot Text-Guided Object Generation with Dream Fields. (2022).
  30. Heewoo Jun and Alex Nichol. 2023. Shap-E: Generating Conditional 3D Implicit Functions. arXiv:2305.02463 [cs.CV]
  31. Learning 3D Scene Synthesis from Annotated RGB-D Images. In Computer Graphics Forum, Vol. 35. 197–206.
  32. GRAINS: Generative Recursive Autoencoders for INdoor Scenes. CoRR arXiv:1807.09193 (2018).
  33. Competition-level code generation with AlphaCode. Science 378, 6624 (Dec. 2022), 1092–1097. https://doi.org/10.1126/science.abq1158
  34. Automatic Data-Driven Room Design Generation. In Next Generation Computer Animation Techniques, Jian Chang, Jian Jun Zhang, Nadia Magnenat Thalmann, Shi-Min Hu, Ruofeng Tong, and Wencheng Wang (Eds.). Springer International Publishing, Cham, 133–148.
  35. Magic3D: High-Resolution Text-to-3D Content Creation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  36. Partslip: Low-shot part segmentation for 3d point clouds via pretrained image-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 21736–21746.
  37. ATT3D: Amortized Text-To-3D Object Synthesis. arXiv (2023).
  38. Scalable 3D Captioning with Pretrained Models. arXiv preprint arXiv:2306.07279 (2023).
  39. How Can Large Language Models Help Humans in Design and Manufacturing? arXiv:2307.14377 [cs.CL]
  40. Interactive furniture layout using interior design guidelines. In ACM SIGGRAPH 2011 Papers (Vancouver, British Columbia, Canada) (SIGGRAPH ’11). Association for Computing Machinery, New York, NY, USA, Article 87, 10 pages. https://doi.org/10.1145/1964921.1964982
  41. 4M: Massively Multimodal Masked Modeling. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=TegmlsD8oQ
  42. Point-E: A System for Generating 3D Point Clouds from Complex Prompts. arXiv:2212.08751 [cs.CV]
  43. ATISS: Autoregressive Transformers for Indoor Scene Synthesis. In Advances in Neural Information Processing Systems (NeurIPS).
  44. Planner5d. 2024. Planner5d: House Design Software. https://planner5d.com. Accessed: 2024-01-19.
  45. DreamFusion: Text-to-3D using 2D Diffusion. arXiv (2022).
  46. Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots.
  47. Human-centric Indoor Scene Synthesis Using Stochastic Grammar. In Conference on Computer Vision and Pattern Recognition (CVPR).
  48. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 8748–8763.
  49. Fast and Flexible Indoor Scene Synthesis via Deep Convolutional Generative Models. In CVPR 2019.
  50. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752 [cs.CV]
  51. Mathematical discoveries from program search with large language models. Nature (2023). https://doi.org/10.1038/s41586-023-06924-6
  52. RoomSketcher. 2024. Create Floor Plans and Home Designs Online. http://www.roomsketcher.com. Accessed: 2024-01-19.
  53. ConDor: Self-Supervised Canonicalization of 3D Pose for Partial Shapes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  54. CLIP-Forge: Towards Zero-Shot Text-To-Shape Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 18603–18613.
  55. CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes from Natural Language. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  56. ControlRoom3D: Room Generation using Semantic Proxy Rooms. arXiv:2312.05208 (2023).
  57. 3D-GPT: Procedural 3D Modeling with Large Language Models. arXiv:2310.12945 [cs.CV]
  58. ViperGPT: Visual Inference via Python Execution for Reasoning. Proceedings of IEEE International Conference on Computer Vision (ICCV) (2023).
  59. DiffuScene: Scene Graph Denoising Diffusion Probabilistic Model for Generative Indoor Scene Synthesis. arXiv preprint (2023).
  60. Target. 2024. Home Planner. https://www.target.com/room-planner/home. Accessed: 2024-01-19.
  61. Solving Olympiad Geometry without Human Demonstrations. Nature (2024). https://doi.org/10.1038/s41586-023-06747-5
  62. Planit: Planning and instantiating indoor scenes with relation graph and spatial prior networks. ACM Transactions on Graphics (TOG) 38, 4 (2019), 132.
  63. Deep Convolutional Priors for Indoor Scene Synthesis. In Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH).
  64. SceneFormer: Indoor Scene Generation with Transformers. arXiv preprint arXiv:2012.09793 (2020).
  65. ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation. In Advances in Neural Information Processing Systems (NeurIPS).
  66. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In NeurIPS.
  67. ULIP: Learning Unified Representation of Language, Image and Point Cloud for 3D Understanding. In CVPR 2023.
  68. Habitat-Matterport 3D Semantics Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 4927–4936.
  69. Holodeck: Language Guided Generation of 3D Embodied AI Environments. arXiv preprint arXiv:2312.09067 (2023).
  70. Synthesizing open worlds with constraints using locally annealed reversible jump MCMC. ACM Transactions on Graphics (TOG) 31, 4, Article 56 (July 2012), 11 pages. https://doi.org/10.1145/2185520.2185552
  71. GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models. arXiv preprint arXiv:2310.08529 (2023).
  72. Make it home: automatic optimization of furniture arrangement. ACM Transactions on Graphics (TOG) 30, 4 (2011), 86:1–12.
  73. Sigmoid Loss for Language Image Pre-Training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 11975–11986.
  74. Deep Generative Modeling for Scene Synthesis via Hybrid Representations. CoRR abs/1808.02084 (2018). arXiv:1808.02084 http://arxiv.org/abs/1808.02084
  75. PartSLIP++: Enhancing Low-Shot 3D Part Segmentation via Multi-View Instance Segmentation and Maximum Likelihood Estimation. arXiv:2312.03015 [cs.CV]
  76. SceneGraphNet: Neural Message Passing for 3D Indoor Scene Augmentation. In IEEE Conference on Computer Vision (ICCV).
Authors (10)
  1. Rio Aguina-Kang
  2. Maxim Gumin
  3. Do Heon Han
  4. Stewart Morris
  5. Seung Jean Yoo
  6. Aditya Ganeshan
  7. R. Kenny Jones
  8. Qiuhong Anna Wei
  9. Kailiang Fu
  10. Daniel Ritchie