Ctrl-Room: Controllable Text-to-3D Room Meshes Generation with Layout Constraints (2310.03602v3)

Published 5 Oct 2023 in cs.CV

Abstract: Text-driven 3D indoor scene generation is useful for gaming, the film industry, and AR/VR applications. However, existing methods cannot faithfully capture the room layout, nor do they allow flexible editing of individual objects in the room. To address these problems, we present Ctrl-Room, which can generate convincing 3D rooms with designer-style layouts and high-fidelity textures from just a text prompt. Moreover, Ctrl-Room enables versatile interactive editing operations such as resizing or moving individual furniture items. Our key insight is to separate the modeling of layouts and appearance. Our proposed method consists of two stages: a Layout Generation Stage and an Appearance Generation Stage. The Layout Generation Stage trains a text-conditional diffusion model to learn the layout distribution with our holistic scene code parameterization. Next, the Appearance Generation Stage employs a fine-tuned ControlNet to produce a vivid panoramic image of the room guided by the 3D scene layout and text prompt. We thus achieve a high-quality 3D room generation with convincing layouts and lively textures. Benefiting from the scene code parameterization, we can easily edit the generated room model through our mask-guided editing module, without expensive edit-specific training. Extensive experiments on the Structured3D dataset demonstrate that our method outperforms existing methods in producing more reasonable, view-consistent, and editable 3D rooms from natural language prompts.

The paper "Ctrl-Room: Controllable Text-to-3D Room Meshes Generation with Layout Constraints" presents a novel method for generating 3D indoor scenes from textual descriptions, addressing challenges related to layout constraints and editability. The proposed approach, termed Ctrl-Room, focuses on creating realistic room layouts and enabling interactive editing of individual elements within the generated scene.

Key Contributions:

  1. Separation of Layout and Appearance Modeling: The approach emphasizes the separation of geometric layout generation from appearance generation. By doing so, the method ensures that the generated 3D spaces align with designer-style layouts while maintaining high visual fidelity for textures and object appearances.
  2. Scene Code Parameterization: Ctrl-Room introduces a novel parameterization of indoor scenes using a "scene code" that encodes each furniture item and architectural element (walls, doors, and windows) with attributes such as position, size, semantic class, and orientation. This encoding underpins both the generation and the editing of 3D scenes; a minimal sketch of such an encoding follows this list.
  3. Two-Stage Generation Process:
    • Layout Generation Stage: Utilizes a diffusion model to learn and generate plausible room layouts based on textual input. The model is trained on the Structured3D dataset and involves a comprehensive encoding of room layouts that includes not only furniture arrangements but also structural elements like walls.
    • Appearance Generation Stage: A fine-tuned ControlNet model generates a panoramic image of the room, guided by the 3D scene layout and the text prompt (see the panorama-projection sketch after this list). This panorama captures the room's appearance and is lifted into a textured 3D mesh.
  4. Interactive Editing Capabilities: The methodology allows for interactive adjustments, such as resizing or moving furniture items. This is achieved through a mask-guided editing module that resynthesizes only the affected region of the panoramic image when the scene layout changes, updating the final 3D room with minimal additional training (a generic inpainting analogue is sketched after this list).
  5. Optimization for Efficient Generation: The method achieves significant efficiency gains, generating high-quality panoramas and 3D models in a fraction of the time required by prior approaches such as MVDiffusion and Text2Room.
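
To make the scene-code idea concrete, here is a minimal sketch of how such a parameterization might look in code. The field names, attribute dimensions, and padding scheme are illustrative assumptions; the paper's exact tensor layout is not given in this summary.

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np

@dataclass
class SceneObject:
    """One furniture item or architectural element (wall, door, window).

    Field names and encodings are illustrative assumptions, not the
    paper's exact parameterization."""
    class_id: int          # semantic class index (e.g. bed, wall, window)
    position: np.ndarray   # (3,) box centre in room coordinates
    size: np.ndarray       # (3,) bounding-box extents
    orientation: float     # yaw angle in radians

@dataclass
class SceneCode:
    objects: List[SceneObject] = field(default_factory=list)

    def to_tensor(self, max_objects: int = 32, num_classes: int = 40) -> np.ndarray:
        """Flatten the scene into a fixed-size (max_objects, D) matrix that a
        diffusion model can denoise; unused slots are zero-padded."""
        dim = num_classes + 3 + 3 + 2   # one-hot class + position + size + (cos, sin) yaw
        code = np.zeros((max_objects, dim), dtype=np.float32)
        for i, obj in enumerate(self.objects[:max_objects]):
            code[i, obj.class_id] = 1.0
            code[i, num_classes:num_classes + 3] = obj.position
            code[i, num_classes + 3:num_classes + 6] = obj.size
            # (cos, sin) encoding avoids the 2*pi wrap-around in raw angles
            code[i, num_classes + 6] = np.cos(obj.orientation)
            code[i, num_classes + 7] = np.sin(obj.orientation)
        return code
```

A fixed-size, zero-padded matrix like this is the standard way to feed a variable-length object set to a diffusion model, which is presumably why the holistic scene code makes both generation and per-object editing straightforward.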
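The appearance stage conditions ControlNet on the layout rendered as a panorama. The standard way to rasterize 3D layout geometry into an equirectangular image is to project points by longitude and latitude; the snippet below sketches that projection, assuming a camera at the origin with the y-axis up (the paper's exact conventions are not stated in this summary).

```python
import numpy as np

def project_to_panorama(points: np.ndarray, pano_w: int = 1024, pano_h: int = 512) -> np.ndarray:
    """Project 3D points onto an equirectangular panorama.

    Assumes the panoramic camera sits at the origin with the y-axis up;
    longitude maps to the horizontal axis and latitude to the vertical one.
    points: (N, 3) array of non-zero 3D coordinates; returns (N, 2) pixel coords.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    lon = np.arctan2(x, z)          # [-pi, pi], horizontal angle
    lat = np.arcsin(y / r)          # [-pi/2, pi/2], vertical angle
    u = (lon / (2.0 * np.pi) + 0.5) * pano_w
    v = (0.5 - lat / np.pi) * pano_h
    return np.stack([u, v], axis=1)

# Example: project the 8 corners of a unit box 2 m in front of the camera,
# e.g. one furniture bounding box taken from the scene code.
corners = np.array([[dx, dy, 2.0 + dz]
                    for dx in (-0.5, 0.5) for dy in (-0.5, 0.5) for dz in (-0.5, 0.5)])
print(project_to_panorama(corners))
```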
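The mask-guided editing module itself is specific to the paper, but its underlying mechanic, regenerating only a masked region of the panorama while freezing the rest, can be approximated with an off-the-shelf inpainting pipeline. The sketch below uses the diffusers library and the public stabilityai/stable-diffusion-2-inpainting checkpoint; the file names and prompt are hypothetical, and this is a generic analogue, not the paper's module.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Generic analogue of mask-guided editing: resynthesize only the masked
# region so the rest of the room panorama stays fixed. This is not the
# paper's own module; file names and the prompt are hypothetical.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

panorama = Image.open("room_panorama.png").convert("RGB")   # previously generated panorama
mask = Image.open("moved_sofa_mask.png").convert("L")       # white = region to regenerate

edited = pipe(
    prompt="a living room with a large grey sofa next to the window",
    image=panorama,
    mask_image=mask,
).images[0]
edited.save("room_panorama_edited.png")
```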

Performance Evaluation:

Extensive experiments demonstrate that Ctrl-Room surpasses existing methods in generating view-consistent, semantically plausible, and editable 3D rooms from natural language inputs. Quantitative metrics like FID, CLIP Score, and Inception Score validate the visual and structural quality of the generated scenes. Furthermore, qualitative assessments through user studies highlight the perceptual quality and 3D structural completeness of the mesh models created by Ctrl-Room.
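As a pointer to how such metrics are typically computed, the snippet below uses torchmetrics; the random tensors stand in for batches of real and generated panoramas, and the paper's exact backbones and sample counts are not specified in this summary.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore
from torchmetrics.multimodal.clip_score import CLIPScore

# Random uint8 tensors stand in for batches of real and generated panoramas.
real = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)   # Inception-v3 pool features
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())            # lower is better

inception = InceptionScore()
inception.update(fake)
is_mean, is_std = inception.compute()
print("Inception Score:", is_mean.item())      # higher is better

clip = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
clip.update(fake, ["a cozy bedroom with a double bed"] * fake.shape[0])
print("CLIP Score:", clip.compute().item())    # higher is better
```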

Overall, Ctrl-Room represents a significant contribution to the field of text-driven 3D scene generation, offering a robust solution for producing and dynamically modifying indoor scenes with attention to both geometrical layout and aesthetic detail.

References (44)
  1. Gaudi: A neural architect for immersive 3d scene generation. Proc. NeurIPS, 35:25102–25116, 2022.
  2. Text2shape: Generating shapes from natural language by learning joint embeddings. In Proc. ACCV, pp.  100–116. Springer, 2019.
  3. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. arXiv preprint arXiv:2303.13873, 2023.
  4. Text2light: Zero-shot text-driven hdr panorama generation. ACM Trans. Graphics, 41(6):1–16, 2022.
  5. ControlNet v1.1 segmentation model. https://github.com/lllyasviel/ControlNet-v1-1-nightly#controlnet-11-segmentation, 2023.
  6. Layouttransformer: Layout generation and completion with self-attention. In Proc. ICCV, pp.  1004–1014, 2021.
  7. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  8. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Proc. NeurIPS, 30, 2017.
  9. Text2room: Extracting textured 3d meshes from 2d text-to-image models. arXiv preprint arXiv:2303.11989, 2023.
  10. Layoutvae: Stochastic scene layout generation from a label set. In Proc. ICCV, pp.  9895–9904, 2019.
  11. Imagic: Text-based real image editing with diffusion models. In Proc. CVPR, pp.  6007–6017, 2023.
  12. Poisson surface reconstruction. In Proceedings of the fourth Eurographics symposium on Geometry processing, volume 7, pp.  0, 2006.
  13. Layoutgan: Generating graphic layouts with wireframe discriminators. arXiv preprint arXiv:1901.06767, 2019.
  14. Magic3d: High-resolution text-to-3d content creation. In Proc. CVPR, pp.  300–309, 2023.
  15. Coco-gan: Generation by parts via conditional coordinating. In Proc. ICCV, pp.  4512–4521, 2019.
  16. Infinitygan: Towards infinite-pixel image synthesis. arXiv preprint arXiv:2104.03963, 2021.
  17. Guided image synthesis via initial image editing in diffusion model. arXiv preprint arXiv:2305.03382, 2023.
  18. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  19. Dragondiffusion: Enabling drag-style manipulation on diffusion models. arXiv preprint arXiv:2307.02421, 2023.
  20. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
  21. Atiss: Autoregressive transformers for indoor scene synthesis. Proc. NeurIPS, 34:12013–12026, 2021.
  22. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
  23. Learning transferable visual models from natural language supervision. In Proc. ICML, pp.  8748–8763. PMLR, 2021.
  24. High-resolution image synthesis with latent diffusion models. In Proc. CVPR, pp.  10684–10695, 2022.
  25. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp.  234–241. Springer, 2015.
  26. Photorealistic text-to-image diffusion models with deep language understanding. Proc. NeurIPS, 35:36479–36494, 2022.
  27. Improved techniques for training gans. Proc. NeurIPS, 29, 2016.
  28. Let 2d diffusion model know 3d-consistency for robust text-to-3d generation. arXiv preprint arXiv:2303.07937, 2023.
  29. Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis. Proc. NeurIPS, 34:6087–6101, 2021.
  30. Panoformer: Panorama transformer for indoor 360 depth estimation. In Proc. ECCV, pp.  195–211. Springer, 2022.
  31. Dragdiffusion: Harnessing diffusion models for interactive point-based image editing. arXiv preprint arXiv:2306.14435, 2023.
  32. Conditional 360-degree image synthesis for immersive indoor scene decoration. arXiv preprint arXiv:2307.09621, 2023.
  33. Horizonnet: Learning room layout with 1d representation and pano stretch data augmentation. In Proc. CVPR, pp.  1047–1056, 2019.
  34. Diffuscene: Scene graph denoising diffusion probabilistic model for generative indoor scene synthesis. arXiv preprint arXiv:2303.14207, 2023a.
  35. Emergent correspondence from image diffusion. arXiv preprint arXiv:2306.03881, 2023b.
  36. Mvdiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion. arXiv preprint arXiv:2307.01097, 2023c.
  37. Let there be color! large-scale texturing of 3d reconstructions. In Proc. ECCV, pp.  836–850. Springer, 2014.
  38. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In Proc. CVPR, pp.  12619–12629, 2023a.
  39. Sceneformer: Indoor scene generation with transformers. In 2021 International Conference on 3D Vision (3DV), pp. 106–115. IEEE, 2021.
  40. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. arXiv preprint arXiv:2305.16213, 2023b.
  41. Ipo-ldm: Depth-aided 360-degree indoor rgb panorama outpainting via latent diffusion model. arXiv preprint arXiv:2307.03177, 2023.
  42. Text2nerf: Text-driven 3d scene generation with neural radiance fields. arXiv preprint arXiv:2305.11588, 2023.
  43. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
  44. Structured3d: A large photo-realistic dataset for structured 3d modeling. In Proc. ECCV, pp.  519–535. Springer, 2020.
Authors (6)
  1. Chuan Fang
  2. Xiaotao Hu
  3. Kunming Luo
  4. Ping Tan
  5. Yuan Dong
  6. Rakesh Shrestha
Citations (27)