The paper "Ctrl-Room: Controllable Text-to-3D Room Meshes Generation with Layout Constraints" presents a novel method for generating 3D indoor scenes from textual descriptions, addressing challenges related to layout constraints and editability. The proposed approach, termed Ctrl-Room, focuses on creating realistic room layouts and enabling interactive editing of individual elements within the generated scene.
Key Contributions:
- Separation of Layout and Appearance Modeling: The approach emphasizes the separation of geometric layout generation from appearance generation. By doing so, the method ensures that the generated 3D spaces align with designer-style layouts while maintaining high visual fidelity for textures and object appearances.
- Scene Code Parameterization: Ctrl-Room introduces a novel parameterization of indoor scenes using a "scene code" that encodes each furniture item and architectural element (like walls, doors, and windows) with attributes such as position, size, semantic class, and orientation. This encoding facilitates both the generation and editing of 3D scenes.
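The scene code is described as a per-element encoding of position, size, semantic class, and orientation. A minimal sketch of such a parameterization (all names and the 7-dimensional layout are illustrative assumptions, not the paper's exact encoding) could look like:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SceneObject:
    """One furniture item or architectural element (wall, door, window)."""
    semantic_class: str       # e.g. "bed", "wall", "window"
    position: np.ndarray      # (3,) center in room coordinates
    size: np.ndarray          # (3,) width / height / depth
    orientation: float        # yaw angle in radians

    def to_vector(self) -> np.ndarray:
        """Flatten continuous attributes into a fixed-length vector
        suitable as input to a diffusion model."""
        return np.concatenate([self.position, self.size, [self.orientation]])


def scene_code(objects: list[SceneObject]) -> np.ndarray:
    """Stack per-object vectors into an (N, 7) scene-code matrix."""
    return np.stack([o.to_vector() for o in objects])
```

Editing an object then amounts to modifying one row of this matrix, which is what makes the representation convenient for both generation and interactive editing.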
- Two-Stage Generation Process:
  - Layout Generation Stage: Utilizes a diffusion model to learn and generate plausible room layouts from textual input. The model is trained on the Structured3D dataset and uses a comprehensive encoding of room layouts that covers not only furniture arrangements but also structural elements like walls.
  - Appearance Generation Stage: A fine-tuned ControlNet model generates a panoramic image of the room, guided by the layout information. This panorama encapsulates the room's appearance and is reconstructed into a textured 3D mesh.
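The two-stage flow above can be sketched as a pipeline skeleton. This is a stand-in, not the paper's implementation: the layout "diffusion model" is replaced by random sampling, and the panorama projection paints simple bands instead of a real equirectangular rendering, just to show how stage 2 is conditioned on stage 1's output.

```python
import numpy as np


def sample_layout(text_prompt: str, num_objects: int = 4, rng=None) -> np.ndarray:
    """Stage 1 stand-in: a trained layout diffusion model would denoise a
    scene code conditioned on the text; here we draw a random (N, 7) code."""
    rng = rng or np.random.default_rng(0)
    return rng.uniform(-1.0, 1.0, size=(num_objects, 7))


def layout_to_semantic_panorama(layout: np.ndarray,
                                height: int = 64,
                                width: int = 128) -> np.ndarray:
    """Project layout objects into an equirectangular semantic map of class IDs.
    The real geometric projection is omitted; each object paints one band."""
    pano = np.zeros((height, width), dtype=np.int32)
    band = height // (len(layout) + 1)
    for i in range(len(layout)):
        pano[i * band:(i + 1) * band, :] = i + 1
    return pano


def generate_room(text_prompt: str):
    """Two-stage pipeline: geometry first, then appearance conditioned on it."""
    layout = sample_layout(text_prompt)                  # stage 1: layout code
    semantic_pano = layout_to_semantic_panorama(layout)  # layout -> condition map
    # Stage 2 would call a fine-tuned ControlNet with (text_prompt, semantic_pano)
    # and return an RGB panorama; omitted here.
    return layout, semantic_pano
```

The key design point the sketch preserves is that appearance generation never invents geometry: it only fills in texture for a layout fixed in stage 1.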
- Interactive Editing Capabilities: The methodology allows for interactive adjustments, such as resizing or moving furniture items. This is achieved through a mask-guided editing module that modifies the panoramic image based on changes in the scene layout, thus updating the final 3D room representation with minimal additional training.
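A mask-guided edit of this kind can be illustrated with a small sketch (assumed logic, not the paper's code): when an object moves, both its old footprint (which must be in-painted) and its new footprint (which must be rendered) are repainted, while all other panorama pixels are kept.

```python
import numpy as np


def box_mask(shape, top, left, h, w):
    """Binary mask for an axis-aligned region on the panorama."""
    m = np.zeros(shape, dtype=bool)
    m[top:top + h, left:left + w] = True
    return m


def edit_mask(shape, old_box, new_box):
    """Union of the object's old and new footprints: the old area needs
    in-filling and the new area needs rendering, so both are repainted."""
    return box_mask(shape, *old_box) | box_mask(shape, *new_box)


def apply_edit(panorama, mask, repaint):
    """Keep pixels outside the mask; take repainted pixels inside it."""
    return np.where(mask[..., None], repaint, panorama)
```

Because regeneration is confined to the masked region, the rest of the scene stays pixel-identical, which is what keeps edits consistent without retraining the full model.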
- Optimization for Efficient Generation: Ctrl-Room generates high-quality panoramas and 3D meshes in a fraction of the time required by prior approaches such as MVDiffusion and Text2Room.
Performance Evaluation:
Extensive experiments demonstrate that Ctrl-Room surpasses existing methods in generating view-consistent, semantically plausible, and editable 3D rooms from natural language inputs. Quantitative metrics like FID, CLIP Score, and Inception Score validate the visual and structural quality of the generated scenes. Furthermore, qualitative assessments through user studies highlight the perceptual quality and 3D structural completeness of the mesh models created by Ctrl-Room.
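For reference, the FID metric used in this evaluation measures the Fréchet distance between Gaussians fitted to deep features of real and generated images. A self-contained sketch of the computation (feature extraction with an Inception network is omitted; the stable symmetric-square-root trick is a standard numerical choice, not taken from the paper):

```python
import numpy as np


def _psd_sqrt(mat: np.ndarray) -> np.ndarray:
    """Square root of a symmetric positive-semidefinite matrix via eigh."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # clamp tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def fid(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """FID between Gaussians fitted to two (n, d) feature arrays:
    ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2})."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    sqrt_a = _psd_sqrt(cov_a)
    # Tr((C_a C_b)^{1/2}) equals the trace of the sqrt of this symmetric form.
    covmean_trace = np.trace(_psd_sqrt(sqrt_a @ cov_b @ sqrt_a))
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b)
                 - 2.0 * covmean_trace)
```

Lower FID indicates that generated panoramas are statistically closer to real ones; identical feature sets give a score of zero.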
Overall, Ctrl-Room represents a significant contribution to the field of text-driven 3D scene generation, offering a robust solution for producing and dynamically modifying indoor scenes with attention to both geometrical layout and aesthetic detail.