The paper "Ctrl-Room: Controllable Text-to-3D Room Meshes Generation with Layout Constraints" presents a novel method for generating 3D indoor scenes from textual descriptions, addressing challenges related to layout constraints and editability. The proposed approach, termed Ctrl-Room, focuses on creating realistic room layouts and enabling interactive editing of individual elements within the generated scene.
Key Contributions:
- Separation of Layout and Appearance Modeling: The approach emphasizes the separation of geometric layout generation from appearance generation. By doing so, the method ensures that the generated 3D spaces align with designer-style layouts while maintaining high visual fidelity for textures and object appearances.
- Scene Code Parameterization: Ctrl-Room introduces a novel parameterization of indoor scenes using a "scene code" that encodes each furniture item and architectural element (like walls, doors, and windows) with attributes such as position, size, semantic class, and orientation. This encoding facilitates both the generation and editing of 3D scenes.
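The scene code is described as a per-element encoding of position, size, semantic class, and orientation. A minimal sketch of such a parameterization (all names and the 7-dimensional layout are illustrative assumptions, not the paper's exact encoding) could look like:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SceneObject:
    """One furniture item or architectural element (wall, door, window)."""
    semantic_class: str       # e.g. "bed", "wall", "window"
    position: np.ndarray      # (3,) center in room coordinates
    size: np.ndarray          # (3,) width / height / depth
    orientation: float        # yaw angle in radians

    def to_vector(self) -> np.ndarray:
        """Flatten continuous attributes into a fixed-length vector
        suitable as input to a diffusion model."""
        return np.concatenate([self.position, self.size, [self.orientation]])


def scene_code(objects: list[SceneObject]) -> np.ndarray:
    """Stack per-object vectors into an (N, 7) scene-code matrix."""
    return np.stack([o.to_vector() for o in objects])
```

Editing an object then amounts to modifying one row of this matrix, which is what makes the representation convenient for both generation and interactive editing.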
- Two-Stage Generation Process:
  - Layout Generation Stage: Utilizes a diffusion model to learn and generate plausible room layouts from textual input. The model is trained on the Structured3D dataset and uses a comprehensive encoding of room layouts that covers not only furniture arrangements but also structural elements like walls.
  - Appearance Generation Stage: A fine-tuned ControlNet model generates a panoramic image of the room, guided by the layout information. This panorama encapsulates the room's appearance and is reconstructed into a textured 3D mesh.
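The two-stage flow above can be sketched as a pipeline skeleton. This is a stand-in, not the paper's implementation: the layout "diffusion model" is replaced by random sampling, and the panorama projection paints simple bands instead of a real equirectangular rendering, just to show how stage 2 is conditioned on stage 1's output.

```python
import numpy as np


def sample_layout(text_prompt: str, num_objects: int = 4, rng=None) -> np.ndarray:
    """Stage 1 stand-in: a trained layout diffusion model would denoise a
    scene code conditioned on the text; here we draw a random (N, 7) code."""
    rng = rng or np.random.default_rng(0)
    return rng.uniform(-1.0, 1.0, size=(num_objects, 7))


def layout_to_semantic_panorama(layout: np.ndarray,
                                height: int = 64,
                                width: int = 128) -> np.ndarray:
    """Project layout objects into an equirectangular semantic map of class IDs.
    The real geometric projection is omitted; each object paints one band."""
    pano = np.zeros((height, width), dtype=np.int32)
    band = height // (len(layout) + 1)
    for i in range(len(layout)):
        pano[i * band:(i + 1) * band, :] = i + 1
    return pano


def generate_room(text_prompt: str):
    """Two-stage pipeline: geometry first, then appearance conditioned on it."""
    layout = sample_layout(text_prompt)                  # stage 1: layout code
    semantic_pano = layout_to_semantic_panorama(layout)  # layout -> condition map
    # Stage 2 would call a fine-tuned ControlNet with (text_prompt, semantic_pano)
    # and return an RGB panorama; omitted here.
    return layout, semantic_pano
```

The key design point the sketch preserves is that appearance generation never invents geometry: it only fills in texture for a layout fixed in stage 1.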
- Interactive Editing Capabilities: The methodology allows for interactive adjustments, such as resizing or moving furniture items. This is achieved through a mask-guided editing module that modifies the panoramic image based on changes in the scene layout, thus updating the final 3D room representation with minimal additional training.
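A mask-guided edit of this kind can be illustrated with a small sketch (assumed logic, not the paper's code): when an object moves, both its old footprint (which must be in-painted) and its new footprint (which must be rendered) are repainted, while all other panorama pixels are kept.

```python
import numpy as np


def box_mask(shape, top, left, h, w):
    """Binary mask for an axis-aligned region on the panorama."""
    m = np.zeros(shape, dtype=bool)
    m[top:top + h, left:left + w] = True
    return m


def edit_mask(shape, old_box, new_box):
    """Union of the object's old and new footprints: the old area needs
    in-filling and the new area needs rendering, so both are repainted."""
    return box_mask(shape, *old_box) | box_mask(shape, *new_box)


def apply_edit(panorama, mask, repaint):
    """Keep pixels outside the mask; take repainted pixels inside it."""
    return np.where(mask[..., None], repaint, panorama)
```

Because regeneration is confined to the masked region, the rest of the scene stays pixel-identical, which is what keeps edits consistent without retraining the full model.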
- Optimization for Efficient Generation: Ctrl-Room generates high-quality panoramas and 3D meshes in a fraction of the time required by prior approaches such as MVDiffusion and Text2Room.
Performance Evaluation:
Extensive experiments demonstrate that Ctrl-Room surpasses existing methods in generating view-consistent, semantically plausible, and editable 3D rooms from natural language inputs. Quantitative metrics like FID, CLIP Score, and Inception Score validate the visual and structural quality of the generated scenes. Furthermore, qualitative assessments through user studies highlight the perceptual quality and 3D structural completeness of the mesh models created by Ctrl-Room.
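For reference, the FID metric used in this evaluation measures the Fréchet distance between Gaussians fitted to deep features of real and generated images. A self-contained sketch of the computation (feature extraction with an Inception network is omitted; the stable symmetric-square-root trick is a standard numerical choice, not taken from the paper):

```python
import numpy as np


def _psd_sqrt(mat: np.ndarray) -> np.ndarray:
    """Square root of a symmetric positive-semidefinite matrix via eigh."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # clamp tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def fid(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """FID between Gaussians fitted to two (n, d) feature arrays:
    ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2})."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    sqrt_a = _psd_sqrt(cov_a)
    # Tr((C_a C_b)^{1/2}) equals the trace of the sqrt of this symmetric form.
    covmean_trace = np.trace(_psd_sqrt(sqrt_a @ cov_b @ sqrt_a))
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b)
                 - 2.0 * covmean_trace)
```

Lower FID indicates that generated panoramas are statistically closer to real ones; identical feature sets give a score of zero.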
Overall, Ctrl-Room represents a significant contribution to the field of text-driven 3D scene generation, offering a robust solution for producing and dynamically modifying indoor scenes with attention to both geometrical layout and aesthetic detail.