The paper presents a system for "open-universe" indoor scene generation from natural language text prompts. Unlike prior work that trains on fixed datasets of 3D scenes with curated object categories, this system can generate a wide variety of room types and incorporate objects beyond a predefined vocabulary. It achieves this by leveraging the world knowledge embedded in LLMs and vision-language models (VLMs), combined with a 3D object database that requires neither category annotations nor consistent mesh alignment.
The core idea is to decompose the complex task of scene generation into several manageable steps:
- Scene Program Synthesis: An LLM translates the natural language prompt into a declarative program written in a custom domain-specific language (DSL). This DSL describes the objects in the scene and their spatial relationships using constraints, rather than precise numerical coordinates. This approach is motivated by the observation that LLMs perform better at reasoning about relative spatial relationships than precise metrics.
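To make the idea concrete, here is a minimal sketch of what such a declarative scene program might look like, expressed as hypothetical Python data structures. The paper's actual DSL syntax is not reproduced here; the object names, constraint kinds, and fields below are illustrative assumptions. The key property matches the paper's motivation: the program states relative spatial relationships, not numerical coordinates.

```python
# Hypothetical representation of a declarative scene program.
# Constraint kinds ("against_wall", "in_front_of") are illustrative, not the paper's DSL.
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str    # free-form description, e.g. "mid-century walnut desk"
    dims: tuple  # (width, depth, height) in meters

@dataclass
class Constraint:
    kind: str            # e.g. "against_wall", "in_front_of"
    subject: str         # object the constraint applies to
    anchor: str = None   # wall or other object it relates to
    distance: float = None

@dataclass
class SceneProgram:
    room_dims: tuple
    objects: list = field(default_factory=list)
    constraints: list = field(default_factory=list)

# Example prompt: "a cozy study with a desk against the wall and a chair in front of it"
program = SceneProgram(
    room_dims=(4.0, 3.0, 2.7),
    objects=[
        SceneObject("wooden desk", (1.4, 0.7, 0.75)),
        SceneObject("office chair", (0.6, 0.6, 1.0)),
    ],
    constraints=[
        Constraint("against_wall", "wooden desk", anchor="north"),
        Constraint("in_front_of", "office chair", anchor="wooden desk", distance=0.5),
    ],
)
```

Note that no object carries an (x, y) position: resolving the constraints into coordinates is left entirely to the downstream layout optimizer.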
- Layout Optimization: The scene program is converted into a geometric constraint satisfaction problem. A gradient-based optimization scheme solves this problem to determine the positions and orientations of all objects in the scene, producing a structured object layout. The optimizer includes mechanisms to handle potential errors and contradictions in the LLM-generated program and uses "repel forces" to produce plausible layouts that avoid unnecessary clutter or overlap.
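The optimization step can be illustrated with a toy sketch (not the paper's implementation): each pairwise constraint becomes a differentiable penalty on deviation from a target distance, a repulsion term pushes nearby objects apart, and plain gradient descent drives the positions toward a feasible layout. The function names and hyperparameters below are assumptions for illustration.

```python
# Toy sketch of constraint-based layout optimization with repel forces.
# Objects are 2D points; real systems also handle orientation and extents.
import numpy as np

def optimize_layout(pos, pairs, target_dist, n_steps=500, lr=0.01,
                    repel_radius=1.0, repel_weight=0.5):
    """pos: (N, 2) initial positions; pairs: list of (i, j) constrained pairs."""
    pos = pos.copy()
    for _ in range(n_steps):
        grad = np.zeros_like(pos)
        # Attraction: penalize deviation from each pair's target distance.
        for (i, j), d_t in zip(pairs, target_dist):
            delta = pos[i] - pos[j]
            d = np.linalg.norm(delta) + 1e-8
            g = 2.0 * (d - d_t) * delta / d
            grad[i] += g
            grad[j] -= g
        # Repulsion: any pair closer than repel_radius is pushed apart,
        # discouraging clutter and overlap.
        n = len(pos)
        for i in range(n):
            for j in range(i + 1, n):
                delta = pos[i] - pos[j]
                d = np.linalg.norm(delta) + 1e-8
                if d < repel_radius:
                    g = -repel_weight * (repel_radius - d) * delta / d
                    grad[i] += g
                    grad[j] -= g
        pos -= lr * grad
    return pos

rng = np.random.default_rng(0)
pos = rng.uniform(0, 3, size=(3, 2))          # random init gives layout variations
final = optimize_layout(pos, pairs=[(0, 1)], target_dist=[1.5])
```

Because the initialization is random, rerunning the optimizer yields different valid layouts for the same program, which is the source of the scene variations mentioned later in the paper.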
- Object Retrieval: For each object specified in the generated layout, the system retrieves a suitable 3D mesh from a large, unannotated database (e.g., Objaverse (Deitke et al., 2022; Deitke et al., 2023)). This is done using VLMs (specifically, SigLIP [Zhai_2023_ICCV]) to embed both the object's text description and renderings of candidate 3D meshes. A category-aware re-ranking step and a multi-object filtering step using a multimodal LLM (GPT-4V) improve retrieval accuracy, ensuring that the retrieved mesh matches the desired category and contains only the specified object. The system also filters candidates by how well their bounding-box aspect ratio matches the specified object dimensions.
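The core of this retrieval step (before the re-ranking and GPT-4V filtering) can be sketched as cosine-similarity search with an aspect-ratio gate. The embeddings here stand in for precomputed SigLIP embeddings of the query text and candidate renderings; the function name and tolerance parameter are assumptions.

```python
# Sketch of embedding-based mesh retrieval with an aspect-ratio filter.
# cand_embs stands in for SigLIP embeddings of candidate mesh renderings.
import numpy as np

def retrieve(query_emb, cand_embs, cand_aspects, target_aspect,
             aspect_tol=0.5, top_k=3):
    """Rank candidates by cosine similarity to the query, keeping only those
    whose bounding-box aspect ratio is within aspect_tol of the target."""
    sims = cand_embs @ query_emb / (
        np.linalg.norm(cand_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    ok = np.abs(np.asarray(cand_aspects) - target_aspect) <= aspect_tol
    sims = np.where(ok, sims, -np.inf)   # drop aspect-ratio mismatches
    order = np.argsort(-sims)
    return [int(i) for i in order[:top_k] if sims[i] > -np.inf]

query = np.array([1.0, 0.0])
cands = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
# Candidate 1 matches the text equally well but has the wrong proportions.
result = retrieve(query, cands, cand_aspects=[1.0, 3.0, 1.0], target_aspect=1.0)
```

In the full system, the survivors of this stage are then re-ranked by inferred category and vetted by GPT-4V to reject meshes containing extraneous objects.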
- Object Orientation: The retrieved 3D meshes, which often lack consistent orientation, must be aligned with the scene layout. The system first aligns the mesh's upright direction with the scene's vertical axis based on bounding-box aspect-ratio distortion. Then, it combines VLM similarity to the text "the front of a [category]" with a multimodal LLM (GPT-4V) to determine which of the four horizontal directions corresponds to the object's front face.
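The front-face selection can be sketched as scoring a rendering of each of the four cardinal yaw rotations against a text embedding for "the front of a [category]" and picking the best. The cosine-similarity scorer below is a stand-in for the VLM-similarity-plus-GPT-4V procedure described above; the function name is an assumption.

```python
# Sketch of front-face selection among four cardinal yaw rotations.
# render_embs stands in for VLM embeddings of renders at yaw 0/90/180/270 deg;
# front_text_emb for the embedding of "the front of a <category>".
import numpy as np

def choose_front(render_embs, front_text_emb):
    """Return the yaw (degrees) whose render best matches the 'front' text."""
    sims = render_embs @ front_text_emb / (
        np.linalg.norm(render_embs, axis=1) * np.linalg.norm(front_text_emb) + 1e-8)
    best = int(np.argmax(sims))
    return best * 90

front_text = np.array([0.0, 1.0])
renders = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, -1.0]])
yaw = choose_front(renders, front_text)   # render at 180 deg matches best here
```

Restricting the search to four candidate rotations mirrors the paper's stated limitation that objects are confined to cardinal orientations.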
The authors evaluate the approach both qualitatively and quantitatively. Qualitative results demonstrate the system's ability to generate diverse indoor scenes, including common rooms, rooms for specific activities, stylish rooms, and fantastical spaces, from open-ended text prompts. Optional inputs such as room size and object density can also be controlled. The stochastic nature of the layout optimizer additionally allows generating variations of the same scene program.
In quantitative evaluations, the system is compared against prior closed-universe methods (ATISS [Paschalidou2021NEURIPS] and DiffuScene [tang2023diffuscene]) on generating standard room types (bedroom, living room, dining room). A perceptual study shows that layouts generated by this system are significantly preferred by human participants (79-81% preference) over those produced by the baseline methods, which often suffer from object overlaps and less plausible arrangements.
For open-universe generation, the system is compared against a modified LayoutGPT [feng2023layoutgpt] baseline across various prompt types (basic, completion, style, activity, fantastical, emotion). A perceptual study shows the proposed system's output is preferred overall (65% preference), particularly for style and emotion prompts. The comparison also highlights LayoutGPT's tendency toward object interpenetration, which the proposed constraint-based optimizer avoids. Ablation studies validate the effectiveness of the multi-stage program synthesis pipeline, the category-aware re-ranking and filtering in object retrieval, and the multi-step approach to object orientation.
The authors acknowledge limitations, including restricting rooms to four walls and objects to cardinal orientations. While the system demonstrates promising results, a small qualitative study indicated that generated scenes, while plausible in basic object grouping, sometimes fail to follow professional interior design principles (e.g., circulation space), suggesting avenues for future work, possibly by incorporating such principles into the DSL. The computational cost is also higher than that of traditional closed-universe methods, with a median scene generation time of around 25 minutes, primarily due to repeated LLM API calls for program synthesis, retrieval, and orientation; caching and model advancements could improve this.
The key contributions are summarized as:
- A declarative DSL for indoor scene layouts and a gradient-based executor.
- A prompting workflow using LLMs for synthesizing DSL programs from text.
- A pipeline using pretrained VLMs for retrieving and orienting 3D meshes from large, unannotated databases.
- Evaluation protocols and benchmarks for open-universe indoor synthesis.
The paper highlights a practical implementation strategy for open-universe 3D scene generation by effectively combining the strengths of LLMs for high-level reasoning and knowledge, VLMs for visual understanding and retrieval, and traditional optimization techniques for satisfying geometric constraints.