RoomPlanner: Automated Text-to-3D Room Generation
- RoomPlanner is a text-driven 3D scene generation system that creates detailed indoor environments from brief descriptions.
- It employs a cascade of five hierarchical LLM planners with strict spatial and semantic constraints to produce geometrically rational, editable layouts.
- The framework integrates point cloud construction and differentiable 3D rendering, achieving high visual quality with generation times under 30 minutes.
RoomPlanner is a fully automated framework for text-driven 3D room generation, designed to synthesize realistic and rational indoor scenes—comprising topology, architectural envelope, and object layouts—from brief and potentially ambiguous textual descriptions. Distinct from prior work requiring manual layout or panoramic controls, RoomPlanner decomposes the scene specification and synthesis pipeline into hierarchically organized language-driven planners and applies explicit, verifiable spatial and semantic constraints at every stage. Its design yields geometrically rational, editable scenes in under thirty minutes and demonstrates state-of-the-art visual quality and editability for LLM-based 3D scene synthesis (Sun et al., 21 Nov 2025).
1. Hierarchical Language-Driven Agent Planners
RoomPlanner utilizes a cascade of five structured LLM modules, each with purpose-specific prompt engineering and hard-coded architectural constraints that are enforced through templated NL-to-JSON mappings. The decomposition is as follows:
- Floor & Wall-Height Planner: Produces a set of rectangular room polygons $\{R_i\}$, a wall height $h$, and per-room material/color assignments. Constraints: rectangles are axis-aligned with side lengths within the $3$–$8$ m bounds given below and a bounded wall height; rooms must be simply connected and adjacent via full edge sharing. Walls are zero-thickness in layout.
- Doorway Planner: Given room adjacency and size, generates door/passage/connection tuples, enforcing connection types (doorframe/doorway/open), widths (1–2 m), and exterior access, matching room design styles.
- Window Planner: For each wall, outputs (direction, type [fixed/hung/slider], dimensions from catalog, count, sill base height within [50,120] cm). All windows in a given room share type/size.
- Object Grounding Planner: Suggests and grounds semantic objects per room, stating type, support (floor/wall), size tuple $(w, d, h)$, quantity, and arrangement strategies. It filters out disallowed categories (e.g., rugs, mats, windows, doors, curtains, ceiling fixtures) to prevent semantic collisions and guards against over- or under-furnishing via prompt refinement. Two-phase output: first a natural-language justification, then the structured schema.
- Object Alignment Prompt: Synthesizes a succinct style prompt (≤35 words) capturing material/atmosphere for asset retrieval during downstream geometry and texture instantiation.
Logical sequencing is implemented as a pure functional composition (see the provided pseudocode), enabling deterministic propagation of structured outputs.
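The paper's pseudocode is not reproduced here, but the composition pattern can be illustrated with a short Python sketch; the `Planner` type, stage names, and dict-based hand-off below are assumptions, not RoomPlanner's actual API.

```python
from typing import Callable

# Hypothetical stage signature: each planner maps the running scene
# specification (a plain dict) to an enriched copy of it.
Planner = Callable[[dict], dict]

def compose_planners(*stages: Planner) -> Planner:
    """Chain planner stages into one deterministic pipeline.

    Each stage receives the structured output of all previous stages,
    so constraints fixed upstream (room sizes, adjacency, ...) are
    visible to every downstream planner.
    """
    def pipeline(spec: dict) -> dict:
        for stage in stages:
            spec = stage(spec)  # pure functional hand-off between stages
        return spec
    return pipeline

# Illustrative usage with the five planners described above:
# plan_scene = compose_planners(
#     floor_and_wall_height_planner,
#     doorway_planner,
#     window_planner,
#     object_grounding_planner,
#     object_alignment_prompt,
# )
# scene_spec = plan_scene({"prompt": "a small Japanese-style living room"})
```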
2. Explicit Layout Criteria and Arrangement Constraints
All spatial reasoning is governed by non-overlap and accessibility constraints, described explicitly:
- Non-Overlap: For any two room rectangles $R_i, R_j$ with $i \neq j$, their interiors must be disjoint, $\mathrm{int}(R_i) \cap \mathrm{int}(R_j) = \varnothing$; exactly edge-sharing adjacencies must coincide along a segment of nonzero length, prohibiting volumetric intrusions.
- Connectivity: The room adjacency graph $G = (V, E)$, where $V$ comprises the rooms plus an "exterior" node and $E$ the shared-wall and door connections, is required to be connected, guaranteeing direct or indirect access throughout the floorplan.
- Size Bounds & Shape: Dimensions per room are clamped to $3$–$8$ m in any direction, with polygon area bounded accordingly; no free-standing or isolated rooms are permitted, and all rectangles are axis-aligned.
- Collision Avoidance: Object arrangements produced by LLM modules observe anti-overlap and accessibility constraints through prompt logic, not via explicit numerical optimization.
While the constraints are enforced through upstream LLM prompt engineering rather than numerical solvers, the effective search space of feasible room and object placements remains tightly bounded.
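RoomPlanner itself enforces these rules through prompt engineering rather than code, but a toy Python checker makes the geometric criteria concrete; the rectangle encoding and all function names below are illustrative assumptions, not part of the system.

```python
from collections import deque

# Axis-aligned room rectangle: (x_min, y_min, x_max, y_max) in metres.
Rect = tuple[float, float, float, float]

def interiors_overlap(a: Rect, b: Rect) -> bool:
    """True if the open interiors of two rectangles intersect."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

def shares_edge(a: Rect, b: Rect) -> bool:
    """True if two rectangles touch along a segment of nonzero length."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    touch_x = (ax1 == bx0 or bx1 == ax0) and min(ay1, by1) - max(ay0, by0) > 0
    touch_y = (ay1 == by0 or by1 == ay0) and min(ax1, bx1) - max(ax0, bx0) > 0
    return touch_x or touch_y

def size_ok(r: Rect, lo: float = 3.0, hi: float = 8.0) -> bool:
    """Clamp check for the 3-8 m side-length bounds."""
    w, d = r[2] - r[0], r[3] - r[1]
    return lo <= w <= hi and lo <= d <= hi

def layout_is_valid(rooms: list[Rect], doors: list[tuple[int, int]]) -> bool:
    """Check non-overlap, size bounds, and connectivity of the room graph."""
    n = len(rooms)
    if not all(size_ok(r) for r in rooms):
        return False
    for i in range(n):
        for j in range(i + 1, n):
            if interiors_overlap(rooms[i], rooms[j]):
                return False
    # Adjacency over shared walls plus explicit door connections.
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if shares_edge(rooms[i], rooms[j]):
                adj[i].add(j)
                adj[j].add(i)
    for i, j in doors:
        adj[i].add(j)
        adj[j].add(i)
    # Breadth-first search: every room must be reachable from room 0.
    seen, queue = {0}, deque([0])
    while queue:
        u = queue.popleft()
        for v in adj[u] - seen:
            seen.add(v)
            queue.append(v)
    return len(seen) == n
```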
3. Scene Initialization: Point Cloud Construction
Once the planners produce a complete description, the system constructs an initial 3D point-cloud as the basis for geometry and subsequent neural rendering:
- For each axis-aligned room rectangle, a wall/floor mesh is generated and sampled at regular intervals to produce occupancy points.
- For each grounded object, a CAD model is retrieved, rescaled, and surface-sampled to add points reflecting that object's real-scale positioning.
- The resulting point sets are aggregated into a unified scene-level point cloud.
The specific data structures and algorithms (e.g., whether k-d trees or voxel hashing are used for efficient proximity testing) are not detailed. This point cloud forms the input substrate for scene representation and synthesis.
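As a rough illustration of what this construction step could look like under the stated assumptions (regular-interval sampling, zero-thickness walls, pre-sampled and rescaled object surfaces), consider the NumPy sketch below; the sampling step and function names are assumptions rather than the paper's implementation.

```python
import numpy as np

def sample_room_shell(rect, wall_height, step=0.05):
    """Regularly sample the floor and four zero-thickness walls of an
    axis-aligned room rectangle into an (N, 3) point set."""
    x0, y0, x1, y1 = rect
    xs = np.arange(x0, x1 + step, step)
    ys = np.arange(y0, y1 + step, step)
    zs = np.arange(0.0, wall_height + step, step)

    # Floor: grid over the footprint at z = 0.
    fx, fy = np.meshgrid(xs, ys)
    floor = np.stack([fx.ravel(), fy.ravel(), np.zeros(fx.size)], axis=1)

    # Walls: grids over each edge, extruded up to wall_height.
    walls = []
    for x in (x0, x1):  # west / east walls
        wy, wz = np.meshgrid(ys, zs)
        walls.append(np.stack([np.full(wy.size, x), wy.ravel(), wz.ravel()], axis=1))
    for y in (y0, y1):  # south / north walls
        wx, wz = np.meshgrid(xs, zs)
        walls.append(np.stack([wx.ravel(), np.full(wx.size, y), wz.ravel()], axis=1))

    return np.concatenate([floor] + walls, axis=0)

def build_scene_cloud(rooms, wall_height, object_clouds):
    """Aggregate room shells and pre-sampled, rescaled object surfaces
    into a single scene-level point cloud."""
    parts = [sample_room_shell(r, wall_height) for r in rooms]
    parts.extend(object_clouds)  # one (N_i, 3) array per retrieved asset
    return np.concatenate(parts, axis=0)
```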
4. Differentiable 3D Scene Representation and Rendering
Following point cloud construction, RoomPlanner builds a neural 3D scene representation leveraging a coarse 3D Gaussian field (formulation not detailed in the available supplement; anticipated to be akin to recent 3D Gaussian Splatting/NGP paradigms).
For downstream camera supervision and scene exploration, the framework employs the novel AnyReach sampling strategy for camera trajectories:
- Zoom-In: The camera moves from the doorway cell along collision-free paths to successive furniture centroids, targeting each centroid in turn; traversability of every camera position is verified against a binary occupancy grid.
- Zoom-Out: Camera follows a spiral (spherical) trajectory around the global scene centroid, with view vectors aimed at the nearest object center.
- Hybrid: Mixes segments of the spiral and doorway-to-furniture trajectories, switching modes based on reachability and free-space confirmations against the occupancy grid.
- Interval Timestep Flow Sampling (ITFS): A rendering-time optimization for scene field updates is referenced but details are not provided.
These diverse camera trajectories are essential for efficient optimization of the field representation and effective scene viewing.
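The exact AnyReach parameterization is not given in the available material, but a plausible sketch of a Zoom-Out spiral sampler, look-at-nearest orientation, and the occupancy-grid traversability test used for Zoom-In/Hybrid switching might look as follows; every parameter and function name here is an assumption.

```python
import numpy as np

def zoom_out_spiral(center, radius, height, n_views=60, turns=2.0):
    """Sample camera positions on a rising spiral around the scene
    centroid; each pose is later aimed at the nearest object center."""
    t = np.linspace(0.0, 1.0, n_views)
    theta = 2.0 * np.pi * turns * t
    cx, cy, cz = center
    return np.stack([
        cx + radius * np.cos(theta),
        cy + radius * np.sin(theta),
        cz + height * t,  # rise while orbiting
    ], axis=1)

def look_at_nearest(position, object_centers):
    """Unit view direction from a camera position toward the nearest
    object center, used to orient each sampled pose."""
    object_centers = np.asarray(object_centers, dtype=float)
    offsets = object_centers - position
    target = object_centers[np.argmin(np.linalg.norm(offsets, axis=1))]
    view = target - position
    return view / np.linalg.norm(view)

def path_is_traversable(start, goal, occupancy, cell_size=0.1, n_samples=50):
    """Check that a straight segment between two positions stays inside
    the free space of a binary occupancy grid (True = occupied)."""
    start, goal = np.asarray(start, float), np.asarray(goal, float)
    for t in np.linspace(0.0, 1.0, n_samples):
        p = (1.0 - t) * start + t * goal
        i, j = int(p[0] // cell_size), int(p[1] // cell_size)
        if not (0 <= i < occupancy.shape[0] and 0 <= j < occupancy.shape[1]):
            return False  # left the mapped area
        if occupancy[i, j]:
            return False  # hit an occupied cell
    return True
```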
5. Pipeline Integration and High-level Workflow
A typical end-to-end workflow orchestrated by RoomPlanner can be summarized as follows (see provided detailed pseudocode):
- Run the five-stage LLM planning cascade to extract geometry and semantic asset requirements from a text prompt plus optional requirements.
- Compute structured outputs for rooms, walls, doors, windows, and furniture with all constraints encoded as hard prompt rules.
- Synthesize an initial 3D point cloud from semantic geometry and object placements.
- Convert this geometric skeleton into a 3D Gaussian scene field representation.
- For each selected camera sampling mode (Zoom-In, Zoom-Out, Hybrid), compute a trajectory and execute neural field optimization and rendering.
- Output the scene, including geometry, camera trajectories, and style-aligned asset retrieval guidance.
A worked example: "a small Japanese-style living room with tatami mats" yields a 4 × 4 m rectangle with specified wall/floor materials, a shoji “paper” door, two north-wall slider windows, and a furniture arrangement constrained to low sofas and tea tables, culminating in a rendered 3D scene navigated via a hybrid door-to-spiral camera trajectory.
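For concreteness, an illustrative structured specification for this example is shown below as a Python dict; the actual RoomPlanner schema is not published, so all field names, and any values beyond those stated above, are assumptions.

```python
# Illustrative structured planner output for the worked example; every
# field name here is an assumption, not RoomPlanner's published schema.
japanese_living_room = {
    "rooms": [{
        "name": "living_room",
        "rect_m": [0.0, 0.0, 4.0, 4.0],   # 4 x 4 m footprint
        "wall_material": "plaster, warm white",
        "floor_material": "tatami mats",
    }],
    "doors": [{
        "rooms": ["living_room", "exterior"],
        "type": "doorframe",
        "style": "shoji",
        "width_m": 1.2,                   # within the 1-2 m constraint
    }],
    "windows": [{
        "wall": "north",
        "type": "slider",
        "count": 2,
        "sill_height_cm": 80,             # within the [50, 120] cm range
    }],
    "objects": [
        {"category": "low sofa", "support": "floor", "quantity": 1},
        {"category": "tea table", "support": "floor", "quantity": 1},
    ],
    "style_prompt": "minimal Japanese interior, tatami, shoji screens, "
                    "soft natural light, wood and paper textures",
}
```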
6. Editability, Efficiency, and Performance Observations
RoomPlanner is designed to preserve high-level editability: since every discrete module operates on a structured, reversible schema (JSON or coordinate arrays at each stage), the system supports partial editing and quick regeneration of scene geometry or attributes at any planner stage.
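A hypothetical partial-edit loop under this design might look as follows; the `windows` field and the planner call convention are assumptions carried over from the earlier sketches.

```python
import copy

def edit_and_regenerate(spec: dict, downstream_planners) -> dict:
    """Hypothetical partial edit: change one field of the structured
    scene spec and re-run only the planners downstream of that stage,
    leaving the already-validated floorplan untouched."""
    edited = copy.deepcopy(spec)
    edited["windows"][0]["count"] = 3      # e.g. request a third window
    for planner in downstream_planners:    # grounding, alignment, ...
        edited = planner(edited)
    return edited
```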
- Performance: The supplement indicates a field optimization over 1,500 iterations with up to 3 million points. Full pipeline timing and comparative speed/quality metrics against prior approaches are not stated, though the abstract claims total generation time under 30 minutes.
- Rendering Quality: The method is asserted to surpass previous approaches in both visual quality and editability, based on their experimental evaluations (details not included in the supplement).
- Limitations: Mathematical formulations of collision penalties, detailed camera-interval sampling schemes, and benchmark comparisons are not provided; numerical and empirical performance bounds remain unspecified.
7. Position Relative to Existing Systems and Impact
RoomPlanner's decomposition of the scene specification problem into sequential, language-to-structure agent planning sets it apart from prior approaches that rely primarily on generative diffusion models conditioned on images or on shallow LLM prompt-to-3D pipelines. By making all spatial, topological, and object-level constraints explicit—enforced at the data-structural or prompt-engineering level rather than via implicit sampling or unconstrained generative flows—it offers a rigorous, interpretable methodology for LLM-driven interior scene synthesis.
Significantly, RoomPlanner represents a move toward deterministic, non-heuristic scene construction in LLM-integrated 3D graphics, and provides a blueprint for modular, tightly constrained, and editable compositional 3D scene generation for AR/VR, robotics simulation, and interior design automation (Sun et al., 21 Nov 2025).