HouseLayout3D: Building-Scale 3D Layout Estimation
- HouseLayout3D is a framework for automated 3D reconstruction and semantic scene graph creation of multi-room, multi-floor buildings.
- It leverages real and synthetic RGB-D data to extract structural elements, optimize geometry, and recover inter-room connectivity including stairs and doors.
- The benchmark dataset and MultiFloor3D pipeline significantly advance building-scale indoor scene understanding, supporting navigation, architectural design, and embodied intelligence.
HouseLayout3D constitutes a research frontier in 3D indoor scene understanding, layout estimation, and generative modeling at the building scale. This field addresses the automated reconstruction, generation, and benchmarking of multi-room, multi-floor house layouts using real and synthetic data, enabling advances in embodied intelligence, architectural design, robotics navigation, and virtual/augmented reality. HouseLayout3D systems focus on full-building spatial reasoning, the recovery of structural and semantic relations, and the generation of physically plausible, richly annotated 3D scenes.
1. Motivation and Problem Definition
Traditional 3D indoor layout estimation pipelines have predominantly targeted single-room or single-floor datasets (e.g., Structured3D, ASE), limiting their ability to capture the architectural complexity and global spatial context of real-world buildings. This fragmentation prevents joint reasoning about multi-floor interconnections such as staircases and hinders downstream tasks like global navigation and scene graph construction. The core HouseLayout3D problem is formulated as: given a set of RGB-D frames with known poses (or another building-scale input), recover a collection of labeled planar polygons (plane equations, 3D vertices, semantic class) and a scene graph over rooms and inter-room connections (doors, windows, stairs), supporting a watertight, semantically rich 3D model of the entire house (Bieri et al., 2 Dec 2025).
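The output described above can be pictured as a small data model. The following is a minimal sketch only, not the authors' format: the class names (`PlanarPolygon`, `SceneGraph`), the field names, and the helper `plane_residual` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class PlanarPolygon:
    """One structural element (hypothetical schema, for illustration only)."""
    plane: Tuple[float, float, float, float]  # (a, b, c, d) with ax + by + cz + d = 0
    vertices: List[Vec3]                      # ordered 3D boundary vertices
    semantic_class: str                       # e.g. "wall", "floor", "door"


@dataclass
class SceneGraph:
    """Rooms are nodes; doors, windows, and stairs are labeled connections."""
    rooms: Dict[int, str] = field(default_factory=dict)
    connections: List[Tuple[int, int, str]] = field(default_factory=list)

    def connect(self, room_a: int, room_b: int, kind: str) -> None:
        self.connections.append((room_a, room_b, kind))

    def neighbors(self, room_id: int) -> List[int]:
        """Return the sorted ids of all rooms directly connected to room_id."""
        found = set()
        for a, b, _kind in self.connections:
            if a == room_id:
                found.add(b)
            elif b == room_id:
                found.add(a)
        return sorted(found)


def plane_residual(poly: PlanarPolygon) -> float:
    """Largest absolute plane-equation value over the vertices.

    This equals the point-to-plane distance only when (a, b, c) is unit length.
    """
    a, b, c, d = poly.plane
    return max(abs(a * x + b * y + c * z + d) for x, y, z in poly.vertices)


# Example: a unit floor square at z = 0 and a three-room graph.
floor = PlanarPolygon(
    plane=(0.0, 0.0, 1.0, 0.0),
    vertices=[(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)],
    semantic_class="floor",
)
graph = SceneGraph(rooms={0: "hall", 1: "stairwell", 2: "bedroom"})
graph.connect(0, 1, "door")
graph.connect(1, 2, "stairs")
```

Keeping the plane equation alongside the vertices makes it cheap to check that each polygon is actually planar, one of the properties a watertight model depends on.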
2. The HouseLayout3D Benchmark Dataset
The "HouseLayout3D" dataset is the first real, multi-floor, multi-room benchmark with comprehensive structural annotation (Bieri et al., 2 Dec 2025). It consists of:
- Sources: 16 buildings from Matterport3D, 33 total floors, covering 317 rooms, 292 doors (open/closed + hinge), 379 windows (projected as axis-aligned wall rectangles), 34 staircases, and ≈26,000 RGB-D frames.
- Annotations: Each building includes a unified triangle mesh, a list of annotated 3D CAD polygons (walls, floors, ceilings, doors, windows, stairs), and a room connectivity/segmentation graph.
- Room graph: Encodes global adjacency, with an edge between two rooms if they are connected by a door or staircase.
- Format: Polygonal structural annotation for each element, with semantic tags and per-room assignment.
- Complexity: Supports variable numbers of rooms and floors (up to 40 rooms per floor, 5 floors per building), allowing for the evaluation of reasoning regarding multi-level vertical structures.
This resource closes the gap between synthetic single-room training sets and the practical demands of end-to-end scene understanding in real, complex buildings.
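The room graph described above lends itself to standard graph search. The sketch below assumes a symmetric adjacency matrix built from (room, room, kind) links and uses breadth-first search to find a route between rooms. The function names and the toy link list are illustrative and do not reflect the dataset's actual file format.

```python
from collections import deque


def build_adjacency(num_rooms, links):
    """Build a symmetric 0/1 adjacency matrix from (i, j, kind) links."""
    adj = [[0] * num_rooms for _ in range(num_rooms)]
    for i, j, _kind in links:
        adj[i][j] = 1
        adj[j][i] = 1
    return adj


def shortest_room_path(adj, start, goal):
    """Breadth-first search; returns a list of room ids, or None if unreachable."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        room = queue.popleft()
        if room == goal:
            path = []
            while room is not None:
                path.append(room)
                room = previous[room]
            return path[::-1]
        for nxt, connected in enumerate(adj[room]):
            if connected and nxt not in previous:
                previous[nxt] = room
                queue.append(nxt)
    return None


# Toy building: rooms 0-1 on one floor, 2-3 on the next, room 4 disconnected.
links = [(0, 1, "door"), (1, 2, "staircase"), (2, 3, "door")]
adjacency = build_adjacency(5, links)
route = shortest_room_path(adjacency, 0, 3)
```

Because staircases are ordinary edges in this representation, a route that crosses floors needs no special handling.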
3. Learning-Free Full-Building Estimation: MultiFloor3D Pipeline
To establish a baseline for building-scale layout recovery, the MultiFloor3D method provides a training-free pipeline (Bieri et al., 2 Dec 2025):
- Mesh Reconstruction: Unposed RGB images are processed by COLMAP (SfM) for camera poses, Metric3D for per-view depth estimation, and 3D Gaussian Splatting (DN-Splatter) to produce a global mesh via Poisson surface reconstruction.
- Layout Skeleton Extraction: OneFormer segments semantic classes per image, which are backprojected to mesh vertices. Superpoint clustering yields coarse 3D labels, extracting structural skeletons (walls, floors, ceilings), small object subsets, and stair subsets.
- Layout Prototype Fitting: Planes are fit to skeleton clusters, and their vertices and parameters are optimized under geometry, empty-space, connectivity, and simplicity losses.
Floor holes are filled by projection, and walls and ceilings are extended where geometry is missing.
- Scene Graph Parsing and Room Extrusion: Floors are identified by clustering polygon heights, then 2D floorplans are built. Hov-SG segments rooms with wall layout, stair clusters link adjacent floors. Room volumes are extruded via constrained Delaunay triangulation and vertical ray casting.
- Window Fitting: Project window-class pixels onto walls, cluster, and fit axis-aligned rectangles.
- Output: A polygonal, semantically labeled, watertight house model and room–connectivity graph.
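Two of the simpler steps above, grouping floor polygons by height and fitting axis-aligned window rectangles on a wall, can be sketched as follows. This is an illustrative approximation rather than the MultiFloor3D implementation: the gap-based 1D clustering, the `gap` parameter, and the function names are assumptions, and the wall basis vectors are assumed to be unit length and to lie in the wall plane.

```python
def cluster_floor_heights(heights, gap=1.0):
    """Group heights into levels; a new level starts when the jump exceeds `gap`.

    Returns the mean height of each level, lowest first.
    """
    if not heights:
        return []
    ordered = sorted(heights)
    levels = [[ordered[0]]]
    for h in ordered[1:]:
        if h - levels[-1][-1] > gap:
            levels.append([h])       # large vertical jump: start a new floor level
        else:
            levels[-1].append(h)     # same level as the previous height
    return [sum(level) / len(level) for level in levels]


def fit_wall_rectangle(points, origin, u_axis, v_axis):
    """Project 3D points into wall coordinates (u, v) and take their bounding box.

    `u_axis` and `v_axis` are assumed to be orthonormal vectors in the wall plane.
    Returns (u_min, v_min, u_max, v_max).
    """
    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    us, vs = [], []
    for p in points:
        rel = (p[0] - origin[0], p[1] - origin[1], p[2] - origin[2])
        us.append(dot(rel, u_axis))
        vs.append(dot(rel, v_axis))
    return (min(us), min(vs), max(us), max(vs))


# Floor polygon heights from a three-level building (metres, made-up values).
levels = cluster_floor_heights([0.0, 0.05, 2.9, 3.0, 6.1], gap=1.0)

# Window-class points near the wall x = 0, expressed in wall coordinates (y, z).
window_points = [(0.02, 1.0, 1.0), (-0.01, 2.0, 1.2), (0.0, 1.5, 2.0)]
rect = fit_wall_rectangle(window_points, (0, 0, 0), (0, 1, 0), (0, 0, 1))
```

In practice the window points would first be clustered so that each rectangle covers a single window; the sketch assumes that has already happened.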
4. Benchmark Metrics and Evaluation Protocols
HouseLayout3D evaluation is multi-faceted (Bieri et al., 2 Dec 2025):
- Corner Localization Error: For rectangular elements (doors, windows), maximum corner error under optimal matching; F1-score at given thresholds.
- Polygon Hausdorff Distance: For arbitrary surfaces (walls, floors, ceilings), computed as the maximum directed nearest-neighbor distance over polygon vertices; reported as F1 at a distance threshold.
- 3D Intersection-over-Union (IoU): Volume intersection/union for entire room polyhedra.
- Depth Consistency: Proportion of view rays for which the difference between predicted and ground-truth depth is under a specified threshold.
- Vertex count and qualitative floorplan comparison: To capture the compactness and accuracy of layout recovery.
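Some of the metrics listed above can be sketched in a few lines. The code below is a hedged illustration, not the benchmark's evaluation script. It assumes a symmetric Hausdorff distance taken as the maximum of the two directed distances, a greedy one-to-one matching for F1, and an absolute depth difference for depth consistency. All function names are hypothetical.

```python
import math


def directed_hausdorff(src, dst):
    """Largest nearest-neighbor distance from any vertex in src to dst."""
    return max(min(math.dist(p, q) for q in dst) for p in src)


def hausdorff(a, b):
    """Symmetric variant: the larger of the two directed distances (assumption)."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def f1_at_threshold(pred, gt, threshold):
    """Greedily match each prediction to its closest unmatched ground truth."""
    unmatched = list(range(len(gt)))
    true_positives = 0
    for p in pred:
        best, best_dist = None, threshold
        for g in unmatched:
            dist = hausdorff(p, gt[g])
            if dist <= best_dist:
                best, best_dist = g, dist
        if best is not None:
            unmatched.remove(best)
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gt)
    return 2 * precision * recall / (precision + recall)


def depth_consistency(pred_depths, gt_depths, tolerance):
    """Fraction of rays whose absolute depth error is below the tolerance."""
    hits = sum(1 for p, g in zip(pred_depths, gt_depths) if abs(p - g) < tolerance)
    return hits / len(gt_depths)


# Toy data: one ground-truth square, one close prediction, one far prediction.
gt_square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
near = [(x, y, z + 0.1) for x, y, z in gt_square]
far = [(x + 5, y, z) for x, y, z in gt_square]

score = f1_at_threshold([near, far], [gt_square], threshold=0.2)
ratio = depth_consistency([1.0, 2.0, 3.0, 4.0], [1.05, 2.5, 3.0, 4.02], 0.1)
```

Greedy matching is used here for brevity; an optimal assignment (for example, the Hungarian algorithm) could give different counts when several predictions compete for the same ground-truth element.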
5. Empirical Results and Comparative Analysis
On the reported metrics, MultiFloor3D outperforms the learned, single-floor baselines RoomFormer and SceneScript (Bieri et al., 2 Dec 2025):
- Structure F1 at the evaluated threshold: MultiFloor3D 0.40, versus RoomFormer 0.24 and SceneScript 0.28.
- Doors F1 at the evaluated threshold: MultiFloor3D 0.55, versus 0.23 for each of the baselines.
- Windows and stairs F1 at the evaluated threshold: MultiFloor3D 0.43 (windows) and 0.42 (stairs); the baselines fail at multi-floor stair detection.
- Depth consistency: At a strict tolerance, MultiFloor3D achieves 61.1% on HouseLayout3D, outperforming RoomFormer (24.9%) and SceneScript (22.5%).
- Qualitative success: Recovery of complex, non-rectangular rooms, sloped ceilings, staircase connectivity, and robust detection of windows and doors even with partial image coverage.
Ablation studies show performance drops without prototype fitting or room segmentation, confirming the necessity of each stage in the pipeline.
6. Limitations, Failure Cases, and Future Directions
MultiFloor3D, despite not using ground-truth masks, demonstrates state-of-the-art building-scale recovery, but several open challenges are highlighted (Bieri et al., 2 Dec 2025):
- Limitations:
  - Runtime: 1–2 hours per scene on a high-end GPU, due to global optimization and mesh processing.
  - Occasional incorporation of outdoor geometry seen through windows.
  - Under-segmentation of complex staircases, or incomplete wall recovery, in very noisy or sparse meshes.
- Importance of global context: Stair linkages and wall/ceiling extrapolation can only be robustly inferred with global, multi-level spatial reasoning.
- Research opportunities:
  - End-to-end deep models that process entire buildings and reason jointly over multi-floor structures.
  - Hybrid approaches combining feed-forward predictions with joint global optimization.
  - Synthetic augmentation to increase the diversity and size of pretraining data.
  - Vision–LLM integration for high-level semantic reasoning over recovered scene graphs, to support navigation and planning.
This suggests future HouseLayout3D systems will likely require joint modeling of structural, semantic, and relational cues at building scale, with dense annotation and physically grounded evaluation.
7. Broader Impact and Role in the 3D Scene Understanding Ecosystem
HouseLayout3D provides the first comprehensive real-world benchmark enabling the rigorous evaluation of algorithms aiming at full-building 3D understanding. It supports the development of:
- Navigation and localization systems that require a building-scale, semantically segmented, metrically accurate scene representation.
- Architectural/design automation, providing ground-truth for layout generation tasks.
- Embodied intelligence: high-fidelity environments for AI agents to perform perception, planning, and interaction tasks in realistic homes.
- Benchmarking protocols for the new class of LLM-driven and hybrid vision-language 3D reasoning models emerging in the field.
By enabling holistic evaluation, HouseLayout3D supports reproducible, comparative progress across diverse approaches—data-driven, optimization-based, and vision-language hybrid—shaping the next generation of house-scale indoor scene understanding and generative frameworks (Bieri et al., 2 Dec 2025).