MultiFloor3D: Training-Free 3D Layout Estimator

Updated 9 December 2025

MultiFloor3D is a training-free computational framework for holistic 3D layout estimation of multi-storey buildings using geometry-driven optimization and semantic segmentation.
It combines off-the-shelf tools such as COLMAP, OneFormer, and DBSCAN to recover CAD-style vectorized layouts, including stairs and room connectivity.
The method outperforms learned baselines on the HouseLayout3D and ScanNet++ benchmarks while preserving global spatial coherence, and it requires no retraining.

MultiFloor3D is a training-free computational framework for holistic 3D layout estimation of multi-storey building environments from unordered RGB or RGB-D imagery. It serves as the principal baseline for the HouseLayout3D benchmark and is the first method to recover global CAD-style vectorized layouts (including true 3D room partitions, doors, windows, floors, ceilings, and multi-level connections such as staircases) over entire building scans in a single optimization pass, without any training or fine-tuning. Its pipeline combines off-the-shelf multi-view reconstruction, deep semantic segmentation, clustering, and combinatorial optimization to enforce topological correctness and a compact geometric representation (Bieri et al., 2 Dec 2025).

1. Motivation and Distinction from Prior Models

Conventional 3D layout models (e.g., SceneCAD, SceneScript, RoomFormer) are trained primarily on synthetic single-room or single-floor datasets. To apply them to multi-floor scans, practitioners must split the input manually by floor or room, run neural prediction on each segment, and then attempt to merge the predictions. This workflow loses spatial context, introduces discontinuities at staircases, and compounds errors in room and object connectivity. MultiFloor3D removes these constraints by ingesting all 2D views at once and reconstructing a global, multi-floor vectorized layout via geometry-driven optimization, preserving staircase logic and inter-floor spatial coherence (Bieri et al., 2 Dec 2025).

2. Sequential Pipeline Architecture and Algorithms

MultiFloor3D executes in four major stages:

3D Mesh Reconstruction: Camera poses are estimated with COLMAP. Posed RGB images are fused using DN-Splatter and Metric3D to obtain a dense multi-view 3D Gaussian splatting model. Poisson surface reconstruction yields a watertight mesh; per-image depth maps are rendered for subsequent “free-space” validation.
Layout Skeleton Extraction: Each RGB view is semantically segmented by a pre-trained OneFormer model (no retraining). The labeled points are back-projected onto the mesh and grouped into superpoints (following Robert et al.). Structural points (walls, floors, ceilings, large furniture) are preserved, while unreliable classes (windows, mirrors) and small objects are removed or used later to infer missing floor regions. The result is a raw skeletal mesh of labeled structural geometry.
Fitting a Layout Prototype: Structural superpoints are fitted to planar polygons (via PCA/RANSAC), forming an initial set $\mathcal{P}$ of polygons. Optimization minimizes a composite objective:

$L = L_\mathrm{geom} + L_\mathrm{connect} + L_\mathrm{simple}$

$L_\mathrm{geom}=L_\mathrm{prox}+L_\mathrm{empty}$.
$L_\mathrm{prox}=\sum_{v\in V_\mathrm{skel}}\min_{P\in\mathcal{P}} D_\mathrm{pp}(P,v)$.
$L_\mathrm{empty}$ penalizes polygons intersecting empty free-space rays.
$L_\mathrm{connect}$ enforces proximity of adjacent polygons' vertices.
$L_\mathrm{simple}$ penalizes dangling or unshared boundary edges.
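The proximity term can be illustrated with a minimal sketch. It assumes, purely for illustration, that each polygon is represented only by its supporting plane (a point and a normal), so the point-to-plane distance stands in for $D_\mathrm{pp}$. This ignores polygon extent and therefore underestimates the true distance near polygon borders. The function names and the toy scene are hypothetical and not taken from the authors' code.

```python
import math

def point_plane_distance(point, plane_point, plane_normal):
    """Unsigned distance from `point` to the plane through `plane_point`
    with (not necessarily unit-length) normal `plane_normal`."""
    norm = math.sqrt(sum(c * c for c in plane_normal))
    offset = [p - q for p, q in zip(point, plane_point)]
    return abs(sum(o * n for o, n in zip(offset, plane_normal))) / norm

def proximity_loss(skeleton_vertices, planes):
    """L_prox stand-in: for every skeleton vertex, take the distance to the
    nearest plane, then sum over all vertices."""
    return sum(
        min(point_plane_distance(v, p0, n) for p0, n in planes)
        for v in skeleton_vertices
    )

# Toy scene (assumed): a floor at z = 0 and a wall at x = 0.
planes = [
    ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0)),  # floor
    ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),  # wall
]
skeleton = [(2.0, 1.0, 0.1), (0.2, 3.0, 1.5), (1.0, 1.0, 0.5)]
loss = proximity_loss(skeleton, planes)
```

In this toy scene the three vertices contribute 0.1, 0.2, and 0.5 respectively, so the loss is about 0.8. The remaining terms ($L_\mathrm{empty}$, $L_\mathrm{connect}$, $L_\mathrm{simple}$) would be added to this value to form $L$.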

The positions and parameters of all polygons are refined via gradient descent. Every $K$ iterations, vertex merging (distance threshold $\tau_\mathrm{merge}$), Ramer–Douglas–Peucker (RDP) simplification, and merging of adjacent co-planar polygons are performed. Holes in floor or wall surfaces are closed by back-projecting discarded small-object triangles and extending boundary gaps.
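The two simplification steps named above can be sketched in 2D as follows. The RDP routine is the standard recursive algorithm. The vertex merge is a deliberately simple greedy variant (each vertex is dropped if an already kept vertex lies within `tau_merge`), which is an assumption of this sketch and not necessarily the merge rule used by the authors. The parameter names `epsilon` and `tau_merge` and the sample outline are illustrative.

```python
import math

def _point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b (all 2D tuples)."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Clamp the projection parameter so the closest point stays on the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker simplification of an open polyline."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], first, last)
        if d > max_dist:
            index, max_dist = i, d
    if max_dist <= epsilon:
        return [first, last]
    # Keep the farthest point and simplify both halves recursively.
    left = rdp(points[: index + 1], epsilon)
    right = rdp(points[index:], epsilon)
    return left[:-1] + right

def merge_close_vertices(points, tau_merge):
    """Greedy merge: drop a vertex if a kept vertex is within tau_merge."""
    kept = []
    for p in points:
        if not any(math.dist(p, q) <= tau_merge for q in kept):
            kept.append(p)
    return kept

# A noisy L-shaped wall outline (assumed data).
outline = [(0.0, 0.0), (1.0, 0.02), (2.0, -0.01), (3.0, 0.0), (3.0, 1.5), (3.0, 3.0)]
simplified = rdp(outline, epsilon=0.05)
merged = merge_close_vertices([(0.0, 0.0), (0.01, 0.0), (1.0, 0.0)], tau_merge=0.05)
```

Here the six-vertex outline collapses to its three corner vertices, and the two near-coincident vertices merge into one.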

Scene Graph Parsing and CAD Layout: Floors are identified by vertical clustering of polygons. For each floor, 2D outlines are produced and HOV-SG is used to segment the plan into rooms. Room connections (doors, stairs) are inferred from wall and stair mesh instances; staircases are clustered and automatically linked to the floors they connect. Each room’s 2D polygon is triangulated, and triangle centers are vertically assigned to the corresponding ceiling planes, enabling extrusion to 3D with consistent wall and ceiling heights. Windows are back-projected from the semantic segmentation, clustered (DBSCAN), and fitted with axis-aligned rectangles on wall polygons.
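The floor-identification step can be illustrated with a one-dimensional clustering of floor-polygon heights. The source does not specify the clustering method, so this sketch assumes a simple gap-based rule: heights are sorted, and a new floor level starts whenever the jump from the previous height exceeds a hypothetical `gap` threshold (in metres).

```python
def cluster_floor_heights(heights, gap=1.0):
    """Group heights into floor levels and return the mean height per level.

    A new level starts whenever the difference to the previous (sorted)
    height exceeds `gap`. Both the rule and the default are assumptions.
    """
    clusters = []
    for h in sorted(heights):
        if clusters and h - clusters[-1][-1] <= gap:
            clusters[-1].append(h)
        else:
            clusters.append([h])
    return [sum(c) / len(c) for c in clusters]

# Heights of horizontal floor polygons from a three-storey scan (assumed data).
polygon_heights = [0.02, -0.01, 0.0, 2.98, 3.01, 6.0]
levels = cluster_floor_heights(polygon_heights)
```

For this input the sketch finds three levels, near 0 m, 3 m, and 6 m. Each polygon can then be assigned to its nearest level before the per-floor 2D outlines are extracted.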

3. Evaluation, Benchmarks, and Comparative Performance

MultiFloor3D was evaluated on the HouseLayout3D benchmark (16 buildings, 33 floors, 317 rooms, 34 staircases) using a thresholded F1 score per entity class, depth accuracy ($\Delta_5$ and $\Delta_{10}$, in %), and compactness (vertex count):

Method	Structures F1	Doors F1	Windows F1	Stairs F1	$\Delta_5$ (%)	$\Delta_{10}$ (%)	#Vertices
RoomFormer	0.24	0.23	0.07	–	24.9	32.9	765
SceneScript	0.28	0.23	0.16	–	22.5	33.8	677
MultiFloor3D	0.40	0.55	0.43	0.42	61.1	76.3	1957

MultiFloor3D outperforms per-floor deep models in both geometric and semantic F1 scores and is the only method to natively reconstruct stairs and complete room connectivity (Bieri et al., 2 Dec 2025).
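For readers unfamiliar with the metric, the standard F1 definition (the harmonic mean of precision and recall) can be written as a short sketch. How predicted and ground-truth entities are matched, and at which threshold, is specific to the benchmark and is not reproduced here; the counts below are made-up inputs.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 6 matched doors, 2 spurious, 4 missed.
example = f1_score(true_positives=6, false_positives=2, false_negatives=4)
```

With these counts, precision is 0.75 and recall is 0.6, giving an F1 of roughly 0.67.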

On the ScanNet++ benchmark (50 scenes), MultiFloor3D also achieved the highest depth consistency ($\Delta_5$ = 67.8%, $\Delta_{10}$ = 84.7%) and competitive compactness. Ablation studies indicate that each pipeline component matters: omitting prototype fitting or room segmentation lowers mean F1 to 0.214 and 0.359, respectively, versus 0.381 for the full pipeline.
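To put the ablation numbers in perspective, the relative drop in mean F1 can be computed directly from the three reported values. The snippet below is plain arithmetic on the figures quoted above; the dictionary keys are descriptive labels chosen for this example.

```python
full_pipeline_f1 = 0.381
ablation_f1 = {
    "without prototype fitting": 0.214,
    "without room segmentation": 0.359,
}

# Relative drop with respect to the full pipeline, as a fraction.
relative_drop = {
    name: (full_pipeline_f1 - score) / full_pipeline_f1
    for name, score in ablation_f1.items()
}

for name, drop in relative_drop.items():
    print(f"{name}: {drop * 100:.1f}% lower mean F1")
```

Removing prototype fitting costs about 43.8% of the mean F1, while removing room segmentation costs about 5.8%, which suggests that the polygon-fitting stage carries most of the pipeline's accuracy.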

4. Integration with Off-the-Shelf Scene Understanding and Optimization Components

MultiFloor3D deliberately forgoes any train-time domain adaptation or fine-tuning, instead combining a pipeline of existing components:

3D geometry: COLMAP (SfM), DN-Splatter (RGB-D fusion), Poisson reconstruction.
Segmentation: OneFormer for dense class prediction.
Superpoint clustering: Robert et al. for mesh refinement.
2D/3D parsing: HOV-SG for floorplan/room segmentation, DBSCAN for window extraction.
Polygonal simplification: RDP and custom merge heuristics.

No pipeline component is tuned for dataset-specific idiosyncrasies, avoiding overfitting or domain gap phenomena typical in synthetic-to-real transfer scenarios. All loss terms, clusterings, and merging policies are grounded in geometric and topological constraints.

5. Advantages, Limitations, and Failure Modes

Key advantages:

No training required: Entire pipeline functions without retraining or synthetic-to-real adaptation.
Global multi-storey reasoning: Processes complete buildings, preserving stair and inter-floor connections.
Explicit vectorized output: CAD-native representation of building shells and scene graph for robotics, planning, or design tasks.
Benchmark leadership: Surpasses deep feed-forward architectures on the HouseLayout3D and ScanNet++ metrics, including new capabilities for stair recovery (Bieri et al., 2 Dec 2025).

Noted limitations:

Computational cost: Inference time is 1–2 hours per scene on an RTX 4090, compared to 1–2 minutes for feed-forward baselines.
Reconstruction/segmentation noise: Artifacts (e.g., outdoor vegetation leaking through glass walls) may induce spurious thin polygons or floorplan holes.
Polygonal fitting fragility: Polygonal prototype optimization can fail under extreme mesh degeneracy or poor segmentation quality.

6. Practical Impact and Prospective Extensions

MultiFloor3D establishes a technical reference for future holistic, training-free 3D building modeling, with direct application to downstream robotics, architecture, navigation, and semantic scene parsing. Its explicit scene graph output is compatible with advanced planning frameworks and enables programmatic access to rooms, doors, stairs, and other building entities.
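As an illustration of what programmatic access to such a scene graph could look like, the sketch below models rooms as nodes and doors or stairs as labeled edges, then finds a route between two rooms with breadth-first search. The `SceneGraph` class, its methods, and the example building are hypothetical; the actual output format of MultiFloor3D is not described in this article.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Toy building graph: rooms are nodes, doors and stairs are edges."""
    floor_of: dict = field(default_factory=dict)   # room name -> floor index
    adjacency: dict = field(default_factory=dict)  # room name -> [(neighbor, kind)]

    def add_room(self, name, floor):
        self.floor_of[name] = floor
        self.adjacency.setdefault(name, [])

    def connect(self, a, b, kind):
        """Add an undirected connection of the given kind ('door' or 'stairs')."""
        self.adjacency[a].append((b, kind))
        self.adjacency[b].append((a, kind))

    def route(self, start, goal):
        """Breadth-first search; returns [(kind, room), ...] or None."""
        parent = {start: None}
        queue = deque([start])
        while queue:
            room = queue.popleft()
            if room == goal:
                break
            for neighbor, kind in self.adjacency[room]:
                if neighbor not in parent:
                    parent[neighbor] = (room, kind)
                    queue.append(neighbor)
        if goal not in parent:
            return None
        path, node = [], goal
        while parent[node] is not None:
            previous, kind = parent[node]
            path.append((kind, node))
            node = previous
        path.reverse()
        return path

graph = SceneGraph()
for name, floor in [("kitchen", 0), ("hall", 0), ("landing", 1), ("bedroom", 1)]:
    graph.add_room(name, floor)
graph.connect("kitchen", "hall", "door")
graph.connect("hall", "landing", "stairs")
graph.connect("landing", "bedroom", "door")

steps = graph.route("kitchen", "bedroom")
```

The resulting route passes through a door to the hall, the stairs to the landing, and a door to the bedroom. A planner could use the edge labels to distinguish same-floor transitions from floor changes.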

Potential directions include the development of end-to-end neural models that preserve global context, acceleration of prototype layout optimization (e.g., with learnable solvers), and extension to non-orthogonal, curved, or non-Manhattan geometries. The explicit global structure and CAD output suit emerging applications in high-level semantic navigation (cf. MuNES, which addresses multifloor mapping and planning using barometric and LiDAR cues (Jung et al., 2024)) and architectural generative modeling (cf. BuildingBRep-11K for precise synthetic solids (Guo et al., 3 Jun 2025)).

7. Relationship to Contemporary Benchmarks and Datasets

MultiFloor3D is designed to be benchmarked directly on HouseLayout3D, which focuses on faithful representation of multi-floor, staircase-rich indoor scenes. It also aligns structurally with datasets such as BuildingBRep-11K, which provides manifold multi-storey B-Rep solids annotated for deep geometric reasoning, albeit from synthetic sources (Guo et al., 3 Jun 2025). The emergence of HouseLayout3D and MultiFloor3D signals a transition toward holistic, building-scale estimation in 3D scene understanding, emphasizing semantic connectivity, vectorization, and the recovery of real-world interior complexity (Bieri et al., 2 Dec 2025).