
RoomFormer: Transformer for Floorplan Reconstruction

Updated 30 October 2025
  • RoomFormer is a transformer-based model that directly reconstructs 2D floorplans from 3D scans using a novel two-level query mechanism.
  • Its architecture leverages room-level and corner-level queries to enable holistic, parallel prediction of variable-size room polygons, reducing errors from multi-stage pipelines.
  • RoomFormer achieves state-of-the-art geometric and semantic floorplan recovery with fast inference (0.01s per scene) and robust cross-dataset generalization.

RoomFormer is a transformer-based architecture specifically designed for 2D floorplan reconstruction from 3D indoor scans or point clouds. This model introduces a two-level query mechanism within a single-stage feed-forward Transformer, enabling holistic, parallel prediction of variable-size polygon sets where each polygon represents a room and is encoded as a variable-length sequence of ordered vertices. RoomFormer demonstrates state-of-the-art performance in both geometric and semantic floorplan recovery across challenging benchmarks such as Structured3D and SceneCAD, with substantial improvements in speed and robustness over prior approaches.

1. Problem Formulation and Context

RoomFormer reformulates the floorplan reconstruction problem as structured prediction: for any given scan-derived 2D density map, the objective is to simultaneously generate a set of polygons (each one representing a room), where every polygon is a sequence of ordered vertices. This paradigm diverges from multi-stage, heuristic pipelines that sequentially predict corners, edges, and rooms; instead, RoomFormer aims for direct, end-to-end, variable-size polygon sequence generation. All polygon predictions share the same network and do not require explicit room segmentation or edge/corner detection.

A plausible implication is that this approach avoids the cascading errors typical of multi-stage systems and enables the network to exploit global geometric reasoning over the whole floorplan, including complex room relationships and topology.
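To make the target representation concrete, here is a minimal sketch of a floorplan as a variable-size polygon set; the coordinates and room shapes are illustrative, not taken from the paper:

```python
import numpy as np

# Illustrative floorplan: a variable-size set of rooms, each an ordered
# sequence of 2D corner vertices. Coordinates are made up for the sketch.
floorplan = [
    np.array([[0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 3.0]]),              # 4-corner room
    np.array([[4.0, 0.0], [7.0, 0.0], [7.0, 5.0], [4.0, 5.0], [4.0, 3.0]]),  # 5-corner room
]

# Both the number of rooms and the corner count per room vary freely,
# which is exactly what the structured-prediction formulation must model.
num_rooms = len(floorplan)                      # 2
corners_per_room = [len(p) for p in floorplan]  # [4, 5]
```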

2. Two-Level Query Transformer Architecture

Central to RoomFormer is the use of a two-level query matrix within the Transformer decoder. Specifically:

  • Room-level queries: a fixed budget of $M$ queries (the maximum number of rooms), one allocated to each candidate room polygon.
  • Corner-level queries: each room polygon comprises up to $N$ corner queries representing its vertices.

Queries are organized as $Q \in \mathbb{R}^{M \times N \times 2}$. Each slot outputs:

  • Vertex validity $c_n^m \in \{0, 1\}$ (corner present vs. padding).
  • Vertex coordinates $p_n^m \in \mathbb{R}^2$.
  • Optionally, semantic room type or architectural element class.

The queries are iteratively refined through the decoder stack, using self-attention and cross-attention. Self-attention is performed across all corners/rooms, capturing both intra-room and inter-room dependencies. Cross-attention selectively pools features from the CNN backbone at predicted vertex locations, exploiting local and global context via deformable attention.

This two-level query design is critical for directly modeling variable-size output sets (rooms and corners) and sets RoomFormer apart from single-level or sequential graph-based approaches. An ablation demonstrates substantial performance degradation when using only a single query level.
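A hedged sketch of how a fixed $M \times N$ query grid still yields variable-size output: threshold the per-slot validity scores and keep only polygons with at least three surviving corners. The threshold value and the mock decoder outputs below are illustrative:

```python
import numpy as np

# Hypothetical decoder outputs for M = 3 room slots, N = 4 corner slots.
rng = np.random.default_rng(0)
p_hat = rng.uniform(size=(3, 4, 2))              # predicted corner coordinates
c_hat = np.array([[0.90, 0.80, 0.95, 0.70],      # room 0: four valid corners
                  [0.85, 0.90, 0.75, 0.10],      # room 1: a triangle (slot 3 is padding)
                  [0.05, 0.10, 0.20, 0.05]])     # room 2: no valid room

def decode(c_hat, p_hat, thresh=0.5):
    """Keep corners whose validity exceeds the threshold; drop rooms with < 3 corners."""
    rooms = []
    for c_m, p_m in zip(c_hat, p_hat):
        keep = c_m > thresh
        if keep.sum() >= 3:        # a polygon needs at least three corners
            rooms.append(p_m[keep])
    return rooms

rooms = decode(c_hat, p_hat)
print([len(r) for r in rooms])  # → [4, 3]
```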

3. Model Pipeline and End-to-End Training

Pipeline outline:

  1. Input: 2D density map (from vertical aggregation of 3D scan/point cloud).
  2. Feature Extraction: CNN backbone yields multi-scale feature maps.
  3. Global Context Encoding: Multi-scale deformable attention Transformer encoder.
  4. Decoder: Two-level queries predict all rooms and their boundary vertices in parallel.
  5. Feed-Forward Head: Predicts vertex validity and coordinates for each corner, optionally room type or architectural element class.
  6. Polygon Matching: Variable-size alignment between predicted and ground-truth polygons via the Hungarian algorithm, allowing for cyclic permutations due to polygonal symmetry.
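Step 1 above can be sketched with a simple 2D histogram over the ground plane, as a stand-in for the paper's projection step; the point cloud, scene extent, and grid resolution are synthetic:

```python
import numpy as np

# Synthetic point cloud over an 8 m x 6 m x 3 m volume (x, y, z).
rng = np.random.default_rng(1)
points = rng.uniform(low=[0, 0, 0], high=[8, 6, 3], size=(10_000, 3))

# Vertical aggregation: count points per ground-plane cell, then normalize.
W = H = 64
density, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                               bins=(W, H), range=[[0, 8], [0, 6]])
density /= density.max()   # a [0, 1] density map for the CNN backbone
```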

Matching Losses:

  • Vertex classification loss:

$$\mathcal{L}_\text{cls}^m = -\frac{1}{N} \sum_{n=1}^N \left[ c_n^m \log \hat{c}_n^{\hat{\sigma}(m)} + (1 - c_n^m) \log\bigl(1 - \hat{c}_n^{\hat{\sigma}(m)}\bigr) \right]$$

  • Coordinate regression loss: minimum $L_1$ distance across all cyclic rotations of the ground-truth vertex order (due to the polygon's cyclic symmetry).
  • Auxiliary rasterization loss: Dice loss between rasterized predicted and ground-truth polygon masks.
  • Total loss: weighted sum over rooms; only valid rooms contribute.
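The classification, coordinate, and rasterization terms above can be sketched as follows. The Hungarian matching $\hat{\sigma}$ is assumed to be already computed, and all shapes and values are illustrative:

```python
import numpy as np

def vertex_cls_loss(c, c_hat, eps=1e-8):
    """Binary cross-entropy over the N corner slots of one matched room."""
    return -np.mean(c * np.log(c_hat + eps) + (1 - c) * np.log(1 - c_hat + eps))

def cyclic_l1_loss(p, p_hat):
    """Minimum summed L1 distance over all cyclic rotations of the GT vertex order."""
    return min(np.abs(np.roll(p, k, axis=0) - p_hat).sum()
               for k in range(len(p)))

def dice_loss(mask_gt, mask_pred, eps=1e-8):
    """Auxiliary rasterization loss between binary room masks."""
    inter = (mask_gt * mask_pred).sum()
    return 1.0 - (2.0 * inter + eps) / (mask_gt.sum() + mask_pred.sum() + eps)

# The same square traversed from a different starting corner incurs zero loss.
p     = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
p_hat = np.array([[1., 1.], [0., 1.], [0., 0.], [1., 0.]])
print(cyclic_l1_loss(p, p_hat))  # → 0.0
```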

Output: a variable-size set of room polygons, each represented as a sequence of ordered 2D corners.

All predictions are produced in a single feed-forward pass, enabling fast inference (0.01s/scene) and robust set/sequence prediction without explicit post-processing or assembly.

4. Semantic Enrichment and Extensibility

RoomFormer is readily adapted to semantic prediction:

  • Room type classification: aggregates corner-level features and passes them through a linear classifier, labeling each room (with invalid room slots assigned the "empty" class).
  • Architectural element prediction: Doors and windows are predicted as degenerate room polygons (2-point sequences) or via a dedicated line decoder operating alongside the main decoder.

This semantic enrichment supports detailed floorplan recovery beyond raw geometry, encompassing types and functional elements, which is further reflected in the model’s strong performance on F1 scores for semantic types and architectural components.
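A hedged sketch of the room-type head described above: mean-pool a room's corner-level decoder features and apply a linear classifier. The feature dimension, class count, and weights are illustrative, not taken from the paper:

```python
import numpy as np

N, D, K = 4, 32, 6                        # corner slots, feature dim, room types (incl. "empty")
rng = np.random.default_rng(2)
corner_feats = rng.normal(size=(N, D))    # decoder features for one room's corner queries
W, b = rng.normal(size=(K, D)), np.zeros(K)

room_feat = corner_feats.mean(axis=0)     # aggregate corner-level features per room
logits = W @ room_feat + b                # linear classifier over room types
room_type = int(np.argmax(logits))        # predicted room-type index
```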

5. Quantitative and Qualitative Evaluation

RoomFormer demonstrates superior performance and generalization on SceneCAD and Structured3D:

| Metric        | RoomFormer | HEAT | Floor-SP | HAWP  | LETR |
|---------------|------------|------|----------|-------|------|
| Room F1       | 97.3       | 95.4 | 94.6     | 95.1  | 94.1 |
| Corner F1     | 87.2       | 82.5 | 83.8     | 81.1  | 82.9 |
| Angle F1      | 81.2       | 78.3 | 79.6     | 77.2  | 78.6 |
| Inference (s) | 0.01       | 0.11 | 113.8    | 790.7 | 70.1 |
| Room IoU      | 91.7       |      |          |       |      |

The model is robust to missing or noisy input (density maps) and outperforms sequential and heuristic approaches in variable room/corner count prediction, topological validity (no overlaps/self-intersections), and semantic richness. Cross-dataset generalization experiments affirm the learned query mechanism and feature encoding as effective across domains.

Qualitatively, RoomFormer produces more faithful, closed polygons and avoids errors from sequential pipelines (error accumulation, missed rooms/corners). Architectural elements are consistently captured.

6. Architectural Comparisons and Implications

RoomFormer is distinguished from prior approaches by:

| Aspect                  | RoomFormer                           | Previous Methods                       |
|-------------------------|--------------------------------------|----------------------------------------|
| Output granularity      | Parallel rooms as polygons/sequences | Sequential corners/edges or room masks |
| Output size variability | Variable, via queries                | Fixed or heuristic                     |
| End-to-end pipeline     | Yes                                  | No (multi-stage)                       |
| Semantic extension      | Yes (room/element types)             | Rare                                   |
| Speed                   | Very fast (0.01 s/scene)             | Slow (0.1–800 s/scene)                 |
| Generalization          | Strong cross-dataset                 | Poorer                                 |

This suggests that the two-level query transformer paradigm is effective for structured multi-polygon prediction, with implications for related domains (e.g., semantic segmentation, structured 2D/3D geometry tasks).

7. Directions, Limitations, and Connections

RoomFormer can readily be extended to richer semantic tasks, including architectural detail recovery and multi-floor modeling. The model's performance is contingent on density map quality and can degrade under extreme missing data or highly non-standard topologies. Recent work such as PolyRoom (Liu et al., 15 Jul 2024) extends the two-level query concept with room-aware query initialization and self-attention, further improving memory efficiency and geometric validity, suggesting an active direction in transformer-based representation for indoor geometry.

RoomFormer is foundational to the current generation of floorplan recovery from raw scans, and its architectural design underlies many contemporary and follow-up models for indoor structure prediction.
