
Structured3D: Photorealistic Indoor Dataset

Updated 15 November 2025
  • Structured3D is a large-scale, photorealistic synthetic dataset featuring detailed 3D indoor scenes with CAD-accurate geometry and semantic annotations.
  • It supports diverse applications such as room layout estimation, floorplan reconstruction, and 3D object detection through precise extraction of planes, lines, and junctions.
  • The dataset enables effective synthetic pretraining and transfer learning, significantly boosting performance on real-world indoor scene understanding benchmarks.

Structured3D is a large-scale, photorealistic synthetic dataset providing richly annotated 3D indoor scenes and multimodal image renderings for the study and benchmarking of structured 3D modeling, layout estimation, floorplan reconstruction, and various indoor scene understanding tasks.

1. Dataset Composition and Generation Pipeline

Structured3D is generated from a core of 3,500 professionally designed house models, encompassing 21,835 distinct rooms. Each house design, crafted for actual production, is stored in an industry-standard CAD format encoding detailed per-object geometry (walls, floors, ceilings), high-resolution textures/materials, and semantic labels covering room types, openings, and furniture.

The annotation pipeline automatically extracts structural primitives:

  • Planes: $P=\{p_i\}$, each a pair $(n \in \mathbb{R}^3,\ d \in \mathbb{R})$ such that $n^\top x + d = 0$.
  • Lines: $L=\{l_j\}$, each the intersection of two planes, parameterized as $l(u) = x_0 + u\,v$.
  • Junctions: $X=\{x_k\}$, the intersection points $l_a \cap l_b$ of line pairs $l_a, l_b$.
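Given these parameterizations, a line can be recovered as the intersection of two planes and a junction as the intersection of two lines. A minimal pure-Python sketch (illustrative, not the dataset's extraction code):

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def plane_intersection(n1, d1, n2, d2):
    """Line (x0, v) where two planes n·x + d = 0 meet, or None if parallel."""
    v = cross(n1, n2)  # line direction is perpendicular to both normals
    norm2 = sum(c * c for c in v)
    if norm2 < 1e-12:
        return None  # parallel (or identical) planes: no unique line
    # Closed form for a point on the line: x0 = ((d2*n1 - d1*n2) x v) / |v|^2
    w = tuple(d2 * a - d1 * b for a, b in zip(n1, n2))
    x0 = tuple(c / norm2 for c in cross(w, v))
    return x0, v
```

For example, the planes $x = 1$ and $y = 2$ intersect in the vertical line through $(1, 2, 0)$ with direction $(0, 0, 1)$.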

Annotation relationship matrices are automatically derived:

  • $W_1 \in \{0,1\}^{M \times N}$ (plane-line incidence), with $[W_1]_{i,j} = 1$ if line $l_j$ lies on plane $p_i$.
  • $W_2 \in \{0,1\}^{N \times K}$ (line-junction incidence), with $[W_2]_{j,k} = 1$ if junction $x_k$ lies on line $l_j$.
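The incidence entries can be derived by testing membership up to a numerical tolerance; a hypothetical sketch, assuming planes as $(n, d)$ pairs and lines as $(x_0, v)$ pairs:

```python
TOL = 1e-6

def on_plane(plane, point):
    """True if n·point + d = 0 up to tolerance."""
    n, d = plane
    return abs(sum(ni * xi for ni, xi in zip(n, point)) + d) < TOL

def on_line(line, point):
    """True if (point - x0) is parallel to v, i.e. their cross product vanishes."""
    x0, v = line
    w = tuple(p - a for p, a in zip(point, x0))
    c = (w[1] * v[2] - w[2] * v[1],
         w[2] * v[0] - w[0] * v[2],
         w[0] * v[1] - w[1] * v[0])
    return all(abs(ci) < TOL for ci in c)

def incidence(planes, lines, junctions):
    """Build W1 (plane-line) and W2 (line-junction) 0/1 incidence matrices."""
    W1 = []
    for p in planes:
        row = []
        for x0, v in lines:
            x1 = tuple(a + b for a, b in zip(x0, v))  # second sample on the line
            row.append(int(on_plane(p, x0) and on_plane(p, x1)))
        W1.append(row)
    W2 = [[int(on_line(l, x)) for x in junctions] for l in lines]
    return W1, W2
```

A line lies on a plane iff two distinct samples of it do, which is why two points per line suffice here.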

Higher-level structures are also derived, including cuboids (with $D_{2h}$ dihedral symmetry), Manhattan worlds ($n_1, n_2, n_3$ mutually orthogonal), and semantic groupings.
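The Manhattan-world condition reduces to pairwise orthogonality of the three axis normals; a quick check (a sketch, not dataset tooling):

```python
def is_manhattan(axes, tol=1e-3):
    """True if the three axis normals are mutually orthogonal (unit vectors assumed)."""
    for i in range(3):
        for j in range(i + 1, 3):
            dot = sum(a * b for a, b in zip(axes[i], axes[j]))
            if abs(dot) > tol:  # n_i^T n_j should be ~0
                return False
    return True
```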

A photorealistic, physically-based rendering engine (Monte-Carlo GI + Embree path tracing) synthesizes data modalities:

  • Panoramic equirectangular RGB images (512 × 1024 px) and perspective renderings (720 × 1280 px)
  • 16-bit ground-truth linear depth maps (meters)
  • Per-pixel semantic segmentation masks (roughly 20–25 classes)
  • Room-instance and object-instance identifiers
  • JSON/protobuf files containing all structural primitives and higher-level groupings

Rooms are rendered in multiple configurations (fully furnished, simply furnished, and empty) under diverse lighting conditions.
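For the equirectangular panoramas, each pixel maps to a viewing ray on the unit sphere. A minimal conversion sketch, assuming a particular longitude/latitude convention (the dataset toolkit may use a different one):

```python
import math

def pixel_to_ray(row, col, height=512, width=1024):
    """Map an equirectangular pixel to a unit viewing ray (x, y, z), z up.

    Assumed convention: columns span longitude [-pi, pi) left to right,
    rows span latitude [pi/2, -pi/2] top to bottom, pixel centers at +0.5.
    """
    lon = (col + 0.5) / width * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (row + 0.5) / height * math.pi
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))
```

The image center maps to the ray $(1, 0, 0)$, and every output is unit length by construction.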

2. Annotation Types, Representations, and Organization

Annotations in Structured3D encode geometric and semantic structure at various levels:

  • Planes: $(n, d)$ parameterization; normals are unit vectors in a right-handed world frame ($\mathbb{R}^3$, $z$ up, meters). Polygonal boundaries result from plane-plane intersections.
  • Lines: Each stored as $(x_0, v)$ or as two 3D endpoints; endpoints correspond to junctions.
  • Junctions: 3D coordinates (float64) from precise CAD intersections; always at line crossings.
  • Cuboids: Group of six orthogonal plane IDs; may be accompanied by the eight 3D corners (plane triple intersections).
  • Manhattan groupings: Planes clustered into three orthogonal axes ($n_i^\top n_j \approx 0$).
  • Semantic groupings: Groupings into “room,” “door,” “window,” “furniture instance,” etc.
  • Precision and Consistency: All annotations are metric, globally registered, and stored with sub-millimeter accuracy (float64) in a single world coordinate system.

File organization is per-room, with all rendered modalities and structured annotations colocated for direct cross-modality alignment.
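A loader that groups the colocated modalities for one room might look like the following; the file names (`annotation_3d.json`, PNG renders) are illustrative assumptions about the release layout, not guaranteed paths:

```python
import json
from pathlib import Path

def load_room(room_dir):
    """Collect colocated modalities for one room (hypothetical file layout).

    Assumes a per-room directory holding rendered PNGs and a structural
    annotation JSON; adapt the names to the actual release.
    """
    room_dir = Path(room_dir)
    record = {
        "room_id": room_dir.name,
        "images": sorted(str(p) for p in room_dir.glob("*.png")),
    }
    ann = room_dir / "annotation_3d.json"
    if ann.exists():
        record["annotation"] = json.loads(ann.read_text())
    return record
```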

3. Benchmarks and Downstream Tasks

Structured3D serves as the foundation for benchmarking a spectrum of structured prediction tasks in indoor scene understanding.

Key tasks and example models:

  • Room Layout Estimation:
    • Input: Single panoramic RGB
    • Output: 3D room boundary (walls, floor, ceiling; cuboid/Manhattan assumptions)
    • Example models: LayoutNet (corner-heatmap and wall-boundary CNN with post-hoc fitting); HorizonNet (three 1D scanline predictions for floor/ceiling/wall existence followed by 3D reconstruction).
  • Floorplan Polygon Reconstruction:
    • Input: Multiview RGB-D, fused to density maps or point-cloud
    • Output: Set of closed 2D polygons, each corresponding to a room
    • Example models: RoomFormer (transformer with global context), RoIPoly (RoI-based transformer with vertex and logit queries; see also HEAT, LETR).

Training Protocols:

  • Model regimes include using only Structured3D, only real-world datasets (PanoContext, 2D-3D-S, etc.), joint balanced training, or pretraining/fine-tuning from synthetic to real.

Evaluation Metrics:

  • 3D IoU (predict vs. GT volumetric overlap)
  • Corner Error (normalized $\ell_2$ distance of projected corners)
  • Pixel-wise error (face mask accuracy)
  • Polygon-based room/corner/angle precision, recall, F1:

$$\text{Precision} = \frac{\text{TP}}{\text{TP}+\text{FP}}, \qquad \text{Recall} = \frac{\text{TP}}{\text{TP}+\text{FN}}$$

Room matching is IoU-based, corner evaluation is proximity-based, and edge accuracy uses angular thresholds.
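The proximity-based corner evaluation can be approximated by greedy one-to-one matching under a distance threshold; a simplified sketch (the benchmark's own matcher may differ):

```python
import math

def corner_prf(pred, gt, thresh=10.0):
    """Precision/recall/F1 for predicted corners via greedy one-to-one
    matching: each prediction claims the nearest unmatched GT corner
    within `thresh` (same units as the coordinates)."""
    unmatched = list(range(len(gt)))
    tp = 0
    for p in pred:
        best, best_d = None, thresh
        for idx in unmatched:
            d = math.dist(p, gt[idx])
            if d <= best_d:
                best, best_d = idx, d
        if best is not None:
            unmatched.remove(best)  # each GT corner matches at most once
            tp += 1
    fp = len(pred) - tp
    fn = len(gt) - tp
    prec = tp / (tp + fp) if pred else 0.0
    rec = tp / (tp + fn) if gt else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```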

Impact in benchmarks:

  • Augmenting real data with Structured3D improves cuboid IoU by +1.5–2 pp; pretraining then fine-tuning boosts by 3–8 pp. Domain adaptation (adversarial or otherwise) further improves transfer.

4. Floorplan Vectorization and Polygonal Modeling

Structured3D has become the go-to testbed for polygonal floorplan extraction, including recent innovations:

  • RoIPoly (2024) (Jiao et al., 20 Jul 2024): Employs a ResNet-50+FPN for initial feature extraction, a Sparse R-CNN-style region proposal head for axis-aligned rooms, and a transformer with explicit per-polygon queries for vertices and logit embeddings for validity. Hungarian matching aligns predictions to GT polygons.
    • Losses: vertex regression (L1) and focal classification for valid/invalid vertex detection.
    • Training uses 3,000/250/250 train/val/test splits after filtering.
    • Metrics: room-, corner-, and angle-level F1; RoIPoly achieves 95.5/83.6/78.3% (room/corner/angle F1) on Structured3D, second only to RoomFormer.
    • Ablations reveal difficulty with non-rectilinear/narrow rooms due to RoI constraints.
  • MonteFloor (MCTS Polygon Refinement, 2022) (Stekovic et al., 2022):
    • Initial pool of room-polygons generated via Mask-R-CNN.
    • Proposal selection via MCTS, refined according to learned IoU metric, angle prior regularization (soft non-Manhattan), total-variation for gaps/overlaps, and closeness to Mask-R-CNN proposals.
    • A differentiable module based on soft winding numbers enables practical refinement in under 100 ms per polygon.
    • Achieves room/corner/angle F1 up to 0.96/0.89/0.86, outperforming baselines and demonstrating a ∼7–10% relative improvement over prior state-of-the-art.
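The vertex-regression (L1) and focal validity-classification losses common to these polygon heads can be sketched in pure Python (illustrative, not the authors' implementation):

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one vertex-validity probability p, label y in {0, 1}.

    Down-weights easy examples via the (1 - p_t)^gamma modulating factor."""
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))

def l1_vertex_loss(pred, gt):
    """Mean L1 distance between matched predicted and GT 2D vertices."""
    total = sum(abs(px - gx) + abs(py - gy)
                for (px, py), (gx, gy) in zip(pred, gt))
    return total / max(len(gt), 1)
```

A confident correct prediction incurs far less focal loss than an uncertain one, which is the point of the modulating factor.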

A key theme is that Structured3D’s diversity (convex/non-convex, variety of corners/symmetries) necessitates methods beyond hard Manhattan/world or low-vertex-count heuristics.

5. Synthetic Pretraining and Transfer to Real-World Tasks

Structured3D’s scale and detail make it effective for pretraining large models for real-scene transfer.

  • Swin3D (Sparse 3D Swin Transformer, 2023) (Yang et al., 2023):
    • Trains on ∼21 million points from 21,835 rooms, using a consolidated 21-class semantic vocabulary (four rare classes dropped from the original 25).
    • 1 cm$^3$ voxel downsampling and CAD-accurate labels support efficient learning.
    • Empirical transfer: pretraining on Structured3D yields +1.8 mIoU on ScanNet val semantic segmentation, +2.3 mIoU on S3DIS Area 5, and up to +8.1 mAP for 3D object detection over from-scratch or other unsupervised 3D methods.
    • Advantages include photorealistic noise-free input, wide variety of configurations, detailed semantic maps, and generality for large transformers.
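Voxel downsampling at a 1 cm grid can be sketched by keeping one point per occupied cell; an illustrative pure-Python version (production pipelines typically average the points in each cell instead):

```python
def voxel_downsample(points, voxel=0.01):
    """Keep the first point falling into each voxel-grid cell.

    Coordinates are in meters, so voxel=0.01 gives a 1 cm grid."""
    seen = {}
    for p in points:
        key = tuple(int(c // voxel) for c in p)  # floor-divide into grid cells
        if key not in seen:
            seen[key] = p
    return list(seen.values())
```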

This supports the use of Structured3D-derived models in downstream deployments, with synthetic-to-real adaptation becoming increasingly mainstream.

6. Extended Applications: Inpainting, Generative Modeling, Diminished Reality

Structured3D supports an array of high-level generative and editing tasks:

  • Diminished Reality (e.g., PanoDR (Gkitsas et al., 2021)): Inpainting by structural guidance, where models are trained on masked panoramas synthesized from Structured3D scenes to reconstruct empty or counterfactual scene variants. The accuracy of structure preservation is evaluated using specialized splits and standard metrics (e.g., PSNR, SSIM).
  • Fourier-enhanced Inpainting (Windowed-FourierMixer (Henriques et al., 28 Feb 2024)): Room inpainting and de-occlusion using a U-Former backbone with windowed 1D height/width FFT mixers, evaluated on Structured3D. Circular horizontal padding respects the 360° panorama topology. Quantitative results demonstrate superior MAE (0.0089), PSNR (31.79 dB), and SSIM (0.9414) over layout-guided and FFT-only methods; ablations confirm the importance of local windowing and the choice of perceptual loss.
  • Text-to-3D Generation (Ctrl-Room (Fang et al., 2023)): Scene code diffusion models trained on filtered Structured3D rooms (bedrooms/living rooms only; 4,961/3,039) produce editable 3D arrangements from text, with ControlNet-enhanced stable diffusion providing plausible panorama synthesis. Evaluation uses FID, CLIP Score, IS, and human study; quality depends on leveraging the dense semantic and geometric annotations in Structured3D.

No direct ablation of Structured3D complexity is performed in Ctrl-Room, but evidence suggests the dataset's diversity supports global relation learning (orthogonality, spatial priors), avoiding narrow local-only artifacts.
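Circular horizontal padding, used to respect the 360° wrap-around of equirectangular panoramas, can be illustrated on a row-major image; a minimal sketch:

```python
def circular_pad_width(image, pad):
    """Pad each row by `pad` columns wrapped around the 360° horizontal seam.

    `image` is a list of rows; the left edge is padded with the rightmost
    columns and vice versa, so convolutions see a seamless panorama."""
    return [row[-pad:] + row + row[:pad] for row in image]
```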

7. Impact, Limitations, and Community Usage

Structured3D’s main impact arises from:

  • Scale and diversity (three orders of magnitude more scenes than limited-scan datasets),
  • Dense, multi-resolution ground truth (geometry, semantics, photorealistic imagery),
  • Alignment with real-world priors (layouts, procedural variation, and designer input).

Despite being synthetic, the dataset has catalyzed progress in multi-modal 3D understanding, polygonal floorplan induction, neural generative modeling, and cross-domain pretraining. Its strengths in transfer learning and floorplan parsing have been validated by ablation and generalization studies.

Some limitations include the absence of certain real-scene artifacts and a fixed set of CAD assets/textures. Nevertheless, Structured3D remains a reference dataset—for both benchmarking and large-scale representation learning—in structured 3D geometry, scene synthesis, and indoor spatial reasoning.
