CubiCasa5K Dataset Overview

  • The CubiCasa5K dataset is a large-scale collection of 5,000 floorplan images with dense polygon-based annotations for detailed semantic and geometric parsing.
  • It offers diverse architectural styles with rigorous train/val/test splits, supporting reliable evaluation of multi-task CNN models.
  • The dataset ships with a multi-task CNN baseline whose results improve on prior floorplan-parsing methods in semantic segmentation and structural element extraction.

The CubiCasa5K dataset is a large-scale resource for automatic floorplan image analysis, comprising 5,000 rasterized floorplan images sourced primarily from Finnish real estate marketing material. Each sample includes dense, polygon-based annotations spanning over 80 object categories, enabling both geometric and semantic parsing. The dataset was constructed to address the scarcity of publicly available, representative, and meticulously annotated floorplan image collections, particularly for machine learning and computer vision research on building interiors. It is accompanied by a comprehensive baseline: an improved multi-task convolutional neural network (CNN) for semantic segmentation and structural element extraction (Kalervo et al., 2019).

1. Dataset Structure and Composition

CubiCasa5K offers a diverse, style-aware corpus. From an initial pool of approximately 15,000 images, 5,000 were selected using explicit criteria on clarity, completeness (full single-floor layouts), and the visibility of all key architectural elements (walls, rooms, doors, windows, furniture).

  • Subsets by style:
    • High-quality architectural: 3,732 images
    • High-quality black-and-white CAD-style: 992 images
    • Colorful, hand-drawn/marketing style: 276 images
  • Train/val/test partitioning:
    • Training: 4,200 images
    • Validation: 400 images
    • Test: 400 images

Each split was sampled to preserve style and size variability. Only legible, fully scanned, single-floor images with all necessary architectural elements were retained.
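As a quick orientation, the sketch below shows one way the splits might be consumed in code. The train.txt/val.txt/test.txt file names and the one-sample-path-per-line layout are assumptions made for illustration, not confirmed details of the release.

```python
from pathlib import Path

def load_split(root: str, split: str) -> list[Path]:
    """Read one split file, assumed to list one sample path per line.

    The train.txt / val.txt / test.txt naming is a guess at the
    release layout, used here only for illustration.
    """
    split_file = Path(root) / f"{split}.txt"
    with split_file.open() as f:
        return [Path(root) / line.strip() for line in f if line.strip()]

# Expected sizes per the paper: 4,200 / 400 / 400 samples.
# train_samples = load_split("cubicasa5k", "train")
```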

2. Annotation Protocol and Object Taxonomy

Every semantically meaningful object is annotated as a polygon in SVG format, under a two-stage quality-assurance protocol: initial self-review by the annotator, followed by independent verification by a QA engineer.

  • Annotation workflow:
    1. Draw wall polygons (distinguishing outer and inner boundaries)
    2. Draw room polygons (covering all enclosed cells)
    3. Place icons and mark opening polygons
  • Category distribution (major classes):
    • Rooms (12 baseline classes): Background, Outdoor, Wall, Kitchen, Living Room, Bedroom, Bath, Hallway, Railing, Storage, Garage, Other Rooms
    • Icon/opening (11 baseline classes): Window, Door, Closet, Electrical Appliance, Toilet, Sink, Sauna Bench, Fire Place, Bathtub, Chimney, Empty
  • Objects per image (averages across 5,000 samples):
    • Walls: ~29.4
    • Rooms: ~13.8
    • Icons: ~27.3

The overall taxonomy comprises 83 fine-grained classes; the full list is available in the project's repository.
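Because each annotation is an SVG polygon, per-object geometry can be recovered with a standard XML parser. A minimal sketch follows; the assumption that each object is a <polygon> element whose class attribute names the category is illustrative, not the dataset's verified schema.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

def extract_polygons(svg_path: str):
    """Yield (class_label, [(x, y), ...]) pairs from an annotation SVG.

    Assumes each object is a <polygon> element whose 'class' attribute
    names the category -- an illustrative schema, not the verified one.
    """
    root = ET.parse(svg_path).getroot()
    for poly in root.iter(f"{SVG_NS}polygon"):
        label = poly.get("class", "unknown")
        points = [tuple(map(float, pair.split(",")))
                  for pair in poly.get("points", "").split()]
        yield label, points
```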

3. Dataset Statistics and Comparative Overview

CubiCasa5K demonstrates substantial diversity and scale relative to prior datasets. Image widths range from 50 to ~8,000 pixels (median: 1,500 px; modes: ~600 px and 2,000 px). The number of rooms per floorplan peaks at 8–12; walls at 20–30 segments; icons typically 15–30 per image.

Comparative metrics with existing datasets:

| Dataset | #Images | Resolution range (px) | #Object classes | #Rooms |
|---|---|---|---|---|
| R-FP-500 (Dodge et al. ’17) | 500 | 56–1,427 | N/A | N/A |
| CVC-FP (de las Heras et al. ’15) | 122 | 905–7,383 | 50 | 1,320 |
| Liu et al. ’17 | 815 | 96–1,920 | 27 | 7,466 |
| CubiCasa5K | 5,000 | 50–8,000 | 83 | 68,877 |
  • Bedrooms are the most frequent room type (~16% of all room polygons), followed by Kitchens, Living Rooms, and Bathrooms.
  • Doors (~20%), Windows (~18%), Sinks, and Toilets are the most common icons.

4. Mathematical Formalization

Let $I \in \mathbb{R}^{H \times W \times 3}$ denote the input rasterized floorplan image. Ground-truth annotations are given by $G = \{(P_i, c_i)\}_{i=1}^{N}$, where $P_i = (x_{i,1}, \ldots, x_{i,k_i})$ is a $k_i$-vertex polygon and $c_i \in \{1, \ldots, C\}$ is the object label.

The baseline network $f_\theta(I)$ produces:

  • $S_{\text{rooms}} \in \mathbb{R}^{H \times W \times R}$: per-pixel scores for the $R$ room classes
  • $S_{\text{icons}} \in \mathbb{R}^{H \times W \times K}$: per-pixel scores for the $K$ icon classes
  • $\{H_j\}_{j=1}^{M}$: $M$ heatmaps for wall junctions, icon corners, and opening endpoints
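To make the output shapes concrete, here is a toy PyTorch head layout matching the formalization ($R = 12$ room classes, $K = 11$ icon classes, $M = 21$ heatmaps per the baseline). The backbone is stubbed out and the 1×1 convolutions are placeholders, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class FloorplanHeads(nn.Module):
    """Toy output heads matching the formalization: R room-class channels,
    K icon-class channels, and M junction/corner/endpoint heatmaps."""

    def __init__(self, feat_ch: int = 256, R: int = 12, K: int = 11, M: int = 21):
        super().__init__()
        self.rooms = nn.Conv2d(feat_ch, R, kernel_size=1)     # S_rooms
        self.icons = nn.Conv2d(feat_ch, K, kernel_size=1)     # S_icons
        self.heatmaps = nn.Conv2d(feat_ch, M, kernel_size=1)  # {H_j}

    def forward(self, feats: torch.Tensor):
        # feats: (B, feat_ch, H, W) features from some backbone/decoder
        return self.rooms(feats), self.icons(feats), self.heatmaps(feats)

# s_rooms, s_icons, heatmaps = FloorplanHeads()(torch.randn(1, 256, 64, 64))
```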

5. Baseline Multi-Task Convolutional Neural Network

The provided baseline uses a ResNet-152 backbone (ImageNet pretraining followed by MPII pose transfer) and an hourglass-style decoder (10 blocks with skip connections). The architecture emits two semantic-segmentation heads (rooms, icons) and 21 heatmap-regression heads.

  • Training:

    • Loss function follows the multi-task uncertainty weighting of Kendall et al. (2018); a code sketch appears after this list:

      $\mathcal{L}_{\text{tot}} = \mathcal{L}_H + \mathcal{L}_S$

      Heatmap-regression loss with learned uncertainty $\sigma_i$:

      $\mathcal{L}_H = \sum_{i=1}^{M} \left[ \frac{1}{2\sigma_i^2} \left\| H_i^{gt} - H_i^{\theta} \right\|_2^2 + \log(1 + \sigma_i) \right]$

      Cross-entropy segmentation loss per task with uncertainty $\sigma_k$:

      $\mathcal{L}_S = \sum_{k \in \{\text{rooms},\,\text{icons}\}} \left[ \frac{1}{\sigma_k} \Bigl( -\sum_p y_{k,p} \log S_{k,p} \Bigr) + \log \sigma_k \right]$

  • Optimization: Adam (learning rate $10^{-3}$; $\beta_1 = 0.9$, $\beta_2 = 0.999$), batch size 20, up to 400 epochs. Training uses data augmentation (random 90° rotations, color jitter, random crop/scale to 256×256 with zero-padding).
  • Hardware: Single NVIDIA Titan X; full training completes in approximately 3 hours.
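The following is a minimal PyTorch sketch of the uncertainty-weighted total loss exactly as written above. Holding the $\sigma$ values as raw nn.Parameter tensors (with no positivity constraint) is a simplification for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskUncertaintyLoss(nn.Module):
    """Sketch of L_tot = L_H + L_S with learned per-task uncertainties,
    mirroring the formulas above (illustrative, not the authors' code)."""

    def __init__(self, num_heatmaps: int = 21):
        super().__init__()
        # One sigma per heatmap head, one per segmentation task.
        # NOTE: no positivity constraint is enforced here; a real
        # implementation would typically learn log(sigma) instead.
        self.sigma_h = nn.Parameter(torch.ones(num_heatmaps))
        self.sigma_rooms = nn.Parameter(torch.ones(()))
        self.sigma_icons = nn.Parameter(torch.ones(()))

    def forward(self, pred_hm, gt_hm, pred_rooms, gt_rooms, pred_icons, gt_icons):
        # L_H: per-heatmap squared error / (2 sigma_i^2) + log(1 + sigma_i)
        sq_err = ((gt_hm - pred_hm) ** 2).flatten(2).sum(-1).mean(0)  # shape (M,)
        l_h = (sq_err / (2 * self.sigma_h ** 2) + torch.log1p(self.sigma_h)).sum()

        # L_S: per-task cross-entropy / sigma_k + log(sigma_k)
        l_s = (F.cross_entropy(pred_rooms, gt_rooms) / self.sigma_rooms
               + torch.log(self.sigma_rooms)
               + F.cross_entropy(pred_icons, gt_icons) / self.sigma_icons
               + torch.log(self.sigma_icons))

        return l_h + l_s

# Example shapes: pred_hm (B, 21, H, W), pred_rooms (B, 12, H, W),
# gt_rooms (B, H, W) with integer class indices, and likewise for icons.
```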

6. Evaluation Outcomes

6.1. Benchmarking on Liu et al. ’17

The network is benchmarked against Liu et al. ’17, with and without integer-programming (IP) post-processing and test-time augmentation (TTA):

| Method | Junction acc/rec | Opening acc/rec | Icon acc/rec | Room acc/rec |
|---|---|---|---|---|
| Liu et al. ’17 | 70.7 / 95.1 | 67.9 / 91.4 | 22.3 / 77.4 | 80.9 / 78.5 |
| Liu et al. + IP | 94.7 / 91.7 | 91.9 / 90.2 | 84.0 / 74.6 | 84.5 / 88.4 |
| Ours | 82.4 / 92.0 | 82.3 / 93.3 | 34.6 / 88.3 | 90.0 / 87.6 |
| Ours + IP | 94.1 / 89.6 | 93.2 / 92.6 | 92.9 / 87.7 | 91.7 / 90.8 |
| Ours (TTA) + IP | 95.0 / 89.7 | 94.5 / 92.9 | 93.6 / 87.3 | 92.2 / 90.2 |
  • Without post-processing, the baseline outperforms Liu et al. ’17 in accuracy across all four categories (and in recall for all but junctions). Adding IP and TTA pushes accuracy above 92% in every category.

6.2. Semantic Segmentation Results on CubiCasa5K

The primary evaluation treats parsing as a pixel-wise semantic-segmentation task, reporting overall accuracy, mean class accuracy, and mean Intersection-over-Union (IoU):

| Task | Overall acc | Mean acc | Mean IoU |
|---|---|---|---|
| Rooms, val | 84.5% | 72.3% | 61.0% |
| Rooms, test | 82.7% | 69.8% | 57.5% |
| Rooms (Poly), test | 77.3% | 61.6% | 49.3% |
| Icons, val | 97.8% | 62.8% | 56.5% |
| Icons, test | 97.6% | 61.5% | 55.7% |
| Icons (Poly), test | 96.7% | 45.3% | 41.6% |

Converting the raw segmentations into vector polygons ("Poly") lowers every metric, primarily due to errors in junction detection. On the test set, mean room IoU is ~57.5% and mean icon IoU is ~55.7% before polygonization.
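The three reported metrics follow standard segmentation definitions. As a reference point, here is a small sketch of how they can be computed from predicted and ground-truth label maps (a generic formulation, not the authors' evaluation script):

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    """Overall accuracy, mean class accuracy, and mean IoU from
    integer label maps of identical shape."""
    # Confusion matrix: rows = ground truth, columns = prediction.
    cm = np.bincount(gt.ravel() * num_classes + pred.ravel(),
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm).astype(float)
    overall_acc = tp.sum() / cm.sum()
    class_acc = tp / np.maximum(cm.sum(axis=1), 1)  # per-class recall
    iou = tp / np.maximum(cm.sum(axis=1) + cm.sum(axis=0) - tp, 1)
    return overall_acc, class_acc.mean(), iou.mean()
```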

7. Limitations and Prospects

Despite its scale, very rare architectural symbols (frequency < 1%) remain under-represented. Current post-processing relies on heuristics; errors, especially in junction localization, can propagate and undermine polygon extraction. Future work suggested by the authors includes integration of an explicit object-detection head for icons (as in Dodge et al. ’17), exploration of direct polygon regression approaches (cf. Acuna et al. ’18), and extension of both dataset and methodology to multi-floor or 3D structures such as stairs and elevators.

The data, SVG annotations, and baseline implementations are publicly available at https://github.com/CubiCasa/CubiCasa5k, providing a standardized foundation for further work in automatic floorplan parsing, structural scene understanding, and downstream AR/VR applications (Kalervo et al., 2019).

References

Kalervo, A., Ylioinas, J., Häikiö, M., Karhu, A., & Kannala, J. (2019). CubiCasa5K: A Dataset and an Improved Multi-Task Model for Floorplan Image Analysis. In Proceedings of the Scandinavian Conference on Image Analysis (SCIA).
