CubiCasa5K Dataset for Floorplan Analysis
- CubiCasa5K is a large-scale dataset of 5,000 CAD-derived floorplan images with detailed SVG annotations covering over 80 semantic categories.
- It supports multi-task research in semantic segmentation, object detection, and instance extraction, evaluated with standard metrics such as IoU and AP.
- Multi-task models such as MuraNet demonstrate consistent performance gains over single-task baselines, highlighting the dataset's impact on automatic floorplan image analysis.
CubiCasa5K is a large-scale, publicly available dataset designed to support research in automatic floorplan image analysis. Comprising 5,000 floorplan samples with high-resolution raster images and comprehensive vector annotations, CubiCasa5K addresses the scarcity of representative, richly annotated benchmarks for tasks such as semantic segmentation, object detection, and instance extraction in architectural drawings. The dataset encompasses over 80 semantic categories, enabling the investigation and development of advanced multi-task learning models for extracting meaningful structure from real-world floorplan imagery (Huang et al., 2023, Kalervo et al., 2019).
1. Dataset Composition and Annotation Protocol
CubiCasa5K consists of 5,000 floorplan images originally created in CAD and primarily sourced from Finnish real estate marketing materials. Each instance in the dataset includes both a rasterized image and its associated vector annotation in SVG (Scalable Vector Graphics) format. Images vary in original resolution from 430×485 up to 6,316×14,304 pixels (mean ≈1,399×1,597).
Object Classes and Instances
Semantic annotation covers more than 80 object categories, including:
- Rooms (e.g., kitchen, bedroom, bath, living room, hallway)
- Icons (e.g., window, door, sink, appliance, toilet, fireplace)
- Structural components (e.g., walls, railings, storage, chimney, staircase)
Aggregate instance counts (across all images):
- ~68,877 room instances
- ~136,676 icon instances
- ~147,024 wall segments
Each annotated object is defined by a closed polygon with a semantic label. The annotation pipeline employs a structured protocol: wall polygons are drawn first, followed by room segmentation (using wall boundaries), and finally icon/opening placement. Quality control includes annotator self-review and independent inspection.
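Because every annotation is a closed, labeled polygon, simple geometric statistics such as per-room pixel areas follow directly from the vertex lists; a minimal shoelace-formula sketch:

```python
def polygon_area(points):
    """Area of a closed polygon given as (x, y) vertex tuples,
    via the shoelace formula, in squared pixel units."""
    total = 0.0
    for i in range(len(points)):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0
```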
Dataset Subsets and Metadata
Images are stratified into style subcategories:
- Architectural (3,732 images)
- High-quality (plain B/W, 992 images)
- Colorful (276 images)
Each sample is assigned its style metadata, but not an explicit building type. The dataset is split into training, validation, and test subsets (typical splits: 4,200/400/400 or, for some protocols, 4,000/500/500).
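For reference, a deterministic split following the 4,200/400/400 protocol can be sketched as follows (illustrative only; the official release ships fixed split lists):

```python
import random

def make_splits(sample_ids, seed=0, n_train=4200, n_val=400):
    """Deterministically shuffle sample IDs into train/val/test subsets
    (illustrative; does not reproduce the official predefined splits)."""
    ids = sorted(sample_ids)
    random.Random(seed).shuffle(ids)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])
```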
2. Tasks, Evaluation Metrics, and Baselines
CubiCasa5K enables multiple floorplan recognition tasks:
- Semantic segmentation (rooms, icons/openings)
- Heatmap regression for structural/semantic keypoints (e.g., wall junctions, door/window endpoints)
- Postprocessing for structural vector recovery (wall skeletons, room polygons, object polygons)
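For the heatmap-regression task above, training targets are commonly rendered as Gaussian bumps centered on each annotated junction or endpoint; a minimal sketch assuming (x, y) pixel coordinates:

```python
import numpy as np

def render_heatmap(keypoints, height, width, sigma=2.0):
    """Render a regression target with a Gaussian bump at each (x, y)
    keypoint, taking the per-pixel maximum where bumps overlap."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for x, y in keypoints:
        bump = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, bump)
    return heatmap
```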
Quantitative Metrics
For segmentation and detection, the principal evaluation metrics are:
- Pixel Accuracy (per class): $\mathrm{Acc}_c = n_{cc} / t_c$, where $n_{cc}$ is the number of pixels of class $c$ correctly predicted and $t_c$ is the total number of pixels labeled $c$
- Intersection-over-Union (IoU): $\mathrm{IoU}_c = \dfrac{n_{cc}}{t_c + \sum_j n_{jc} - n_{cc}}$, where $n_{jc}$ counts pixels of class $j$ predicted as class $c$
- Mean IoU (mIoU): $\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{IoU}_c$ over all $C$ classes
- Average Precision (AP): $\mathrm{AP} = \int_0^1 p(r)\,dr$, with $p(r)$ as precision at recall $r$
- AP at 0.5 IoU threshold ($\mathrm{AP}_{50}$)
- COCO-style mean AP ($\mathrm{AP}_{@[.5:.95]}$), averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05
For detection, a predicted bounding box counts as a true positive when its IoU with a ground-truth box meets the prescribed threshold. For segmentation, both per-class and overall performance are reported.
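As a concrete reference for the segmentation metrics above, here is a minimal NumPy sketch computing per-class IoU and mIoU from integer-labeled masks (a sketch, not the dataset's official evaluation code):

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Per-class IoU and mIoU from integer-labeled segmentation masks."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious[c] = inter / union
    return ious, np.nanmean(ious)
```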
Baseline Model Performance
On the test split, a ResNet-152-based multi-task model yields the following:
| Task | Overall Acc (%) | Mean Acc (%) | Mean IoU (%) |
|---|---|---|---|
| Rooms | 82.7 | 69.8 | 57.5 |
| Icons | 97.6 | 61.5 | 55.7 |
| Rooms_P | 77.3 | 61.6 | 49.3 |
| Icons_P | 96.7 | 45.3 | 41.6 |
("_P" denotes evaluation on vectorized, polygonal outputs.) (Kalervo et al., 2019)
3. Data Preprocessing and Format
Images and segmentation masks are preprocessed according to the needs of downstream models. For joint segmentation/detection (e.g., in MuraNet), all images are resized to a fixed input resolution chosen to satisfy backbone stride constraints. Area interpolation is employed for down-sampling and cubic interpolation for up-sampling.
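A minimal resizing helper along these lines, assuming OpenCV and an illustrative 512×512 target (the actual input resolution is model-specific):

```python
import cv2

def resize_for_model(img, target=512):
    """Resize a floorplan image to a square target resolution.
    Uses area interpolation when shrinking and cubic when enlarging,
    matching the preprocessing described above. The 512 default is
    illustrative, not a value prescribed by the dataset."""
    h, w = img.shape[:2]
    interp = cv2.INTER_AREA if max(h, w) > target else cv2.INTER_CUBIC
    return cv2.resize(img, (target, target), interpolation=interp)
```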
SVG annotation files can be parsed with standard XML libraries, as in the following example:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes namespaced attributes in Clark notation, so the
# Inkscape label attribute must be looked up with its full namespace URI.
INKSCAPE_LABEL = '{http://www.inkscape.org/namespaces/inkscape}label'

def load_svg_annotation(svg_path):
    """Extract labeled polygons from a CubiCasa5K SVG annotation file."""
    tree = ET.parse(svg_path)
    root = tree.getroot()
    objects = []
    for poly in root.findall('.//{http://www.w3.org/2000/svg}polygon'):
        label = poly.attrib.get(INKSCAPE_LABEL, poly.attrib.get('id'))
        points = [
            tuple(map(float, p.split(',')))
            for p in poly.attrib['points'].split()  # split() tolerates repeated whitespace
        ]
        objects.append({'label': label, 'polygon': points})
    return objects
```
The repository also provides tools to convert these SVG annotations into array masks suitable for deep learning pipelines.
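As a sketch of that conversion (the `label_to_index` mapping is hypothetical, and the repository's own tooling differs in detail), labeled polygons can be rasterized into an integer class mask with Pillow:

```python
import numpy as np
from PIL import Image, ImageDraw

def polygons_to_mask(objects, height, width, label_to_index):
    """Rasterize labeled polygons (as returned by load_svg_annotation)
    into an integer class mask; later polygons overwrite earlier ones."""
    mask = Image.new('I', (width, height), 0)
    draw = ImageDraw.Draw(mask)
    for obj in objects:
        idx = label_to_index.get(obj['label'])  # hypothetical name->class-id map
        if idx is not None:
            draw.polygon(obj['polygon'], fill=idx)
    return np.array(mask, dtype=np.int64)
```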
4. Research Utilization and Model Benchmarks
CubiCasa5K serves as the evaluation basis for single- and multi-task learning approaches in floorplan analysis.
Example: MuraNet Multi-task Model
MuraNet integrates a unified encoder (MURA) with separate branches for segmentation (SegNeXt-inspired) and detection (YOLOX-style). Experiments compare MuraNet versus U-Net (segmentation) and YOLOv3 (detection):
| Model | Wall Seg. IoU (%) | AP₅₀ (%) | AP@[.5:.95] (%) |
|---|---|---|---|
| U-Net | 65.5 (base) | — | — |
| MuraNet | 78.4 (base) | — | — |
| YOLOv3 | — | 89.6 | 49.5 |
| MuraNet | — | 91.7 | 53.8 |
MuraNet delivers higher segmentation IoU and detection AP than the single-task baselines while converging within a few epochs (reaching 78.4% IoU in ~8 epochs, against U-Net's 65.5% in ~6) (Huang et al., 2023).
This joint-task approach exploits architectural priors: walls, doors, and windows are tightly coupled semantically and spatially in architectural layouts, and sharing backbone activations improves feature learning.
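This shared-encoder, two-head pattern can be illustrated with a minimal PyTorch sketch (layer sizes and module names are illustrative assumptions, not the actual MuraNet architecture):

```python
import torch
import torch.nn as nn

class SharedEncoderMultiTask(nn.Module):
    """Minimal shared-encoder multi-task sketch: one backbone's features
    feed both a dense segmentation head and a per-cell detection head,
    so gradients from both tasks shape the shared representation."""
    def __init__(self, num_seg_classes, num_det_classes, channels=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Per-pixel class logits, upsampled back to input resolution.
        self.seg_head = nn.Sequential(
            nn.Conv2d(channels, num_seg_classes, 1),
            nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False),
        )
        # Per-cell box regression (4) + objectness (1) + class logits.
        self.det_head = nn.Conv2d(channels, 4 + 1 + num_det_classes, 1)

    def forward(self, x):
        feats = self.encoder(x)
        return self.seg_head(feats), self.det_head(feats)

model = SharedEncoderMultiTask(num_seg_classes=12, num_det_classes=11)
seg_logits, det_out = model(torch.randn(1, 3, 512, 512))
```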
5. Limitations, Challenges, and Future Directions
CubiCasa5K's focus on Finnish residential floorplans limits stylistic diversity; incorporating a broader range of international architectural styles would improve generalization. The dataset remains limited to 2D geometry; augmenting it with 3D attributes (e.g., wall heights, multi-storey information) would extend its applicability to volumetric scene understanding. Current annotation categories, while numerous, omit fine-grained building systems (e.g., HVAC, electrical wiring).
Key challenges for machine vision in this domain include:
- Extraction from low-texture, high-structure imagery (line drawings with minimal surface cues)
- Handling variable element scales and aspect ratios
- Accounting for semantic relations among spatial primitives (e.g., openings embedded in walls)
Architectures like MuraNet address these challenges with global relation attention and task-decoupled output heads.
Planned future directions, as noted in the original dataset release, include:
- Integrated detection heads for explicit bounding-box regression
- Augmentation with OCR to capture textual room labels
- Direct polygonal instance segmentation methodologies (Kalervo et al., 2019)
6. Availability and Impact
CubiCasa5K and reference model code are available via https://github.com/CubiCasa/CubiCasa5k. The dataset has established itself as the de facto benchmark for floorplan parsing, supporting research in 2D-to-3D apartment reconstruction, augmented reality, architectural CAD conversion, indoor navigation, and real estate analytics (Kalervo et al., 2019, Huang et al., 2023). Its scale, annotation richness, and open accessibility position it as a critical resource for advancing computational methods in structural scene understanding and related applications.