Cityscapes Dataset Overview
- Cityscapes is a benchmark dataset offering high-resolution, stereo urban scenes with comprehensive pixel-level and instance-level annotations across 50 cities.
- It supports diverse tasks such as semantic segmentation, instance segmentation, 3D detection, and part-level parsing, evaluated using metrics like mIoU and AP.
- The dataset’s ecosystem includes specialized extensions (e.g., Cityscapes-Panoptic-Parts, CityPersons) that enhance research in occlusion handling and fine-grained object recognition.
The Cityscapes dataset is a large-scale benchmark suite designed to advance pixel-level and instance-level semantic understanding of complex urban street scenes. Captured in 50 cities—primarily in Germany but also in neighboring countries—it is distinguished by high-resolution stereo video imagery and richly detailed, per-pixel ground truth. Cityscapes has become a central reference for evaluating deep learning models in semantic segmentation, instance segmentation, 3D detection, part-level parsing, amodal scene understanding, and object attribute recognition for autonomous driving. Its extensibility has resulted in a broad ecosystem of specialized derivatives, including Cityscapes-Panoptic-Parts, Cityscapes 3D, CityPersons, Amodal Cityscapes, and the Cityscapes Attributes Recognition (CAR) dataset.
1. Dataset Composition, Splits, and Label Hierarchy
Cityscapes comprises several hundred thousand stereo frames, of which a manually curated subset of 5,000 images carries fine pixel-level annotations and a further 20,000 carry coarse annotations. The finely annotated portion forms the standard basis for training and benchmarking and is split at the city level into 2,975 training, 500 validation, and 1,525 test images. The splits are chosen to balance city size, geographic region, and season, so that each one reflects the diversity of the full dataset (Cordts et al., 2016).
Fine annotation covers 30 classes of visual entities grouped into eight high-level categories (flat, construction, nature, vehicle, sky, object, human, and void), though only 19 of these classes are included in standard benchmark evaluations. Pixel-level annotations are produced as layered polygons under a strict back-to-front labeling protocol, so that occluding objects are rendered over the objects behind them and region connectivity stays precise. Instance-level annotations are provided for all human and vehicle classes, delineating each object instance as a distinct segmentation mask (Cordts et al., 2016).
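The reduction from the full label set to the 19 evaluated classes is usually applied as a lookup table over the label image. The sketch below assumes the raw-id-to-train-id convention used by the public cityscapesScripts tooling, which is not stated in the text above; `IGNORE_ID`, `build_lookup`, and `to_train_ids` are hypothetical names chosen for illustration.

```python
import numpy as np

# Raw label ids of the 19 evaluated classes, listed in train-id order (0..18).
# Assumption: this follows the cityscapesScripts labeling convention.
EVAL_LABEL_IDS = [7, 8, 11, 12, 13, 17, 19, 20, 21, 22,
                  23, 24, 25, 26, 27, 28, 31, 32, 33]
IGNORE_ID = 255  # assumed marker for pixels excluded from evaluation


def build_lookup(num_raw_ids: int = 256) -> np.ndarray:
    """Return a table mapping every raw id to a train id or IGNORE_ID."""
    lut = np.full(num_raw_ids, IGNORE_ID, dtype=np.uint8)
    for train_id, raw_id in enumerate(EVAL_LABEL_IDS):
        lut[raw_id] = train_id
    return lut


def to_train_ids(label_map: np.ndarray) -> np.ndarray:
    """Convert an integer label image of raw ids to train ids."""
    return build_lookup()[label_map]
```

Classes outside the evaluated subset collapse to a single ignore value, so a loss or metric can skip them with one comparison.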
2. Benchmark Tasks and Evaluation Metrics
Cityscapes defines two primary tasks: semantic segmentation (per-pixel class prediction) and instance-level segmentation (object-level mask and score prediction). For semantic segmentation, the principal metric is mean Intersection over Union (mIoU) across the 19-class subset. This is formulated as
$$\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{|P_c \cap G_c|}{|P_c \cup G_c|}$$
where $P_c$ and $G_c$ are the predicted and ground-truth pixels for class $c$, and $C = 19$ is the number of evaluated classes. Category-level IoU (averaging over broader semantic groupings) and instance-normalized IoU (iIoU), which weights contributions by relative instance size, are also employed (Cordts et al., 2016).
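A minimal NumPy sketch of the mIoU computation follows. It accumulates a confusion matrix and reads per-class intersection and union from it. The function names and the ignore value of 255 are assumptions made for illustration, predictions are assumed to lie in `[0, num_classes)`, and classes that appear in neither prediction nor ground truth are left out of the mean.

```python
import numpy as np


def confusion_matrix(pred, gt, num_classes=19, ignore_id=255):
    """Rows index ground-truth classes, columns index predicted classes."""
    pred = np.asarray(pred).ravel()
    gt = np.asarray(gt).ravel()
    valid = gt != ignore_id  # drop pixels excluded from evaluation
    idx = gt[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def mean_iou(conf):
    """Average |P_c ∩ G_c| / |P_c ∪ G_c| over classes with a non-empty union."""
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp  # |P_c| + |G_c| - overlap
    present = union > 0
    return float((tp[present] / union[present]).mean())
```

For a dataset-level score, the confusion matrices of all images would be summed before calling `mean_iou`, rather than averaging per-image values.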
Instance segmentation is evaluated via a region-level Average Precision (AP) metric, following the MS COCO protocol, with IoU thresholds ranging from 0.5 to 0.95. Specific sub-scores for objects within 50 m and 100 m are reported to quantify performance at varying distances. Oracle experiments confirm that the bottleneck for instance segmentation is the accuracy of instance proposal generation, particularly in dense, cluttered urban scenes (Cordts et al., 2016).
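The threshold-averaged AP can be sketched as follows for a single class. This is a simplified illustration rather than the benchmark's evaluation code: it assumes a precomputed mask-IoU matrix between predictions and ground-truth instances, uses greedy matching in descending score order, and ignores details such as void regions and the distance-limited sub-scores. All function names are hypothetical.

```python
import numpy as np


def average_precision(scores, iou, thr):
    """AP at one IoU threshold; iou[i, j] is the IoU of prediction i and GT j."""
    iou = np.asarray(iou, dtype=np.float64)
    order = np.argsort(-np.asarray(scores, dtype=np.float64))
    num_gt = iou.shape[1]
    matched = np.zeros(num_gt, dtype=bool)
    tp = np.zeros(len(order))
    for rank, i in enumerate(order):
        # Each ground-truth instance may be matched by at most one prediction.
        candidates = np.where(~matched & (iou[i] >= thr))[0]
        if candidates.size:
            best = candidates[np.argmax(iou[i, candidates])]
            matched[best] = True
            tp[rank] = 1.0
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(order)) + 1)
    recall = cum_tp / max(num_gt, 1)
    # Make precision non-increasing, then integrate it over recall.
    for k in range(len(precision) - 2, -1, -1):
        precision[k] = max(precision[k], precision[k + 1])
    recall_steps = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(recall_steps * precision))


def ap_over_thresholds(scores, iou):
    """Mean AP over the ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = np.linspace(0.5, 0.95, 10)
    return float(np.mean([average_precision(scores, iou, t) for t in thresholds]))
```

Averaging over thresholds rewards masks that are tightly aligned with the ground truth, not just roughly overlapping.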
3. Derived Cityscapes Datasets and Extensions
Cityscapes-Panoptic-Parts
Cityscapes-Panoptic-Parts (CS-PP) extends fine annotations to include part-level masks for five "thing" classes (person, rider, car, truck, bus), resulting in a three-level hierarchical label format: semantic, instance, and part. The panoptic_parts_id is encoded as
$$\text{panoptic\_parts\_id} = \text{semantic\_id} \times 10^{5} + \text{instance\_id} \times 10^{2} + \text{part\_id}$$
with 23 part classes in total (e.g., torso, head, arm, leg for persons; window, wheel, chassis for vehicles), supporting fine-grained part-aware segmentation (Meletis et al., 2020). Tools for hierarchical PNG/JSON management, visualization, and COCO-format conversion are provided.
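To make the hierarchical id concrete, the sketch below packs and unpacks ids under an assumed decimal layout: up to two digits for the semantic id, three for the instance id, and two for the part id, with shorter ids used when the lower levels are absent. The layout and the helper names `encode_uid` and `decode_uid` are assumptions for illustration and should be checked against the CS-PP tooling before use.

```python
from typing import Optional, Tuple


def encode_uid(sid: int, iid: Optional[int] = None, pid: Optional[int] = None) -> int:
    """Pack semantic, instance, and part ids into one decimal-coded integer."""
    if not 0 <= sid <= 99:
        raise ValueError("semantic id must fit in two digits")
    if iid is None:
        if pid is not None:
            raise ValueError("this sketch requires an instance id when a part id is given")
        return sid
    if not 0 <= iid <= 999:
        raise ValueError("instance id must fit in three digits")
    if pid is None:
        return sid * 1000 + iid
    if not 0 <= pid <= 99:
        raise ValueError("part id must fit in two digits")
    return sid * 100_000 + iid * 100 + pid


def decode_uid(uid: int) -> Tuple[int, Optional[int], Optional[int]]:
    """Unpack an id into (semantic, instance, part); absent levels are None."""
    if not 0 <= uid <= 9_999_999:
        raise ValueError("id out of range for the assumed layout")
    if uid <= 99:
        return uid, None, None
    if uid <= 99_999:
        return uid // 1000, uid % 1000, None
    return uid // 100_000, (uid // 100) % 1000, uid % 100
```

For example, under this layout the id 2600203 would decode to semantic id 26, instance 2, part 3.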
Cityscapes 3D
Cityscapes 3D augments the original dataset with 9-DoF (degrees of freedom) 3D bounding box annotations for all vehicle types, using only stereo RGB imagery and per-image calibration. This enables pixel-accurate projection of 3D boxes in the image plane and supports paired 2D–3D instance mapping. The benchmark introduces the mean Detection Score (mDS), which combines 2D AP, BEV center distance, yaw similarity, pitch-roll similarity, and size similarity for comprehensive 3D detection ranking (Gählert et al., 2020, Ye et al., 2023). The dataset covers ≈27,800 vehicle annotations in train/val, and is particularly suited for multi-task learning (joint 2D and 3D scene understanding).
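As a rough illustration of how such a combined score can be computed, the sketch below assumes that the per-class detection score multiplies 2D AP by the unweighted mean of the four similarity terms, and that mDS averages this value over classes. That weighting is an assumption here and should be checked against the benchmark definition (Gählert et al., 2020); the function and argument names are hypothetical.

```python
def detection_score(ap, bev_cd, yaw_sim, pr_sim, size_sim):
    """Assumed form: AP scaled by the mean of four similarity terms in [0, 1]."""
    return ap * (bev_cd + yaw_sim + pr_sim + size_sim) / 4.0


def mean_detection_score(per_class):
    """Average the detection score over a dict of per-class metric dicts."""
    scores = [detection_score(**metrics) for metrics in per_class.values()]
    return sum(scores) / len(scores)
```

Because AP acts as a multiplier in this form, a detector with precise 3D estimates but poor 2D detection still receives a low score.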
CityPersons
CityPersons provides a high-quality pedestrian detection benchmark, layering ≈35,000 full-body and visible-body aligned bounding boxes over Cityscapes' fine split. The annotation covers four categories: pedestrian, rider, sitting person, and “other,” with both full-amodal and visible bounding box definitions computed per instance. This dataset enables robust cross-domain evaluations and supports fine-grained analysis of occlusion and scale variation, with log-average miss rate as the chief metric (Zhang et al., 2017).
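The log-average miss rate can be sketched as below. The sketch assumes the convention common in pedestrian detection of sampling the miss-rate curve at nine false-positives-per-image (FPPI) values spaced evenly in log space between 0.01 and 1, then taking the geometric mean. The sampling range, the fallback value of 1.0, and the function name are assumptions, and the input curve is assumed to be sorted by increasing FPPI.

```python
import numpy as np


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of miss rates sampled at log-spaced FPPI references."""
    fppi = np.asarray(fppi, dtype=np.float64)
    miss_rate = np.asarray(miss_rate, dtype=np.float64)
    refs = np.logspace(-2.0, 0.0, num_points)
    sampled = np.ones(num_points)  # miss rate 1.0 if the curve never reaches a reference
    for k, ref in enumerate(refs):
        below = np.where(fppi <= ref)[0]
        if below.size:
            sampled[k] = miss_rate[below[-1]]  # last operating point at or below ref
    # Clamp before the logarithm so a zero miss rate does not produce -inf.
    return float(np.exp(np.mean(np.log(np.maximum(sampled, 1e-10)))))
```

Lower values are better, and the log spacing gives equal weight to each order of magnitude of false-positive rate.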
Amodal Cityscapes
Amodal Cityscapes is a synthetic extension for amodal semantic segmentation. It generates images where occluders are copy-pasted from other Cityscapes images, providing two-layer pixel-wise labels: visible semantics and semantics beneath occlusion. The dataset enables training and evaluation with metrics such as visible mIoU, invisible mIoU (restricted to occluded pixels), and total mIoU. Baseline results show substantial gains in occluded-region labeling using a multi-branch amERFNet system (Breitenstein et al., 2022).
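The copy-paste construction can be illustrated with a minimal sketch that composites one occluder onto a target image and derives the two label layers. It is not the published generation pipeline: the function name, the ignore value of 255 outside the occluded region, and the assumption that source and target share the same resolution are all choices made for illustration.

```python
import numpy as np

IGNORE_ID = 255  # assumed marker for pixels with nothing hidden beneath them


def paste_occluder(target_img, target_lbl, source_img, source_lbl, occluder_mask):
    """Paste the masked source pixels onto the target and build both label layers.

    Returns (composite image, visible labels, hidden labels). The hidden layer
    holds the target's original class wherever the occluder now covers it.
    """
    mask = np.asarray(occluder_mask, dtype=bool)
    composite = np.where(mask[..., None], source_img, target_img)
    visible_lbl = np.where(mask, source_lbl, target_lbl)
    hidden_lbl = np.full_like(target_lbl, IGNORE_ID)
    hidden_lbl[mask] = target_lbl[mask]
    return composite, visible_lbl, hidden_lbl
```

An invisible-region metric would then be evaluated only where `hidden_lbl` differs from `IGNORE_ID`, while the visible metric uses `visible_lbl` over the whole image.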
Cityscapes Attributes Recognition (CAR)
The CAR annotation layer introduces structured per-instance attributes (e.g., status, form, direction for vehicles and pedestrians; type and status for traffic lights/signs). Over 32,000 instances are annotated in the fine split. Labeling uses a category-specific taxonomy, with platform-driven consensus and multiple annotator redundancy for quality. Queries and visualization are facilitated by an open-source CAR-API Python library (Metwaly et al., 2021).
4. Experimental Protocols, Model Architectures, and Augmentations
Cityscapes data is commonly processed at its full resolution of 1024×2048 px, with training typically using random cropping and standard normalization, though some experiments use low-resolution proxies (96×256 px) for computational efficiency (Hernández-Cámara et al., 2022). U-Net variants, deep transformers, and multi-head models have all been evaluated on the dataset, whose annotations support both pixel-level and instance-level loss formulations.
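A typical preprocessing step of this kind can be sketched as a joint random crop of image and label followed by per-channel normalization. The crop size and the mean and standard deviation values below are placeholders (the latter are the widely used ImageNet statistics), not settings taken from any of the cited works, and the function name is hypothetical.

```python
import numpy as np

# Placeholder normalization statistics (ImageNet RGB values); an assumption.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def random_crop_and_normalize(image, label, crop_hw=(512, 1024), rng=None):
    """Crop the same window from image (H, W, 3) and label (H, W), then normalize."""
    if rng is None:
        rng = np.random.default_rng()
    height, width = label.shape
    crop_h, crop_w = crop_hw
    if crop_h > height or crop_w > width:
        raise ValueError("crop is larger than the input")
    top = int(rng.integers(0, height - crop_h + 1))
    left = int(rng.integers(0, width - crop_w + 1))
    img = image[top:top + crop_h, left:left + crop_w].astype(np.float32) / 255.0
    lbl = label[top:top + crop_h, left:left + crop_w]
    return (img - MEAN) / STD, lbl
```

Cropping image and label with the same offsets keeps the pixel-wise correspondence intact, which is the property that matters for segmentation losses.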
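Models with architecturally integrated divisive normalization (DN) layers in encoder blocks exhibit improved invariance to contrast and illumination changes, with a reported mIoU improvement of ∼7% (clear) and up to ∼18% (heavy fog) relative to baseline U-Nets (Hernández-Cámara et al., 2022). The DN module follows the canonical divisive-normalization form
$$y_i = \frac{x_i}{\left(\beta_i + \sum_j H_{ij}\,|x_j|^{\alpha}\right)^{\varepsilon}}$$
where $H$ encodes channel-wise and spatial pooling weights, and $\beta$ is a channel bias; $\alpha$ and $\varepsilon$ are the exponents of the pooled activations and of the normalization pool.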
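The effect of divisive normalization can be illustrated with a small NumPy sketch. It assumes the generic form in which each response is divided by a bias plus a weighted pool of neighboring response magnitudes, raised to an exponent. For brevity it pools across channels only and omits the spatial pooling, and the exponents `alpha` and `eps` and the function name are assumptions rather than the cited model's settings.

```python
import numpy as np


def divisive_normalization(x, weights, beta, alpha=1.0, eps=1.0):
    """Channel-wise DN on a feature map x of shape (C, rows, cols).

    weights: (C, C) non-negative pooling weights between channels.
    beta:    (C,) positive per-channel bias.
    """
    energy = np.abs(x) ** alpha
    # pooled[i] = sum_j weights[i, j] * energy[j] at every spatial position
    pooled = np.tensordot(weights, energy, axes=([1], [0]))
    denom = (beta[:, None, None] + pooled) ** eps
    return x / denom
```

When the bias is small relative to the pooled activity, scaling the input by a constant leaves the output nearly unchanged, which is the mechanism behind the contrast invariance described above.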
Augmentation pipelines extend training to rare VRU (vulnerable road user) scenarios via 3D CAD pedestrian insertion, spawn/collision masking, 3D rendering, and adversarial translation to match real-scene illumination (e.g., CycleGAN with class-specific PatchGAN discriminators and cost-sensitive weighting). These augmentations demonstrably improve person AP and IoU without degrading other-class metrics (Savkin et al., 2025).
Multi-task learning frameworks, such as TaskPrompter, leverage prompt-based Swin transformers for joint semantic segmentation, monocular 3D detection, and depth estimation. Joint optimization over task-generic and task-specific token spaces and cross-task attention yield performance improvements relative to single-task or naive multi-task baselines (Ye et al., 2023).
5. Practical Applications and Impact on Autonomous Driving Research
Cityscapes is foundational for multiple research domains:
- Robust deep segmentation: High-resolution, large-scale, and dense scenes necessitate computation- and memory-efficient architectures preserving spatial detail and enabling rigorous evaluation on fine object classes, including crowded and occluded instances (Cordts et al., 2016).
- Pedestrian and VRU detection: CityPersons and augmentations for pedestrian synthesis specifically address generalization and occlusion challenges pivotal for safety-critical perception (Zhang et al., 2017, Savkin et al., 2025).
- 3D scene understanding: Stereo-derived depth and 9-DoF vehicle boxes advance monocular 3D detection, spatial reasoning, and joint 2D–3D learning pipelines for autonomous navigation (Gählert et al., 2020, Ye et al., 2023).
- Fine-grained behavior recognition: The CAR dataset unlocks per-object attribute modeling, enabling context-aware policy learning for behaviors such as intention prediction and risk assessment (Metwaly et al., 2021).
- Amodal and part-level parsing: Extensions such as Amodal Cityscapes and CS-PP support richer semantic reasoning under occlusion and at the compositional level, critical for complex road-scene understanding (Meletis et al., 2020, Breitenstein et al., 2022).
Models trained on Cityscapes demonstrate superior transfer to CamVid and KITTI benchmarks, confirming its value as a generalization driver (Cordts et al., 2016).
6. Access, Tooling, and Community Ecosystem
Data, labels, and code are distributed via https://www.cityscapes-dataset.com and associated GitHub repositories, maintaining alignment among various splits and extensions:
| Extension | Added Annotations | API/Tooling |
|---|---|---|
| Cityscapes-Panoptic-Parts (CS-PP) | Part-level masks (23 part classes) | panoptic_parts.py, COCO export |
| Cityscapes 3D | 9-DoF 3D vehicle boxes | bbox3d/ JSONs, Python loader |
| CityPersons | Full-body and visible-body pedestrian boxes | Aligned bounding boxes, occlusion statistics |
| Amodal Cityscapes | Visible/amodal labels | PyTorch training scripts, generation pipelines |
| CAR | Per-instance attributes | car-api Python library, visualization, querying |
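Because the annotation layers are aligned with the base images, pairing an image with its labels is mostly a matter of path manipulation. The sketch below assumes the directory and file naming of the standard Cityscapes download (`leftImg8bit/<split>/<city>/..._leftImg8bit.png` for images and `gtFine/<split>/<city>/..._gtFine_<kind>.png` for fine labels); the helper name and its `kind` parameter are hypothetical, and the extensions in the table may use different layouts.

```python
from pathlib import PurePosixPath


def label_path_for(image_path: str, kind: str = "labelIds") -> str:
    """Map a leftImg8bit image path to the matching gtFine label path."""
    path = PurePosixPath(image_path)
    parts = list(path.parts)
    if "leftImg8bit" not in parts[:-1]:
        raise ValueError("expected a 'leftImg8bit' directory in the path")
    parts[parts.index("leftImg8bit")] = "gtFine"
    name = path.name.replace("_leftImg8bit.png", f"_gtFine_{kind}.png")
    return str(PurePosixPath(*parts[:-1]) / name)
```

Passing `kind="instanceIds"` would point at the instance-level label image under the same assumed naming scheme.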
The Cityscapes suite is widely used for research publications, competitions, and as a foundational layer for developing and benchmarking autonomous driving perception models across the research community (Cordts et al., 2016, Zhang et al., 2017, Gählert et al., 2020, Meletis et al., 2020, Breitenstein et al., 2022, Metwaly et al., 2021, Ye et al., 2023, Hernández-Cámara et al., 2022, Savkin et al., 2025).