Curated Urban Scenes Dataset
- Curated urban scene datasets are systematically collected, annotated, and quality-controlled repositories that capture the complex visual, geometric, and semantic properties of metropolitan environments.
- They integrate diverse modalities including 2D images, 3D point clouds, and sensor-fusion data to enable reproducible benchmarking and robust algorithm evaluation.
- These datasets drive advancements in cooperative perception, semantic segmentation, and urban analytics, supporting innovation in autonomous systems and urban planning research.
A curated dataset of urban scenes is a systematically collected, annotated, and quality-controlled repository designed to capture the geometric, visual, semantic, and sometimes multimodal properties of complex urban environments. Such datasets enable reproducible benchmarking, rigorous comparisons of perception algorithms, and simulation of real-world phenomena under varied urban conditions. The field now encompasses diverse data sources, including ground-based imagery, aerial photogrammetry, LiDAR, multimodal sensor suites, procedural and synthetic scene generation, and even audio-visual and video-language domains.
1. Dataset Types, Modalities, and Scales
Urban scene datasets span multiple sensor modalities and structural levels:
- 2D Image Datasets: Examples include Cityscapes (30 classes, instance and semantic segmentation, fine-grained pixel-level annotation for 5,000 images, resolutions ~2048×1024 across 50 cities) (Cordts et al., 2016), and TMBuD (160 annotated street-view images, façade edges, Timisoara, Romania) (Ciprian et al., 2021). Scene diversity, annotation granularity, and per-class balance differ substantially across these datasets.
- 3D Point Cloud and Mesh Datasets: SensatUrban (2.5–3.75B points, 7.6 km² UK cities, 13 semantic classes, per-point RGB, UAV photogrammetry) (Hu et al., 2022), UrbanBIS (2.5B points, 10.78 km², 3,370 buildings with instance IDs and subcategories) (Yang et al., 2023), TrueCity (real + simulated, cm-accurate registration, 12 CityGML/OpenDRIVE classes) (Nguyen et al., 10 Nov 2025), WHU-PCPR (82.3 km trajectory from both vehicle- and helmet-mounted LiDARs) (Zou et al., 10 Jan 2026).
- Multi-modal and Sensor-fusion Datasets: UrbanIng-V2X (LiDAR, RGB, thermal, IMU, multi-vehicle/multi-infrastructure across 3 intersections) (Sekaran et al., 27 Oct 2025), UrbanLoco (vehicle with LiDAR, 6 cameras, IMU, GNSS, challenging SF/Hong Kong) (Wen et al., 2019).
- Audio-visual and Video-Language Datasets: Urban scene audio-visual corpus (10 scene types, 12,292 clips, binaural audio + video, >12 cities; explicit anonymity protocols) (Wang et al., 2020); UDVideoQA (traffic video, dynamic privacy-preserving blur, 28k question-answer pairs spanning multi-step spatio-temporal reasoning) (Vishal et al., 24 Feb 2026).
- Synthetic Scene Datasets: UrbanSyn (procedural Unity+OctaneRender/PBR with explicit occlusion annotation, 7,539 images, 19 Cityscapes classes) (Gómez et al., 2023), VALERIE22 (high-fidelity Blender scenes, rich metadata: occlusion, pixel-level pose, 11 classes) (Grau et al., 2023), LightCity (Blender/Cycles, outdoor urban blocks, 50k images, 300+ HDRIs, per-pixel inverse rendering modalities) (Wang et al., 1 Feb 2026), SkyScenes (CARLA UAV, 33.6k images, 28 semantic classes, dense weather/time/altitude sweeps) (Khose et al., 2023).
The table below summarizes several leading datasets for urban scene research:
| Dataset | Modality | Size/Scale | Key Annotations |
|---|---|---|---|
| Cityscapes | Image (RGB) | 5k fine, 20k coarse | 30 classes, inst. segm. |
| SensatUrban | 3D Point Cloud | ~3B pts, 7.6 km² | 13 semantic classes |
| UrbanBIS | 3D Point Cloud | 2.5B pts, 10.8 km² | inst. buildings, subcat. |
| UrbanIng-V2X | Multi-modal | 34 × 20 s scenes, 3 intersections | 12 RGB, 12 LiDAR, thermal |
| TrueCity | 3D real+synthetic | 113M real pts, ~100M sim | 12 CityGML, cm-aligned |
| UrbanSyn | Synthetic image | 7,539 images | 19 Cityscapes classes |
| VALERIE22 | Synthetic image | 7 sequences, many frames | Cityscapes classes, rich GT |
| UDVideoQA | Video (RGB) | 16h, 28k QA pairs | Privacy-preserving, QA |
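To make the scale and annotation figures above concrete, the following minimal sketch tallies per-class point frequencies for a SensatUrban-style labeled point cloud tile. The array layout, file path, and synthetic stand-in labels are illustrative assumptions, not part of any official toolkit:

```python
import numpy as np

# Stand-in for a SensatUrban-style tile: in practice this would be an
# N x 7 array of [x, y, z, r, g, b, label] loaded from disk, e.g.
#   tile = np.load("tile_000.npy")   # illustrative path, not official
rng = np.random.default_rng(0)
num_classes = 13                     # SensatUrban's semantic class count
p = np.linspace(2.0, 0.1, num_classes)
labels = rng.choice(num_classes, size=1_000_000, p=p / p.sum())

# Per-class point counts and frequencies -- the statistic behind the
# long-tail figures reported for SensatUrban and UrbanBIS.
counts = np.bincount(labels, minlength=num_classes)
freqs = counts / counts.sum()
for cls_id, f in enumerate(freqs):
    print(f"class {cls_id:2d}: {f:8.4%} of points")
```

On real billion-point datasets the same tally would be run per tile and aggregated, which is why tiling and downsampling protocols (Section 4) matter in practice.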
2. Curation Principles and Annotation Protocols
Curated urban datasets are characterized by explicit selection criteria, systematic annotation, and quality-control strategies:
- Image and Scene Selection: Representative urban typologies (e.g., downtown, residential, campus), lighting/weather stratification, and avoidance of bias toward trivial examples through varied camera positions and architectural diversity (Ciprian et al., 2021, Lyu et al., 2018, Hu et al., 2022).
- Annotation Granularity: Pixel-wise segmentation (semantic and instance), 3D bounding boxes (UrbanIng-V2X), object tracking (unique IDs per sequence), and attribute and subcategory labels (UrbanBIS: 7 function and 3 height classes) (Yang et al., 2023, Sekaran et al., 27 Oct 2025).
- Data Quality and QA: Multi-pass manual review, cross-annotator reconciliation, explicit reporting of class imbalance, and challenge-aware split strategies (e.g., intersection-independent vs spatial splits in UrbanIng-V2X; city/exemplar leave-out in Cityscapes/SensatUrban) (Sekaran et al., 27 Oct 2025, Cordts et al., 2016).
- Temporal, Multimodal, and Multi-agent Alignment: Coordinated recording (e.g., GPS/PTP sync in UrbanIng-V2X, IMU-driven timestamping), inter-sensor calibration (checkerboard, RTK-placed cones, extrinsic/intrinsic parameter recovery), and global referencing (ENU, UTM, or local CRS) (Sekaran et al., 27 Oct 2025, Nguyen et al., 10 Nov 2025); a minimal timestamp-alignment sketch follows this list.
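As an illustration of the alignment step, the sketch below associates camera frames with the nearest LiDAR sweep by timestamp, assuming clocks have already been synchronized (e.g., via GPS/PTP). The function name, tolerance, and sensor rates are hypothetical:

```python
import numpy as np

def match_nearest(cam_ts: np.ndarray, lidar_ts: np.ndarray,
                  tol: float = 0.05) -> np.ndarray:
    """For each camera timestamp, return the index of the nearest
    LiDAR sweep, or -1 if no sweep lies within `tol` seconds."""
    lidar_ts = np.sort(lidar_ts)
    idx = np.searchsorted(lidar_ts, cam_ts)      # right-hand neighbor
    idx = np.clip(idx, 1, len(lidar_ts) - 1)
    left, right = lidar_ts[idx - 1], lidar_ts[idx]
    pick = np.where(np.abs(cam_ts - left) <= np.abs(cam_ts - right),
                    idx - 1, idx)
    matched = np.abs(lidar_ts[pick] - cam_ts) <= tol
    return np.where(matched, pick, -1)

# Example: 10 Hz LiDAR against ~30 Hz cameras with small residual jitter.
lidar_ts = np.arange(0.0, 2.0, 0.1)
cam_ts = np.arange(0.0, 2.0, 1 / 30) + 0.002
print(match_nearest(cam_ts, lidar_ts)[:10])
```

Nearest-timestamp matching is only the final step; the quality of the association rests on the hardware synchronization and calibration procedures described above.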
3. Benchmarking, Data Splits, and Evaluation Metrics
Urban scene datasets typically provide training/validation/test splits, with various strategies to control for scene variability and data leakage:
- Split Protocols: Per-intersection (leave-one-intersection-out, UrbanIng-V2X), per-city (Cityscapes), spatially contiguous tiles (SensatUrban), random and proportional balancing (Sekaran et al., 27 Oct 2025, Cordts et al., 2016, Hu et al., 2022).
- Evaluation Metrics (a minimal mIoU computation is sketched after this list):
- 2D segmentation: Intersection over Union (IoU), mean IoU (mIoU), pixel accuracy, F1-score (Cordts et al., 2016, Ciprian et al., 2021).
- 3D segmentation: Per-class and mean IoU, accuracy per class, overall accuracy (Hu et al., 2022, Yang et al., 2023).
- Building/instance segmentation: Average Precision (AP) at multiple IoU thresholds (e.g., AP@0.5), and mean AP averaged over thresholds (Yang et al., 2023).
- Object detection/tracking: mean Average Precision (mAP); 3D box IoU; vehicle, pedestrian breakdown (Sekaran et al., 27 Oct 2025).
- Place recognition: Recall@K, Precision@K based on spatial distance thresholds (Zou et al., 10 Jan 2026).
- Audio-visual/video QA: overall classification accuracy, fusion method breakdown (Wang et al., 2020, Vishal et al., 24 Feb 2026).
- Baseline Results: SOTA models are routinely benchmarked (DeepLabV3+, MS-Dilation, PointNet/KPConv, Point Transformer v1/v3, HRDA, co-training, F-Cooper, AttFuse, CoBEVT, etc.) (Nguyen et al., 10 Nov 2025, Grau et al., 2023, Lyu et al., 2018).
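The mIoU metric referenced above can be computed from a confusion matrix over flat label arrays. The sketch below follows the standard Cityscapes-style definition; the ignore index and toy labels are illustrative:

```python
import numpy as np

def miou(pred: np.ndarray, gt: np.ndarray, num_classes: int,
         ignore_index: int = 255):
    """Per-class IoU and mean IoU from flattened prediction/label arrays."""
    valid = gt != ignore_index
    pred, gt = pred[valid], gt[valid]
    # Confusion matrix: rows = ground truth, columns = prediction.
    cm = np.bincount(gt * num_classes + pred,
                     minlength=num_classes ** 2).reshape(num_classes,
                                                         num_classes)
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    union = tp + fp + fn
    iou = np.where(union > 0, tp / np.maximum(union, 1), np.nan)
    return iou, np.nanmean(iou)

# Toy check on a 3-class problem.
gt   = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])
per_class, mean = miou(pred, gt, num_classes=3)
print(per_class, mean)   # IoU = [0.333, 0.667, 0.5], mIoU = 0.5
```

The same confusion-matrix machinery underlies per-class accuracy and overall accuracy for the 3D benchmarks; only the aggregation differs.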
4. Key Challenges and Insights from Curation
Several challenges are recurrently identified in the literature:
- Class Imbalance and Long-tail Distribution: In SensatUrban, rails and bikes account for under 0.1% of points, and models fail to capture such minority classes without specialized loss or sampling strategies (Hu et al., 2022); a class-weighting sketch appears at the end of this section.
- Domain Adaptation and Generalization: TrueCity quantifies severe sim–real domain gaps (e.g., PointNet mIoU of 6.03% for a 100% synthetic / 0% real training mix (100S–0R) versus 14.51% for 0S–100R), with the best results obtained by mixing synthetic and real data, particularly for transformer-based architectures (Nguyen et al., 10 Nov 2025).
- Annotation and Data Preparation at Scale: Managing billion-point clouds (SensatUrban, UrbanBIS), full 3D reconstruction, and high-frequency (e.g., 10 Hz) annotation pipelines requires robust tiling, downsampling, and spot-checking protocols (Hu et al., 2022, Yang et al., 2023).
- Spatial and Environmental Diversity: Overfitting to a single intersection, city, or layout produces misleadingly high test results (e.g., UrbanIng-V2X reports a 14 pp mAP drop on unseen intersection splits) (Sekaran et al., 27 Oct 2025).
A plausible implication is that universal benchmarks must span multiple cities, scenes, and acquisition modalities to achieve robust real-world generalization.
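As one concrete instance of the specialized weighting mentioned under class imbalance, the sketch below derives inverse log-frequency class weights from hypothetical per-class point counts. This is a common heuristic rather than the method of any particular paper cited here:

```python
import numpy as np

# Hypothetical per-class point counts with a long tail, mimicking the
# sub-0.1% minority classes (e.g., rails, bikes) reported for SensatUrban.
counts = np.array([5.2e8, 3.1e8, 9.0e7, 4.0e6, 2.5e5, 1.1e5], dtype=float)
freqs = counts / counts.sum()

# Inverse log-frequency weighting, normalized and clipped so that
# extremely rare classes do not dominate the loss entirely.
weights = 1.0 / np.log(1.02 + freqs)
weights = np.clip(weights / weights.mean(), 0.1, 10.0)
print(np.round(weights, 2))

# These weights would then be passed to a class-weighted loss, e.g.
# torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32)).
```

Class-balanced sampling of tiles or crops is the usual complement to such loss weighting when minority classes are spatially clustered.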
5. Applications and Benchmarks in Research
Curated urban scene datasets have enabled multiple research thrusts:
- Cooperative Perception: Multi-agent detection, trajectory prediction, and V2X communication efficiency (UrbanIng-V2X, UrbanLoco) (Sekaran et al., 27 Oct 2025, Wen et al., 2019).
- Semantic and Instance Segmentation: Dense pixel/pointwise labeling across modalities and viewpoints, fine-grained building classification (Cordts et al., 2016, Yang et al., 2023, Lyu et al., 2018).
- Inverse Rendering and Scene Simulation: Controllable illumination, physically-based rendering, evaluation of scale-invariant metrics for relighting and intrinsic decomposition (LightCity) (Wang et al., 1 Feb 2026, Grau et al., 2023).
- Urban Analytics and Planning: Road network extraction, urban heat/vegetation mapping, aerial path planning (UrbanScene3D, Spectrascapes) (Lin et al., 2021, Gupta et al., 14 Apr 2026).
- Multimodal and Language-grounded Reasoning: Video question answering, touristic recommendation and map-based comprehension (TraveLLaMA, UDVideoQA) (Chu et al., 23 Apr 2025, Vishal et al., 24 Feb 2026).
6. Accessibility, Licensing, and Community Impact
Datasets are typically made openly available under research licenses (CC-BY, CC-BY-NC), with detailed repositories, tools, and codebases distributed for straightforward integration:
- Open-access Portals: SensatUrban (http://point-cloud-analysis.cs.ox.ac.uk) (Hu et al., 2022), UrbanIng-V2X (https://github.com/thi-ad/UrbanIng-V2X) (Sekaran et al., 27 Oct 2025), UrbanBIS (https://vcc.tech/UrbanBIS) (Yang et al., 2023), TrueCity (https://tum-gis.github.io/TrueCity/) (Nguyen et al., 10 Nov 2025), Cityscapes (www.cityscapes-dataset.net) (Cordts et al., 2016), UrbanSyn (www.urbansyn.org) (Gómez et al., 2023).
- Tooling and Utilities: ROS bag/file viewers (Complex Urban LiDAR Data Set, UrbanLoco), developer toolkits (OpenCOOD/nuscenes converters), code for Neo/CRF/3D post-processing (UAVid), custom annotation tools (UDVideoQA) (Jeong et al., 2018, Wen et al., 2019, Sekaran et al., 27 Oct 2025, Lyu et al., 2018, Vishal et al., 24 Feb 2026).
Such openly released, richly annotated, and multi-tiered datasets have become foundational resources for scene understanding, multi-agent perception, simulation-to-reality adaptation, and multimodal learning, directly driving advancements in both academic and industrial autonomous systems research.