- The paper presents the CROWD dataset, which provides minute-scale, continuous urban dashcam footage to overcome dataset bias in current driving studies.
- It describes how all 51,753 segments were pre-annotated with YOLOv11x detections and BoT-SORT tracks, giving users reproducible object detection and tracking baselines.
- The dataset's coverage of 7,103 localities in 238 countries and territories supports cross-domain evaluations and socioeconomic traffic analyses.
A Global Dataset of Continuous Urban Dashcam Driving: Overview and Contributions
Motivation and Positioning
The paper introduces the CROWD (City Road Observations With Dashcams) dataset, which addresses longstanding limitations in naturalistic urban driving datasets for computer vision and driving behavior research. CROWD diverges from prevailing datasets by curating minute-scale, temporally linked, unedited, front-facing dashcam footage drawn from globally distributed YouTube uploads. The curation protocol explicitly excludes crash-oriented or incident-focused segments, centering the data on routine urban driving under highly diverse, ecologically valid conditions. This focus directly targets dataset bias and domain shift issues that have limited transferability of perception and interaction models developed on geographically or contextually narrow driving datasets.
Dataset Construction and Characteristics
CROWD encompasses 51,753 curated segment records, corresponding to over 20,275 hours of footage, from 42,032 unique YouTube uploads. Geographic coverage spans 7,103 named localities in 238 countries and territories across all inhabited continents. The collection protocol prioritized urban environments and strictly excluded edited, non-continuous, or incident-centric segments, yielding extended natural sequences suited to behavioral and robustness analyses.
Each segment is annotated with manual labels for time of day (day/night) and platform type (vehicle class), producing reliable metadata for stratified analysis. Object detection outputs, covering all MS COCO classes, and segment-local multi-object tracks were generated using YOLOv11x coupled with BoT-SORT tracking, and distributed as ready-to-use per-segment CSV files. This approach streamlines benchmarking and circumvents the requirement for users to rerun computationally expensive detection pipelines.
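As an illustration of how the per-segment CSV files might be consumed, the following minimal sketch counts unique tracks per object class. The column names (`frame`, `track_id`, `class_id`, `confidence`, and the box coordinates) and the sample rows are assumptions made for this example, not the released schema; the class indices 0 (person) and 2 (car) follow the usual MS COCO ordering used by YOLO models.

```python
import csv
import io
from collections import defaultdict

# Synthetic rows with a hypothetical column layout; the released
# CSV schema may use different names or ordering.
SAMPLE_CSV = """frame,track_id,class_id,x1,y1,x2,y2,confidence
0,1,0,100,200,150,320,0.91
0,2,2,400,220,640,380,0.88
1,1,0,102,201,152,321,0.90
1,3,0,700,210,740,300,0.55
2,2,2,405,222,645,382,0.87
"""

# Small subset of MS COCO class indices (0 = person, 2 = car).
COCO_NAMES = {0: "person", 2: "car"}


def unique_tracks_per_class(csv_text, min_conf=0.0):
    """Count distinct track IDs per class, ignoring low-confidence rows."""
    tracks = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if float(row["confidence"]) < min_conf:
            continue
        tracks[int(row["class_id"])].add(int(row["track_id"]))
    return {COCO_NAMES.get(c, str(c)): len(ids) for c, ids in tracks.items()}


counts = unique_tracks_per_class(SAMPLE_CSV)
print(counts)  # {'person': 2, 'car': 1}
```

Counting track IDs rather than rows avoids inflating counts for objects that stay in view across many frames.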
Notably, the dataset is distributed as segment indices and derived annotations (detections, tracks), with no rehosting of the original video content. Users retrieve primary videos directly from YouTube using provided identifiers and segment boundaries, ensuring legal compliance and reducing data redistribution friction.
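A minimal sketch of turning a segment index entry into a reference to the source video is shown below. The record layout (video identifier, start second, end second) is a hypothetical stand-in for the released mapping format; only the standard YouTube watch-URL form with a `t` start offset is relied on.

```python
from urllib.parse import urlencode


def segment_url(video_id, start_s):
    """Return a YouTube watch URL that opens at the segment start."""
    query = urlencode({"v": video_id, "t": f"{int(start_s)}s"})
    return f"https://www.youtube.com/watch?{query}"


# Hypothetical segment record: (video_id, start_seconds, end_seconds).
record = ("VIDEO_ID_HERE", 95, 395)

url = segment_url(record[0], record[1])
duration_s = record[2] - record[1]  # segment length in seconds
print(url, duration_s)
```

Because only identifiers and boundaries are shipped, a segment whose source upload is later removed from YouTube cannot be recovered from the index alone.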
Comparative Perspective and Domain Implications
Existing urban driving and VRU datasets (e.g., KITTI [geiger2012we], Cityscapes [cordts2016cityscapes], BDD100K [yu2020bdd100k], JAAD [kotseruba2016joint], EuroCity Persons [braun2018eurocity], D2-City [che2019d]) have enabled benchmark-driven progress in perception, tracking, and forecasting. However, nearly all have significant limitations: narrow geographic coverage (often one or a few cities/countries), focus on short, discrete scenarios or selected events rather than natural temporal streams, and sensor perspectives that do not always align with the consumer dashcam geometry prevalent in deployed ADAS. As a result, model performance and generalization across unseen domains remain hampered by distribution shift [torralba2011unbiased].
CROWD contrasts sharply in scope and design:
- Temporal scale and continuity: Retained segments (~5 minutes) are substantially longer than the typical 15–40 second clips in major benchmarks, enabling extended context analyses for exposure, human-vehicle interactions, and locomotor behavior.
- Geographic and cultural diversity: With coverage of 238 countries and territories and 7,103 localities, the dataset is less tied to the idiosyncrasies of any single region.
- Routine-driving capture: Selection rules systematically avoid the incident-focused bias (crashes, near-misses, confrontations) pervasive in web-mined dashcam corpora and many prior benchmarks [fang2019dada].
- Object detection and tracking: All segments are pre-annotated with frame-level bounding box outputs and track IDs, supporting reproducible baselines and large-scale cross-benchmark experiments.
- Rich contextual and socioeconomic metadata: Locality records are enhanced with population statistics, road traffic mortality, income indices, and traffic indices to enable multilevel or cross-sectional correlational analysis.
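The kind of cross-sectional analysis that the locality metadata enables can be sketched as a rank correlation between a contextual indicator and a detection-derived measure. The example below is a plain-Python Spearman correlation on synthetic numbers; the variable names and values are invented for illustration and are not taken from the dataset.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Synthetic locality-level values (hypothetical, for illustration only).
income_index = [1.0, 2.0, 3.0, 4.0]
pedestrians_per_minute = [80.0, 40.0, 20.0, 10.0]

rho = spearman(income_index, pedestrians_per_minute)
print(rho)  # -1.0 for this strictly decreasing toy relationship
```

A rank-based measure is used here because socioeconomic indicators are often skewed; any real analysis would also need to account for uneven segment counts per locality.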
Technical Validation
A comprehensive validation pipeline was implemented to ensure consistency and integrity:
- Structural alignment of mapping and metadata tables, with parsing of list-encoded segment information.
- Recalculation and verification of all global dataset statistics, with explicit reproduction of the aggregate measures reported in the manuscript.
- Integrity checking of per-segment detection files, including correspondence to mapping indices and detection file coverage rates.
- Hash-based file manifest for reproducibility and future benchmarking.
The release protocol and dataset structure are optimized to minimize ambiguity in content referencing and annotation alignment.
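Two of the checks listed above, parsing list-encoded segment fields and building a hash-based manifest, can be sketched as follows. This is not the authors' validation code; the function names, the example field values, and the choice of SHA-256 are assumptions made for illustration.

```python
import ast
import hashlib


def parse_list_field(cell):
    """Parse a list stored as text, e.g. "[0, 300]", into a Python list."""
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"expected a list, got {type(value).__name__}")
    return value


def check_segments(starts, ends):
    """Return indices of segments whose boundaries are inconsistent."""
    if len(starts) != len(ends):
        raise ValueError("start and end lists differ in length")
    return [i for i, (s, e) in enumerate(zip(starts, ends)) if not 0 <= s < e]


def manifest_entry(name, payload):
    """Return (file name, SHA-256 hex digest) for one file's bytes."""
    return name, hashlib.sha256(payload).hexdigest()


# Hypothetical list-encoded cells from a mapping table row.
starts = parse_list_field("[0, 600, 900]")
ends = parse_list_field("[300, 900, 850]")

bad = check_segments(starts, ends)  # third segment ends before it starts
entry = manifest_entry("segment_0.csv", b"frame,track_id\n")
print(bad, entry)
```

`ast.literal_eval` is used instead of `eval` so that table cells are parsed as literals only and cannot execute code.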
Limitations and Usage Considerations
Despite its scale and breadth, some limitations persist:
- Day/night imbalance: Most segments are daytime; analyses requiring robust nighttime coverage must apply stratification or weighting.
- Geographic data density: Coverage in Africa and parts of South America is sparser relative to population size, which limits cross-continent comparability.
- Content sensitivity: Although no video is redistributed, the underlying YouTube content may show faces or license plates. The dataset adheres to ethical guidelines, and responsibility for privacy-sensitive handling rests with downstream users.
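One simple way to handle the day/night imbalance noted above is to weight segments so that each stratum contributes equally to an aggregate. The sketch below uses inverse-frequency weights on synthetic labels and values; it is one possible weighting scheme, not a procedure prescribed by the paper.

```python
from collections import Counter


def stratum_weights(labels):
    """Weight each sample so every stratum has the same total weight."""
    counts = Counter(labels)
    n_strata = len(counts)
    total = len(labels)
    return [total / (n_strata * counts[lab]) for lab in labels]


# Synthetic per-segment labels and a per-segment measurement.
labels = ["day", "day", "day", "night"]
values = [10.0, 12.0, 14.0, 30.0]

weights = stratum_weights(labels)
weighted_mean = sum(w * v for w, v in zip(weights, values)) / sum(weights)
print(weighted_mean)
```

With these weights the result equals the mean of the per-stratum means (12 for day and 30 for night, giving 21), rather than the unweighted mean that daytime segments would dominate.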
All detection and tracking outputs are COCO-class-bound and do not cover lane geometry or full traffic sign taxonomies. The object tracker is re-initialized per segment, so track IDs are not persistent across discontinuous sequences.
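Because track IDs restart in every segment, pooling detections across segments requires a composite key. A minimal sketch follows, assuming hypothetical `video_id` and `segment_index` fields alongside the track ID:

```python
def global_track_key(video_id, segment_index, track_id):
    """Build a key that is unique across segments, not just within one."""
    return f"{video_id}:{segment_index}:{track_id}"


# Synthetic rows: (video_id, segment_index, track_id).
rows = [("vidA", 0, 1), ("vidA", 1, 1), ("vidB", 0, 1), ("vidA", 0, 1)]

keys = {global_track_key(*r) for r in rows}
print(len(keys))  # 3 distinct tracks, although every row has track_id 1
```

Such a key distinguishes tracks but does not re-identify the same physical object across segments; that would need a separate matching step.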
Implications and Future Directions
The CROWD dataset substantially expands the resources available for studying domain generalization, robustness, spatiotemporal modeling, and behavioral ecology in traffic contexts. Its scale and scope support research into:
- Cross-domain evaluation and adaptation of perception and interaction models across highly variable environments.
- Behavioral science studies requiring representative, longitudinal samples of routine urban exposure and VRU-vehicle interaction metrics.
- Socioeconomic and urban science applications linking contextual indicators (e.g., traffic mortality, income inequality) to observed traffic dynamics.
- Development of protocols and tools for scalable, low-cost curation of web video as a data source for AI and transportation research.
Extension opportunities include annotation expansion (more diverse label sets, higher-level behaviors), systematic inclusion of rural or adverse driving, and community-driven continuous curation to manage upload churn and evolving web content.
Conclusion
CROWD establishes a new benchmark for global, continuous, routine urban dashcam video collection. Its scale, curation rigor, and annotation strategy directly support advances in generalization and real-world applicability of driving AI and interaction modeling. The dataset’s design enables both large-scale perception research and novel multimodal, multidomain behavioral analyses, reducing dataset bias and supporting reproducibility through transparent metadata, code release, and detailed structural validation.
Citation: "A global dataset of continuous urban dashcam driving" (2604.01044)