DeepScenario 3D Dataset
- DeepScenario Dataset is a publicly available, occlusion-free 3D dataset collected via drone imaging to enable advanced motion prediction and planning models.
- It employs a rigorous six-stage pipeline combining SfM, MVS, semantic segmentation, and 3D Kalman filtering for precise scene reconstruction and annotation.
- The dataset spans diverse urban, suburban, highway, and parking scenarios with 177,151 unique trajectories, supporting innovative benchmarks in autonomous driving.
The DeepScenario Open 3D Dataset (DSC3D) is an extensive, publicly available resource designed to address critical limitations in autonomous driving research involving trajectory reconstruction. DSC3D offers high-fidelity, occlusion-free 3D bounding box trajectories of traffic participants, captured using a monocular camera drone tracking pipeline. Spanning geographically and functionally diverse urban, suburban, highway, and parking environments across Europe and the United States, the dataset enables new avenues of research in motion prediction, planning, scenario mining, and simulation with unprecedented scale and annotation depth.
1. Motivations and Novel Contributions
DSC3D was conceived to offset the deficiencies of traditional ego-vehicle and fixed infrastructure datasets, which are constrained by occlusion and limited fields of view—particularly in congested urban areas. Employing aerial drones, DSC3D achieves full-scene coverage for all categories of road users, including vulnerable road users (VRUs), and annotation granularity surpassing prior benchmarks. Notable innovations relative to existing datasets include:
- Acquisition at five distinct locations (Munich inner city, Berlin federal highway, Stuttgart T-intersection, Sindelfingen parking lot, and San Francisco steep intersection), totaling 15 hours of HD video and 177,151 unique trajectories.
- The largest drone-based classification scheme of its kind, with 14 traffic participant classes, including cars, buses, trucks, motorcycles, scooters, bicycles, pedestrians, animals, and "other".
- Metric-accurate 6-DoF bounding boxes (center, dimensions, orientation) stored in local UTM coordinates, with object poses encoded in SE(3).
- Geo-referenced HD maps in OpenDRIVE format and detailed 3D meshes for each scene.
- A median positional accuracy of 4.8 cm and depth reconstruction error below 15 cm.
- Interactive online visualization and download platform (https://app.deepscenario.com).
2. Data Capture and Processing Pipeline
DSC3D employs a structured six-stage pipeline:
2.1 Data Collection
DJI drones with downward-tilted RGB cameras operate in mapping and recording passes at 25 Hz. The mapping pass captures GPS-stamped images across the whole area, while the recording pass acquires densely sampled video frames from stationary drone positions.
2.2 3D Scene Reconstruction
Structure-from-Motion (SfM) estimates initial camera poses $(\mathbf{R}_i, \mathbf{t}_i)$, followed by Multi-View Stereo (MVS) to densify the point cloud. Bundle Adjustment (BA) minimizes a compound loss combining reprojection error with a GPS prior on the camera centers:
$$E = \sum_{i,j} \left\| \pi\!\left(\mathbf{K}\,(\mathbf{R}_i \mathbf{X}_j + \mathbf{t}_i)\right) - \mathbf{x}_{ij} \right\|^2 + \lambda \sum_i \left\| \mathbf{c}_i - \mathbf{c}_i^{\mathrm{GPS}} \right\|^2,$$
where $\mathbf{c}_i = -\mathbf{R}_i^\top \mathbf{t}_i$ is the camera center in local UTM coordinates. This yields a textured triangular mesh $\mathcal{M}$.
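As a rough illustration of the reconstruction objective, the sketch below evaluates a compound loss of this kind for a single mapping image: a pinhole reprojection term plus a prior tying the optimized camera center to its GPS-derived UTM position. The pinhole model, symbol names, and weighting are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def reprojection_residuals(R, t, K, points_3d, points_2d):
    """Pixel residuals of 3D scene points under a simple pinhole projection
    (world-to-camera convention: x_cam = R @ x_world + t)."""
    cam = R @ points_3d.T + t[:, None]        # scene points in the camera frame, (3, N)
    proj = K @ cam                            # homogeneous image coordinates
    proj = (proj[:2] / proj[2]).T             # perspective division -> (N, 2) pixels
    return proj - points_2d

def compound_loss(R, t, K, points_3d, points_2d, c_gps, gps_weight=1.0):
    """Reprojection error plus a GPS prior on the camera center (assumed form)."""
    repro = np.sum(reprojection_residuals(R, t, K, points_3d, points_2d) ** 2)
    c = -R.T @ t                              # camera center in local UTM coordinates
    gps_prior = np.sum((c - c_gps) ** 2)
    return repro + gps_weight * gps_prior
```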
2.3 Ground Surface and Map Generation
Semantic segmentation of orthophotos isolates road regions; dense 3D sampling and filtering of these regions enables fitting a continuous NURBS surface ("FlexRoad") as the ground model. This model supports accurate HD mapping in OpenDRIVE, with elevation profiles, lane geometry, and junction continuity.
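Since "FlexRoad" itself is not published as code, the following is only a minimal stand-in: it fits a smoothed bivariate B-spline (SciPy exposes B-splines rather than full NURBS) to road-classified 3D points so that ground elevation can be queried at arbitrary planar positions.

```python
import numpy as np
from scipy.interpolate import SmoothBivariateSpline

def fit_ground_surface(road_points, smoothing=1.0):
    """Fit a smooth elevation surface z = f(x, y) to road-classified 3D points.
    Illustrative stand-in for the NURBS-based "FlexRoad" ground model."""
    x, y, z = road_points[:, 0], road_points[:, 1], road_points[:, 2]
    return SmoothBivariateSpline(x, y, z, s=smoothing * len(z))

# Usage sketch:
#   ground = fit_ground_surface(points)
#   elevation = ground(easting, northing)[0, 0]   # query z below an (x, y) position
```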
2.4 Frame Calibration
LoFTR/LightGlue matches frame 2D points to 3D scene points, with extrinsics $(\mathbf{R}_t, \mathbf{t}_t)$ and intrinsics $\mathbf{K}$ optimized using RANSAC. Temporal consistency is enforced by a Kalman filter.
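A hedged sketch of this calibration step using OpenCV's solvePnPRansac on 2D-3D correspondences (as a matcher such as LoFTR or LightGlue would provide); the thresholds, variable names, and conversion to the camera-to-world convention used later in this document are illustrative, not the pipeline's exact settings.

```python
import cv2
import numpy as np

def calibrate_frame(points_3d, points_2d, K, dist_coeffs=None):
    """Estimate per-frame extrinsics (R_t, t_t) from 2D-3D matches via RANSAC PnP."""
    if dist_coeffs is None:
        dist_coeffs = np.zeros(5)                      # assume negligible lens distortion
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64),                  # matched 3D scene points, (N, 3)
        points_2d.astype(np.float64),                  # matched 2D keypoints, (N, 2)
        K, dist_coeffs,
        iterationsCount=1000, reprojectionError=3.0)
    if not ok:
        raise RuntimeError("PnP failed for this frame")
    R_wc, _ = cv2.Rodrigues(rvec)                      # world-to-camera rotation
    R_t = R_wc.T                                       # camera-to-world rotation
    t_t = (-R_wc.T @ tvec).reshape(3)                  # camera position in the world frame
    return R_t, t_t, inliers
```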
2.5 Monocular 3D Object Detection
GroundMix, a monocular single-stage detector, predicts 2D boxes, class, 3D dimensions, orientation (Euler angles), depth, and projected ground center for each frame. The ground-aware center is refined using camera-to-object ray intersection with the ground mesh. Rotation matrices are constructed from the predicted Euler angles as
$$\mathbf{R}_{\mathrm{obj}}^{\mathrm{cam}} = \mathbf{R}_z(\gamma)\,\mathbf{R}_y(\beta)\,\mathbf{R}_x(\alpha)$$
and projected to world coordinates via the frame calibration:
$$\mathbf{R}_{\mathrm{obj}}^{\mathrm{world}} = \mathbf{R}_t\,\mathbf{R}_{\mathrm{obj}}^{\mathrm{cam}}.$$
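A small sketch of that composition with SciPy, assuming a Z-Y-X (yaw, pitch, roll) Euler convention; the exact convention in the detector may differ.

```python
from scipy.spatial.transform import Rotation

def object_rotation_world(yaw, pitch, roll, R_cam_to_world):
    """Build the object rotation from predicted Euler angles (camera frame)
    and express it in world coordinates via the frame calibration."""
    R_obj_cam = Rotation.from_euler("zyx", [yaw, pitch, roll]).as_matrix()
    return R_cam_to_world @ R_obj_cam
```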
2.6 3D Object Tracking
A 3D Kalman filter associates detections into tracks and estimates velocities; Rauch–Tung–Striebel smoothing refines the trajectories. Active learning with manual labels maintains detection quality.
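The sketch below shows the kind of constant-velocity Kalman filter with a Rauch–Tung–Striebel backward pass that such a tracker could apply to one track of 3D box centers; the state layout, noise values, and 25 Hz timestep are assumptions.

```python
import numpy as np

def kalman_rts_smooth(positions, dt=1.0 / 25.0, q=1.0, r=0.05):
    """Constant-velocity Kalman filter + RTS smoother over one 3D track.
    State: [x, y, z, vx, vy, vz]; observations are per-frame 3D box centers."""
    positions = np.asarray(positions, dtype=float)  # (N, 3) observed centers
    F = np.eye(6)
    F[:3, 3:] = dt * np.eye(3)                      # constant-velocity motion model
    H = np.zeros((3, 6))
    H[:, :3] = np.eye(3)                            # observe positions only
    Q, R = q * np.eye(6), r * np.eye(3)

    xs, Ps, x_pred, P_pred = [], [], [], []
    x = np.concatenate([positions[0], np.zeros(3)])
    P = np.eye(6)
    for z in positions:                             # forward filtering pass
        xp, Pp = F @ x, F @ P @ F.T + Q
        K = Pp @ H.T @ np.linalg.inv(H @ Pp @ H.T + R)
        x = xp + K @ (z - H @ xp)
        P = (np.eye(6) - K @ H) @ Pp
        x_pred.append(xp); P_pred.append(Pp); xs.append(x); Ps.append(P)

    x_smooth = list(xs)                             # backward RTS smoothing pass
    for k in range(len(positions) - 2, -1, -1):
        C = Ps[k] @ F.T @ np.linalg.inv(P_pred[k + 1])
        x_smooth[k] = xs[k] + C @ (x_smooth[k + 1] - x_pred[k + 1])
    return np.array(x_smooth)                       # smoothed positions and velocities
```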
3. Dataset Composition and Statistics
DSC3D comprises 177,151 unique tracks, some as long as 984 seconds, encompassing 5,395 km of travel.
| Class | Trajectories |
|---|---|
| Pedestrians | 140,227 |
| Bicycles | 17,736 |
| Cars | 13,241 |
| Scooters | 1,475 |
| Motorcycles | 1,054 |
| Animals | 677 |
| Trucks | 475 |
| Buses | 191 |
| Other | 2,075 |
Six additional fine-grained categories include e-scooters and delivery robots. Scene types are designated DSC-MUC, DSC-BER, DSC-SIFI, DSC-STR, and DSC-SFO, each representing distinct traffic, geomorphology, and interaction patterns. Conditions include road grades up to 20%, dense pedestrian zones, various intersection types, and complex parking events. No canonical train/validation/test split is provided; users define partitions suitable for their applications.
4. Annotation Modality and Coordinate Systems
Annotations use frame-wise metric 6-DoF bounding boxes:
- Center: $\mathbf{c} = (x, y, z)$ in world coordinates.
- Dimensions: length $l$, width $w$, height $h$ (m).
- Orientation: rotation matrix $\mathbf{R} \in SO(3)$; for ground vehicles often reduced to a yaw angle about the world-vertical axis.
Coordinate definitions:
- World frame: Local UTM east-north-up (ENU), origin per scene.
- Camera frame at time $t$: centered at the drone's optical center, with axes given by $\mathbf{R}_t$.
- Transform from camera to world: $\mathbf{x}_{\mathrm{world}} = \mathbf{R}_t\,\mathbf{x}_{\mathrm{cam}} + \mathbf{t}_t$.
- Euler decomposition: $\mathbf{R} = \mathbf{R}_z(\gamma)\,\mathbf{R}_y(\beta)\,\mathbf{R}_x(\alpha)$ (yaw, pitch, roll).
Rotation matrices are defined per standard conventions (see technical details in source).
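As a minimal illustration of these conventions, the snippet below defines a 6-DoF box record and applies the camera-to-world transform from above; the field names are illustrative rather than the dataset's exact schema.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Box3D:
    center: np.ndarray   # (x, y, z) in the local UTM / ENU world frame
    dims: np.ndarray     # (length, width, height) in metres
    R: np.ndarray        # 3x3 rotation matrix in SO(3)

def camera_to_world(p_cam, R_t, t_t):
    """Map a point from the camera frame at time t into the world frame."""
    return R_t @ p_cam + t_t
```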
5. Data Modalities and Access Patterns
Released data modalities include:
- Raw video streams (HD, 25 Hz)
- Per-frame calibration files ($\mathbf{K}$, $\mathbf{R}_t$, $\mathbf{t}_t$)
- 3D track files in ASCII/JSON/CSV: frame_id, track_id, class_id, position, dimensions, orientation, velocity, acceleration
- Scene HD maps (OpenDRIVE .xodr), 3D meshes (.obj/.ply)
- Metadata: GPS/IMU logs, intrinsic matrices, ground mesh coefficients
Data are discoverable and interactively visualized/downloaded at https://app.deepscenario.com, organized by scene (subfolders: “video,” “calibration,” “tracks,” “maps,” “meshes”).
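A minimal sketch of how per-frame track rows could be grouped into trajectories, assuming CSV headers that mirror the fields listed above; the actual column names on the platform may differ.

```python
import csv
from collections import defaultdict

def load_trajectories(csv_path):
    """Group per-frame 3D track rows into trajectories keyed by track_id.
    Assumes columns named frame_id, track_id, class_id, x, y, z (hypothetical)."""
    trajectories = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            trajectories[row["track_id"]].append({
                "frame_id": int(row["frame_id"]),
                "class_id": row["class_id"],
                "position": [float(row[k]) for k in ("x", "y", "z")],
            })
    for track in trajectories.values():             # ensure temporal order per track
        track.sort(key=lambda d: d["frame_id"])
    return dict(trajectories)
```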
6. Applications and Benchmarking
DSC3D supports multiple research paradigms:
- Motion Prediction & Planning (DeepUrban benchmark): 20-second clips from four scenes; metrics include Average Displacement Error (ADE), Final Displacement Error (FDE), and collision scores (see the sketch after this list). Supplementing nuScenes training with DeepUrban yields a ~44% improvement in ADE/FDE.
- Traffic Rule Compliance: Quantitative evaluation of gap distances, time-to-collision (TTC), and post-encroachment time (PET), demonstrating realistic human driving patterns.
- Scenario Mining: Analysis of parking maneuvers (time-to-park, direction reversals) and intersection events (TTC, PET ranges).
- Generative Reactive Agents: Models such as BehaviorGPT, Versatile Behavior Diffusion, and TrafficBots v1.5 trained on DSC3D enable closed-loop, interactive traffic simulations mirroring real scene dynamics.
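For reference, ADE and FDE reduce to simple per-timestep displacement statistics; a minimal sketch, assuming predicted and ground-truth trajectories sampled at matched timesteps:

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and final displacement errors for one predicted trajectory.
    pred, gt: arrays of shape (T, 2) or (T, 3) with matched timesteps."""
    errors = np.linalg.norm(pred - gt, axis=-1)   # per-timestep Euclidean distance
    return errors.mean(), errors[-1]              # ADE, FDE
```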
7. Limitations and Prospects
Recognized constraints and future extensions:
- Occlusions remain under dense foliage or structures, and detection of extremely low-lying objects (e.g., animals) is challenging.
- Absolute GPS alignment RMSE of 1.9 m; local 3D reconstruction error below 15 cm. Use of ground control points (GCPs) could reduce the systematic offset.
- Absence of traffic signal state annotations; future versions to integrate signal timing and signage.
- Detection and tracking rely on monocular sensing; multi-drone or LiDAR integration could improve robustness to clutter.
- No standardized data splits; researchers must establish partitions appropriate to their use case.
DSC3D constitutes a comprehensive resource for the autonomous-driving community, distinguished by its coverage, annotation precision, and public accessibility, aiming to catalyze research in safety-critical motion prediction, planning, and generative traffic simulation.