DeepScenario 3D Dataset

Updated 13 November 2025
  • DeepScenario Dataset is a publicly available, occlusion-free 3D dataset collected via drone imaging to enable advanced motion prediction and planning models.
  • It employs a rigorous six-stage pipeline combining SfM, MVS, semantic segmentation, and 3D Kalman filtering for precise scene reconstruction and annotation.
  • The dataset spans diverse urban, suburban, highway, and parking scenarios with 177,151 unique trajectories, supporting innovative benchmarks in autonomous driving.

The DeepScenario Open 3D Dataset (DSC3D) is an extensive, publicly available resource designed to address critical limitations of trajectory data collection in autonomous driving research. DSC3D offers high-fidelity, occlusion-free 3D bounding box trajectories of traffic participants, captured with a monocular, drone-mounted camera tracking pipeline. Spanning geographically and functionally diverse urban, suburban, highway, and parking environments across Europe and the United States, the dataset opens new avenues of research in motion prediction, planning, scenario mining, and simulation at unprecedented scale and annotation depth.

1. Motivations and Novel Contributions

DSC3D was conceived to offset the deficiencies of traditional ego-vehicle and fixed infrastructure datasets, which are constrained by occlusion and limited fields of view—particularly in congested urban areas. Employing aerial drones, DSC3D achieves full-scene coverage for all categories of road users, including vulnerable road users (VRUs), and annotation granularity surpassing prior benchmarks. Notable innovations relative to existing datasets include:

  • Acquisition at five distinct locations (Munich inner city, a Berlin federal highway, a Stuttgart T-intersection, a Sindelfingen parking lot, and a steep San Francisco intersection), totaling 15 hours of HD video and 177,151 unique trajectories.
  • The largest classification scheme among drone-based datasets of its kind, with 14 traffic participant classes, including cars, buses, trucks, motorcycles, scooters, bicycles, pedestrians, animals, and "other".
  • Metric-accurate 6-DoF bounding boxes (center, dimensions, orientation) stored in local UTM coordinates, with object poses encoded in SE(3).
  • Geo-referenced HD maps in OpenDRIVE format and detailed 3D meshes for each scene.
  • A median positional accuracy of 4.8 cm and depth reconstruction error below 15 cm.
  • Interactive online visualization and download platform (https://app.deepscenario.com).

2. Data Capture and Processing Pipeline

DSC3D employs a structured six-stage pipeline:

2.1 Data Collection

DJI drones with downward-tilted RGB cameras operate in separate mapping and recording passes at 25 Hz. The mapping pass captures GPS-stamped images across the whole area, while the recording pass acquires densely sampled video frames from stationary drone positions.

2.2 3D Scene Reconstruction

Structure-from-Motion (SfM) estimates initial camera poses $T^{I_i}_\mathrm{init}$, followed by Multi-View Stereo (MVS) to densify the point cloud. Bundle Adjustment (BA) minimizes a compound loss:

$L = L_\mathrm{reproj} + \lambda \sum_i \|c^{I_i} - g^{I_i}_\mathrm{local}\|^2,$

where $c^{I_i}$ is the estimated camera center of image $I_i$ in local UTM coordinates and $g^{I_i}_\mathrm{local}$ is its GPS-derived position. This yields a textured triangular mesh $(\mathcal{V}, \mathcal{F})$.
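
The following is a minimal sketch of this geo-referenced bundle-adjustment objective for a single image, assuming a simple pinhole model; the NumPy formulation and function names are illustrative rather than the released pipeline.

```python
import numpy as np

def reprojection_residuals(K, R, t, points_3d, points_2d):
    """Pinhole reprojection residuals for one image: project world points
    into the image and compare with the observed 2D keypoints (pixels)."""
    cam = R @ points_3d.T + t[:, None]      # world -> camera frame, shape (3, N)
    proj = K @ cam                          # apply intrinsics
    proj = proj[:2] / proj[2]               # perspective divide, shape (2, N)
    return (proj.T - points_2d).ravel()

def ba_loss(K, R, t, points_3d, points_2d, gps_center_local, lam=1.0):
    """Compound BA loss for one image: reprojection term plus a penalty tying
    the estimated camera center c = -R^T t to its GPS-derived local UTM position."""
    L_reproj = np.sum(reprojection_residuals(K, R, t, points_3d, points_2d) ** 2)
    c = -R.T @ t                            # camera center in world coordinates
    L_geo = np.sum((c - gps_center_local) ** 2)
    return L_reproj + lam * L_geo
```

In the full objective the geo term is summed over all mapping images $I_i$, which anchors the reconstruction in the local UTM frame.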

2.3 Ground Surface and Map Generation

Semantic segmentation of orthophotos isolates road regions; dense 3D sampling and filtering of these regions enable fitting a continuous NURBS surface ("FlexRoad", Editor's term) as the ground model. This model supports accurate HD mapping in OpenDRIVE, including elevation profiles, lane geometry, and junction continuity.
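
The released pipeline fits a NURBS surface; as a simpler stand-in, the sketch below fits SciPy's smoothing bivariate spline to road-labelled 3D samples to illustrate the idea of a continuous elevation model $z = f(x, y)$. The sample data are synthetic.

```python
import numpy as np
from scipy.interpolate import SmoothBivariateSpline

# Synthetic ground samples (x, y, z) standing in for points sampled from the
# road-labelled regions of the reconstructed mesh.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 100.0, 2000)
y = rng.uniform(0.0, 100.0, 2000)
z = 0.02 * x + 0.5 * np.sin(0.05 * y) + rng.normal(0.0, 0.02, x.shape)

# Fit a smooth surface z = f(x, y); the smoothing factor s trades fidelity
# for regularity, playing a role analogous to the NURBS control-point density.
ground = SmoothBivariateSpline(x, y, z, kx=3, ky=3, s=len(x) * 0.02**2)

# Query the elevation under an arbitrary object footprint, e.g. when refining
# a detected vehicle's ground-aware 3D position.
elevation = float(ground.ev(42.0, 17.5))
print(f"ground elevation at (42.0, 17.5): {elevation:.3f} m")
```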

2.4 Frame Calibration

LoFTR/LightGlue matching associates 2D points in each frame with 3D scene points; the extrinsics $T^{F_t}$ and intrinsics $K$ are then optimized using RANSAC, and temporal consistency is enforced by a Kalman filter.
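
A hedged sketch of this calibration step using OpenCV's RANSAC-based PnP solver on matched 2D-3D correspondences; the feature matching (LoFTR/LightGlue) and the temporal Kalman filtering are omitted, and the interface shown is illustrative.

```python
import cv2
import numpy as np

def calibrate_frame(points_3d, points_2d, K, dist_coeffs=None):
    """Estimate frame extrinsics T^{F_t} = (R, t) from 2D-3D matches.

    points_3d : (N, 3) scene points in local UTM/world coordinates
    points_2d : (N, 2) matched pixel locations in the video frame
    K         : (3, 3) camera intrinsic matrix
    """
    if dist_coeffs is None:
        dist_coeffs = np.zeros(5)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64),
        points_2d.astype(np.float64),
        K.astype(np.float64),
        dist_coeffs,
        reprojectionError=3.0,   # pixel threshold separating inliers from outliers
        iterationsCount=1000,
    )
    if not ok:
        raise RuntimeError("PnP failed: not enough consistent 2D-3D matches")
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> 3x3 rotation matrix
    return R, tvec.reshape(3), inliers
```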

2.5 Monocular 3D Object Detection

GroundMix, a monocular single-stage detector, predicts 2D boxes, class, 3D dimensions, orientation (Euler angles), depth, and projected ground center for each frame. The ground-aware center is refined using camera-to-object ray intersection with the ground mesh. Rotation matrices are constructed as:

$R_c = R_Z(\psi) R_Y(\theta) R_X(\phi)$

and projected to world coordinates:

$X_w = R^{F_t} X_c^* + t^{F_t}$
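
The sketch below shows, under generic angle conventions, how the predicted Euler angles are composed into $R_c$ and how the ground-refined camera-frame center $X_c^*$ is lifted to world coordinates with the frame extrinsics; the exact conventions in the released annotations may differ.

```python
import numpy as np

def rot_x(a):
    return np.array([[1, 0, 0], [0, np.cos(a), -np.sin(a)], [0, np.sin(a), np.cos(a)]])

def rot_y(a):
    return np.array([[np.cos(a), 0, np.sin(a)], [0, 1, 0], [-np.sin(a), 0, np.cos(a)]])

def rot_z(a):
    return np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])

def box_pose_to_world(phi, theta, psi, center_cam, R_frame, t_frame):
    """Compose R_c = R_Z(psi) R_Y(theta) R_X(phi) in the camera frame and map
    the refined center X_c^* into world coordinates X_w = R^{F_t} X_c^* + t^{F_t}."""
    R_c = rot_z(psi) @ rot_y(theta) @ rot_x(phi)   # object orientation, camera frame
    R_w = R_frame @ R_c                            # object orientation, world frame
    X_w = R_frame @ center_cam + t_frame           # object center, world frame
    return R_w, X_w
```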

2.6 3D Object Tracking

A 3D Kalman filter associates detections into tracks and computes velocities; Rauch–Tung–Striebel smoothing then refines the trajectories. An active-learning loop with manual labels maintains detection quality.
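
As an illustration of the forward-filter/backward-smoother structure, the snippet below runs a constant-velocity Kalman filter with Rauch–Tung–Striebel smoothing on a single position channel; the actual tracker's state layout, data association, and noise settings are not reproduced here.

```python
import numpy as np

def kf_rts_smooth(z, dt=0.04, q=1.0, r=0.05):
    """Constant-velocity Kalman filter + RTS smoother for one coordinate.

    z  : (T,) noisy position measurements in metres
    dt : frame period in seconds (0.04 s matches 25 Hz video)
    q  : process-noise intensity, r : measurement-noise standard deviation
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])              # state transition (pos, vel)
    H = np.array([[1.0, 0.0]])                         # only position is observed
    Q = q * np.array([[dt**3 / 3, dt**2 / 2], [dt**2 / 2, dt]])
    R = np.array([[r**2]])

    T = len(z)
    x_f = np.zeros((T, 2)); P_f = np.zeros((T, 2, 2))  # filtered estimates
    x_p = np.zeros((T, 2)); P_p = np.zeros((T, 2, 2))  # one-step predictions
    x, P = np.array([z[0], 0.0]), np.eye(2)

    for t in range(T):                                 # forward filtering pass
        x_p[t], P_p[t] = F @ x, F @ P @ F.T + Q
        S = H @ P_p[t] @ H.T + R
        K = P_p[t] @ H.T @ np.linalg.inv(S)
        x = x_p[t] + K @ (z[t] - H @ x_p[t])
        P = (np.eye(2) - K @ H) @ P_p[t]
        x_f[t], P_f[t] = x, P

    x_s = x_f.copy()
    for t in range(T - 2, -1, -1):                     # backward RTS smoothing pass
        G = P_f[t] @ F.T @ np.linalg.inv(P_p[t + 1])
        x_s[t] = x_f[t] + G @ (x_s[t + 1] - x_p[t + 1])
    return x_s                                         # smoothed position and velocity
```

Run independently on the x, y, and z channels (or extended to a joint 3D state), this produces smoothed positions and velocities of the kind exported in the track files.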

3. Dataset Composition and Statistics

DSC3D comprises 177,151 unique tracks, some as long as 984 seconds, encompassing 5,395 km of travel.

Class          Trajectories
Pedestrians         140,227
Bicycles             17,736
Cars                 13,241
Scooters              1,475
Motorcycles           1,054
Animals                 677
Trucks                  475
Buses                   191
Other                 2,075

Six additional fine-grained categories include e-scooters and delivery robots. Scene types are designated DSC-MUC, DSC-BER, DSC-SIFI, DSC-STR, and DSC-SFO, each representing distinct traffic, geomorphology, and interaction patterns. Conditions include road grades up to 20%, dense pedestrian zones, various intersection types, and complex parking events. No canonical train/validation/test split is provided; users define partitions suitable for their applications.

4. Annotation Modality and Coordinate Systems

Annotations use frame-wise metric 6-DoF bounding boxes:

  • Center: $(x_w, y_w, z_w)$ in world coordinates.
  • Dimensions: length $l$, width $w$, height $h$ (m).
  • Orientation: rotation matrix $R_w \in SO(3)$; for ground vehicles often reduced to the yaw $\psi$ about the world-vertical axis.

Coordinate definitions:

  • World frame: local UTM east-north-up (ENU), with a per-scene origin.
  • Camera frame at time $t$: centered at the drone optical center, with axes given by $T^{F_t}$.
  • Transform from camera to world: $X_w = R^{F_t} X_c + t^{F_t}$.
  • Euler decomposition:

$R(\phi, \theta, \psi) = R_Z(\psi) R_Y(\theta) R_X(\phi)$

Rotation matrices are defined per standard conventions (see technical details in source).
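
As an illustrative helper consistent with the decomposition above (and assuming no gimbal lock), the following recovers the Euler angles, and in particular the vehicle heading $\psi$, from a world-frame rotation matrix.

```python
import numpy as np

def euler_zyx_from_rotation(R):
    """Recover (phi, theta, psi) from R = R_Z(psi) R_Y(theta) R_X(phi),
    assuming cos(theta) != 0. In the ENU world frame, psi is the heading
    (yaw) that ground-vehicle annotations often reduce to."""
    psi = np.arctan2(R[1, 0], R[0, 0])     # yaw about the world-vertical axis
    theta = -np.arcsin(R[2, 0])            # pitch
    phi = np.arctan2(R[2, 1], R[2, 2])     # roll
    return phi, theta, psi
```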

5. Data Modalities and Access Patterns

Released data modalities include:

  • Raw video streams (HD, 25 Hz)
  • Per-frame calibration files ($K$, $R^{F_t}$, $t^{F_t}$)
  • 3D track files in ASCII/JSON/CSV: frame_id, track_id, class_id, position, dimensions, orientation, velocity, acceleration
  • Scene HD maps (OpenDRIVE .xodr), 3D meshes (.obj/.ply)
  • Metadata: GPS/IMU logs, intrinsic matrices, ground mesh coefficients

Data can be browsed, interactively visualized, and downloaded at https://app.deepscenario.com, organized by scene (subfolders: “video,” “calibration,” “tracks,” “maps,” “meshes”).
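
As an access sketch, the snippet below loads one scene's track file and groups it into per-object trajectories; the file path and column names are assumptions inferred from the field list above and must be checked against the actual download layout.

```python
import pandas as pd

# Hypothetical path and column names; verify against the downloaded scene folder.
tracks = pd.read_csv("DSC-MUC/tracks/tracks.csv")

# Assumed columns, following the field list above: frame_id, track_id, class_id,
# x, y, z, length, width, height, yaw, vx, vy, vz, ax, ay, az
trajectories = {
    track_id: df.sort_values("frame_id")[["x", "y", "z"]].to_numpy()
    for track_id, df in tracks.groupby("track_id")
}
print(f"{len(trajectories)} trajectories loaded from the scene")
```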

6. Applications and Benchmarking

DSC3D supports multiple research paradigms:

  • Motion Prediction & Planning (DeepUrban benchmark): 20-second clips from four scenes, with metrics including Average Displacement Error (ADE), Final Displacement Error (FDE), and collision scores; supplementing nuScenes training with DeepUrban yields a ~44% improvement in ADE/FDE (see the metric sketch after this list).
  • Traffic Rule Compliance: quantitative evaluation of gap distances, time-to-collision (TTC), and post-encroachment time (PET), providing evidence of realistic human driving behavior.
  • Scenario Mining: Analysis of parking maneuvers (time-to-park, direction reversals) and intersection events (TTC, PET ranges).
  • Generative Reactive Agents: Models such as BehaviorGPT, Versatile Behavior Diffusion, and TrafficBots v1.5 trained on DSC3D enable closed-loop, interactive traffic simulations mirroring real scene dynamics.
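
For reference, ADE and FDE as used in such benchmarks can be computed as in the sketch below; the prediction horizon, sampling rate, and aggregation scheme used by DeepUrban are not specified here.

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and Final Displacement Error for one predicted trajectory.

    pred, gt : (T, 2) arrays of future positions in metres over the same horizon.
    ADE averages the per-step Euclidean error; FDE is the error at the last step.
    """
    err = np.linalg.norm(pred - gt, axis=1)
    return err.mean(), err[-1]

def batch_ade_fde(preds, gts):
    """Mean ADE/FDE over a batch of agents (lists of per-agent trajectories)."""
    pairs = [ade_fde(p, g) for p, g in zip(preds, gts)]
    ades, fdes = zip(*pairs)
    return float(np.mean(ades)), float(np.mean(fdes))
```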

7. Limitations and Prospects

Recognized constraints and future extensions:

  • Residual occlusions remain under dense foliage or overhanging structures; detection of very low-lying objects (e.g., animals) is challenging.
  • GPS absolute alignment RMSE is approximately 1.9 m, and local 3D reconstruction error is approximately 15 cm; use of ground control points (GCPs) could reduce the systematic offset.
  • Absence of traffic signal state annotations; future versions to integrate signal timing and signage.
  • Detection/tracking rely on monocular sensing; potential integration of multi-drone or LiDAR for enhanced clutter robustness.
  • No standardized data splits; researchers must establish partitions appropriate to their use case.

DSC3D constitutes a comprehensive resource for the autonomous-driving community, distinguished by its coverage, annotation precision, and public accessibility, aiming to catalyze research in safety-critical motion prediction, planning, and generative traffic simulation.
