Waymo Open Dataset: Autonomous Driving Data
- The Waymo Open Dataset is a large-scale, multimodal dataset providing high-fidelity sensor data and rich annotations for autonomous driving research.
- It supports diverse tasks such as 2D/3D detection, tracking, motion forecasting, and semantic segmentation with rigorous benchmarks and detailed evaluation metrics.
- Its extensive sensor suite—including cameras, LiDAR, and HD maps—enables robust analysis of multi-agent interactions and real-world driving challenges.
The Waymo Open Dataset (WOD) is a large-scale, multimodal dataset designed to advance autonomous-driving research by providing high-fidelity, diverse, and richly annotated sensor data. Released to the academic community, WOD and its subsequent extensions have become foundational resources for perception, tracking, motion forecasting, behavior modeling, multimodal reasoning, semantic segmentation, and end-to-end learning in real-world driving scenarios.
1. Dataset Composition and Scale
The core WOD comprises 1,150 twenty-second driving segments collected by Waymo autonomous vehicles operating across urban and suburban geographies in the United States. The primary splits are 798 training scenes, 202 validation scenes, and 150 held-out test scenes. Each vehicle carries five high-resolution cameras and five calibrated spinning LiDARs, synchronized at 10 or 20 Hz, yielding approximately 1.15 million images and over 9 million 2D video boxes across the dataset (Sun et al., 2019).
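These totals follow directly from the segment structure: assuming the 10 Hz rate, 1,150 segments × 20 s × 10 Hz ≈ 230,000 frames, and 230,000 frames × 5 cameras ≈ 1.15 million images.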
Major extensions of WOD include:
- Waymo Open Motion Dataset (WOMD): >104,000 segments, 570 hours, emphasizing multi-agent interactions and motion prediction at 10 Hz (Ettinger et al., 2021).
- WOMD-LiDAR: Adds raw, compressed LiDAR range images for the full motion dataset (>574 hours across 104,000 scenes), enabling end-to-end sensor-to-prediction learning (Chen et al., 2023).
- WOD-E2E: 4,021 high-difficulty, long-tail scenario driving segments (≈12 hours), focused on rare and safety-critical situations for end-to-end driving research (Xu et al., 30 Oct 2025).
- Task-specific layers: ROAD-Waymo adds 198,000 frames with fine-grained agent-action-location ("event") labels (Khan et al., 3 Nov 2024), while WOD-PVPS (panoramic video panoptic segmentation) labels 100,000 images for 28 semantic categories with full temporal and cross-camera ID consistency (Mei et al., 2022).
2. Data Modalities, Annotations, and Access
WOD provides multimodal data for each segment, including:
- LiDAR: Five LiDARs producing overlapping point clouds with combined 360° coverage; each sweep contains 100–200k points. Calibrations ensure spatial alignment to the ego vehicle.
- Cameras: Five (in some extensions, up to eight) high-resolution, synchronously captured images, with full intrinsic and extrinsic calibration.
- HD-Maps: High-definition vectorized polylines and polygons for lane centerlines, lane boundaries (with type), crosswalks, stop signs, traffic lights, and road boundaries, all referenced to a global or UTM coordinate system. Map polylines are sampled at 0.5 m resolution (Ettinger et al., 2021); a resampling sketch follows this list.
- Trajectory and Instance Labels: Exhaustive 2D and 3D bounding boxes for vehicles, pedestrians, cyclists, and signs, including unique track IDs with temporal consistency.
- Specialized Annotations: For WOD-PVPS, pixelwise semantic and instance IDs with panoramic cross-camera and temporal association; for ROAD-Waymo, (agent, action, location) event tuples (Khan et al., 3 Nov 2024).
- Motion Forecasting: Past and future trajectories of agents, including "interactive" agent pairs and semantic interaction flags; in WOMD, the "objects_of_interest" subset targets challenging agent interactions.
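As a concrete illustration of the 0.5 m map sampling noted above, the following is a minimal sketch of resampling a vectorized lane polyline at a fixed arc-length step; the polyline values and helper name are illustrative, not part of the official tooling.

```python
import numpy as np

def resample_polyline(points: np.ndarray, step: float = 0.5) -> np.ndarray:
    """Resample a 2D/3D polyline at a fixed arc-length interval (e.g. 0.5 m)."""
    # Cumulative arc length along the original polyline.
    seg_lens = np.linalg.norm(np.diff(points, axis=0), axis=1)
    cum_len = np.concatenate([[0.0], np.cumsum(seg_lens)])
    total = cum_len[-1]
    # Target arc-length positions, spaced `step` metres apart.
    targets = np.arange(0.0, total + 1e-9, step)
    # Linearly interpolate each coordinate over arc length.
    return np.stack(
        [np.interp(targets, cum_len, points[:, d]) for d in range(points.shape[1])],
        axis=1,
    )

# Example: a hypothetical lane centerline, resampled at 0.5 m spacing.
lane = np.array([[0.0, 0.0], [3.0, 0.0], [6.0, 2.0]])
print(resample_polyline(lane, step=0.5).shape)
```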
Data are distributed in TFRecord proto format, with Python/C++ APIs, evaluation kits, and visualization tools available at the official repository.
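A minimal sketch of loading one perception segment, assuming the pip-installable waymo-open-dataset package and TensorFlow; the file name is a placeholder, and field access follows the published Frame proto.

```python
import tensorflow as tf
from waymo_open_dataset import dataset_pb2

# Path to one perception segment (placeholder; segments ship as TFRecord files).
SEGMENT_PATH = "segment-XXXX_with_camera_labels.tfrecord"

dataset = tf.data.TFRecordDataset(SEGMENT_PATH, compression_type="")
for raw in dataset:
    frame = dataset_pb2.Frame()            # one frame of the 20 s segment
    frame.ParseFromString(bytes(raw.numpy()))
    print(frame.context.name,              # segment/run identifier
          len(frame.images),               # camera images in this frame (typically 5)
          len(frame.lasers),               # LiDAR returns (typically 5 sensors)
          len(frame.laser_labels))         # 3D bounding-box labels with track IDs
    break
```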
3. Supported Research Tasks and Benchmarks
WOD and its successors support a broad suite of benchmarks:
- 2D Object Detection: Single-frame and video-based, using high-resolution imagery, targeting small-object recall and class-specific IoU thresholds (vehicle τ≥0.7, pedestrian/cyclist τ≥0.5) (Huang et al., 2020, Zhang et al., 2021).
- 3D Object Detection and Tracking: LiDAR (optionally camera-aided) multi-class detection (vehicles, pedestrians, cyclists) and tracking, with metrics such as mAP, mAPH (heading-aware), MOTA, and MOTP (Sun et al., 2019, Wang et al., 2020, Ding et al., 2020).
- Motion Forecasting: Marginal and joint multi-agent prediction over an 8 s horizon, exploiting HD-maps for context, with metrics minADE, minFDE, miss rate (MR), Overlap Rate, and forecasting mAP with mode calibration (Ettinger et al., 2021, Gu et al., 2021); a sketch of the displacement metrics and the heading-accuracy weight behind mAPH follows this list.
- Semantic & Panoptic Segmentation: 3D LiDAR point-wise classification (22/28 classes), panoramic image/video panoptic segmentation (28 classes, ID-aware spatiotemporal association), with PQ, mIoU, STQ, and wSTQ metrics (Mei et al., 2022, Wu et al., 21 Jul 2024, Wu, 6 Jan 2025).
- Action and Event Understanding: Multi-label agent-action-location event detection and tube-based video mAP (in ROAD-Waymo) (Khan et al., 3 Nov 2024).
- End-to-End Driving: Direct prediction of vehicle trajectory or control signals from multi-sensor (image, LiDAR, map) input under long-tail and rare-event scenarios, rated with Rater Feedback Score (RFS) (Xu et al., 30 Oct 2025).
- Interactive Reasoning: Q&A benchmarks (WOMD-Reasoning) of agent interaction, map attributes, and traffic law compliance (Li et al., 5 Jul 2024).
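As referenced in the motion-forecasting item above, the displacement metrics and the heading-accuracy weight used in mAPH reduce to a few lines. The sketch below follows the common definitions (minimum over K predicted modes, a fixed miss threshold, a heading weight in [0, 1]) and uses illustrative parameter values rather than the exact challenge configuration.

```python
import numpy as np

def min_ade_fde(pred: np.ndarray, gt: np.ndarray):
    """pred: (K, T, 2) predicted modes; gt: (T, 2) ground-truth trajectory."""
    dists = np.linalg.norm(pred - gt[None], axis=-1)   # (K, T) per-step errors
    min_ade = dists.mean(axis=1).min()                 # best average displacement
    min_fde = dists[:, -1].min()                       # best final displacement
    return min_ade, min_fde

def miss_rate(min_fde: float, threshold: float = 2.0) -> float:
    """A prediction 'misses' if even the best mode ends > threshold metres away.
    (The official challenge uses speed- and horizon-dependent thresholds.)"""
    return float(min_fde > threshold)

def heading_accuracy_weight(pred_yaw: float, gt_yaw: float) -> float:
    """Weight applied to true positives in APH: 1 at perfect heading, 0 at pi error."""
    diff = abs(pred_yaw - gt_yaw) % (2 * np.pi)
    diff = min(diff, 2 * np.pi - diff)                 # wrap to [0, pi]
    return 1.0 - diff / np.pi
```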
4. Unique Challenges and Design Features
WOD's scale and structure present distinctive algorithmic challenges:
- Diversity: Scenes span multiple metro areas, day/night, weather, and traffic compositions, with a coverage area of 76 km²—an order of magnitude above prior camera+LiDAR datasets (Sun et al., 2019).
- Small Object Recall: ≈70% of objects in images are below COCO “small” size threshold, demanding large-scale, context-aware detection approaches (Huang et al., 2020).
- High Object Density: 20–250 agents tracked per frame in dense downtown segments.
- Temporal Consistency: Unique ID assignment across cameras and frames, essential for video-level panoptic segmentation and multi-object tracking (Mei et al., 2022).
- Multi-Agent Interactions: Explicit joint forecasting splits, mined interacting pairs, and risk-aware labeling enable the study of negotiation, yielding, and mutual influence (Ettinger et al., 2021, Puphal et al., 30 Jun 2025).
- Long-tail Distribution: In WOD-E2E, rare scenarios (frequency <0.03%) such as construction anomalies or non-nominal hazards are systematically represented and evaluated (Xu et al., 30 Oct 2025).
5. Notable Benchmarks, State-of-the-art Methods, and Quantitative Results
WOD has spurred extensive competitive benchmarks producing state-of-the-art models in perception and planning:
| Task | Best Reported Method/Result | Reference |
|---|---|---|
| 2D Detection | FPN/Cascade RCNN/Stacked PAFPN/Double-Head, 74.43 mAP | (Huang et al., 2020) |
| Real-time 2D Detection | YOLOR + TensorRT FP16, 75.0% L1 mAP, 45.8 ms/frame | (Zhang et al., 2021) |
| 3D Detection | AFDet, Densified+Painting+TTA/WBF Ensemble, 78.49% mAP | (Ding et al., 2020) |
| 2D/3D Multi-Object Tracking | HorizonMOT: 45.13% (2D), 63.45% (3D) MOTA/L2 | (Wang et al., 2020) |
| Motion Forecasting | DenseTNT: minADE 1.0387 m, mAP: 0.3281 | (Gu et al., 2021) |
| Semantic Segmentation | PTv3-Extreme: mIoU 72.76 (ensemble, 3-frame, no-clipping) | (Wu et al., 21 Jul 2024) |
| Panoptic Segmentation (PVPS) | ViP-DeepLab: wSTQ 17.78, PQ 40.00 | (Mei et al., 2022) |
| Action/Event Detection | 12.4M event labels; tube-based video mAP degrades roughly 3x under cross-domain transfer | (Khan et al., 3 Nov 2024) |
| End-to-End Driving (E2E) | Evaluated by RFS against multi-rater reference trajectories | (Xu et al., 30 Oct 2025) |
Ablation studies confirm that improvements such as multi-scale image augmentation, anchor refinement, transformer-based scene encoding, and aggressive data mixing for segmentation deliver systematic gains. Risk-based filtering identifies additional high-value scenarios beyond legacy salience- or semantic-rule-based metrics (Puphal et al., 30 Jun 2025).
6. Extension Datasets and Novel Evaluation Protocols
Recent years have seen the emergence of extension datasets and new benchmarks:
- WOD-PVPS introduces temporally and spatially consistent panoptic segmentation across multiple cameras and frames, with the STQ and wSTQ metrics compensating for panoramic overlap (Mei et al., 2022).
- ROAD-Waymo provides action-aware event annotation covering agent, action, and spatial context, enabling evaluation on event detection and domain adaptation tasks (cross-UK/US) (Khan et al., 3 Nov 2024).
- WOMD-Reasoning adds 3 million Q&A pairs spanning map recognition, motion narratives, interaction reasoning, and intention prediction—enabling joint vision-language, intent, and multi-agent inference benchmarking (Li et al., 5 Jul 2024).
- WOD-E2E defines the Rater Feedback Score (RFS), a human-in-the-loop, trust-region-based evaluation for rare, safety-critical decision scenarios, providing a more robust target than classical displacement-based open-loop metrics (Xu et al., 30 Oct 2025).
7. Limitations, Behavioral Representation, and Future Directions
Validation against independently recorded datasets reveals systematic under-representation of short headways, abrupt braking, and high-risk behaviors in WOMD, attributed to proprietary smoothing, the 20-second clip structure, and unquantified localization error (Zhang et al., 3 Sep 2025). Behavioral models fit solely to WOMD will therefore tend to underestimate real-world variability and tail risk. The recommended practice is to validate learned behavioral models and planners against independent, high-fidelity naturalistic datasets, employing error-aware corrections such as SIMEX (simulation-extrapolation).
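As a rough illustration of such a correction, the following is a minimal SIMEX-style sketch for recovering a noise-free statistic (here, the standard deviation of headways); the noise model, replicate count, and extrapolation form are assumptions for illustration, not the procedure of the cited study.

```python
import numpy as np

def simex_std(values: np.ndarray, noise_sd: float, lambdas=(0.5, 1.0, 1.5, 2.0),
              reps: int = 200, seed: int = 0) -> float:
    """Simulation-extrapolation estimate of the noise-free standard deviation.

    Assumes measurements = truth + N(0, noise_sd^2). Extra noise of variance
    lambda * noise_sd^2 is added, the statistic is recomputed, and a quadratic
    in lambda is extrapolated back to lambda = -1 (zero measurement error).
    """
    rng = np.random.default_rng(seed)
    xs, ys = [0.0], [float(values.std())]
    for lam in lambdas:
        sims = [
            float((values + rng.normal(0.0, np.sqrt(lam) * noise_sd,
                                       size=values.shape)).std())
            for _ in range(reps)
        ]
        xs.append(lam)
        ys.append(float(np.mean(sims)))
    coeffs = np.polyfit(xs, ys, deg=2)        # fit statistic as quadratic in lambda
    return float(np.polyval(coeffs, -1.0))    # extrapolate to "no noise" at lambda = -1

# Hypothetical usage: observed headways (s) assumed to carry 0.3 s measurement noise.
rng = np.random.default_rng(1)
headways = rng.normal(2.0, 0.8, size=5000) + rng.normal(0.0, 0.3, size=5000)
print(simex_std(headways, noise_sd=0.3))
```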
Ongoing developments point to several frontiers:
- End-to-End Learning: WOMD-LiDAR and WOD-E2E encourage bypassing modular engineering with fully differentiable, sensor-to-prediction systems integrating raw point clouds, images, and maps (Chen et al., 2023, Xu et al., 30 Oct 2025).
- Reasoning and Planning: Integration of language-driven interaction reasoning (e.g., Motion-LLaVA) and traffic rule compliance are now feasible atop multi-modal Q&A datasets (Li et al., 5 Jul 2024).
- Diversity and Rare Events: Curated long-tail benchmarks and risk-filtered scenarios focus attention on the most challenging and relevant failures for autonomous system deployment, supplementing nominal-driving evaluation (Xu et al., 30 Oct 2025, Puphal et al., 30 Jun 2025).
- Multi-task and Cross-domain: Expansion into panoptic, action, and event detection, as well as cross-location domain adaptation studies (ROAD++) are directly supported (Mei et al., 2022, Khan et al., 3 Nov 2024).
WOD and its open extensions provide the computational and statistical foundation for the next generation of interpretable, robust, and principled research in real-world autonomous driving.