Waymo Open Motion Dataset Overview

Updated 27 October 2025
  • WOMD is a large-scale, high-resolution dataset featuring multimodal sensor data, detailed annotations, and diverse urban scenarios.
  • It supports key tasks such as perception, motion forecasting, and semantic segmentation with rigorous benchmarks and evaluation metrics.
  • Extensions like WOMD-LiDAR and WOMD-Reasoning further enhance its scope, enabling advanced research in detection, tracking, and interaction reasoning.

The Waymo Open Motion Dataset (WOMD) is a large-scale, high-resolution collection of multimodal sensor data, annotations, and derived benchmarks for advancing motion forecasting, detection, and scene understanding in autonomous driving research. Designed to address the limitations of previous datasets in diversity, scale, and annotation quality, WOMD provides meticulously annotated data for developing and evaluating core perception and planning algorithms, with a focus on the challenges posed by interactive and dynamic traffic environments.

1. Dataset Composition, Structure, and Sensor Suite

WOMD originally consists of over 100,000 curated scenes, with each scene sampled as a 20-second segment at 10 Hz, yielding 200 frames per scene. Scenes were recorded across six U.S. cities, capturing approximately 570 hours and spanning 1,750 km of urban and suburban roadways (Ettinger et al., 2021). The dataset comprises high-quality, time-synchronized, and spatially calibrated data from five LiDAR sensors and five high-resolution pinhole cameras mounted on autonomous vehicles. LiDAR data are released as range images, including both returns per pulse and rich per-pixel attributes (range, intensity, elongation, vehicle pose), while camera images are downsampled JPEGs with rolling shutter corrections (Sun et al., 2019).
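
For orientation, the following is a minimal sketch of loading motion scenarios with the official waymo-open-dataset Python toolkit. The import path and proto fields follow the published toolkit, but the file name is a placeholder and only a few commonly used attributes are touched.

```python
# Minimal sketch: iterate WOMD motion scenarios stored as TFRecords of
# serialized Scenario protos. Assumes the official waymo-open-dataset
# package is installed; the file name below is a placeholder.
import tensorflow as tf
from waymo_open_dataset.protos import scenario_pb2

def iterate_scenarios(tfrecord_path):
    """Yield parsed Scenario protos, one per curated scene."""
    for record in tf.data.TFRecordDataset(tfrecord_path):
        scenario = scenario_pb2.Scenario()
        scenario.ParseFromString(record.numpy())
        yield scenario

for scenario in iterate_scenarios("womd_training.tfrecord"):
    ego = scenario.tracks[scenario.sdc_track_index]     # the self-driving car's track
    valid_states = [s for s in ego.states if s.valid]   # one state per time step
    print(scenario.scenario_id, len(scenario.tracks), len(valid_states))
    break
```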

The scale and diversity of WOMD are significant: the dataset achieves a 15× higher diversity index than other LiDAR+camera datasets according to a geographic coverage metric (76 km² of coverage, computed by dilating ego-poses with a 150 m radius) (Sun et al., 2019). Scenes were systematically mined to include rare and challenging interaction scenarios—such as merges, unprotected turns, and pedestrian crossings—using semantic predicates executed within a relational data mining pipeline (Ettinger et al., 2021).
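
The coverage metric itself is simple to reproduce in principle: dilate each ego-pose by 150 m, union the resulting discs, and measure the area. The sketch below does this with Shapely on synthetic poses; the published 76 km² figure naturally comes from the full dataset.

```python
# Sketch of the geographic-coverage computation: buffer each ego-pose by a
# 150 m radius, union the discs, and report the covered area in km^2.
# Poses here are synthetic planar coordinates in metres.
import numpy as np
from shapely.geometry import Point
from shapely.ops import unary_union

rng = np.random.default_rng(0)
ego_poses = rng.uniform(0, 5000, size=(2000, 2))    # stand-in drive log (m)

coverage = unary_union([Point(x, y).buffer(150.0) for x, y in ego_poses])
print(f"covered area ~ {coverage.area / 1e6:.1f} km^2")
```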

The dataset structure supports both standard splits (focused on “nontrivial” single-agent motion) and specialized splits emphasizing interactive agent pairs or group scenarios. Extended datasets (e.g., WOMD-LiDAR (Chen et al., 2023)) augment WOMD with raw, calibrated LiDAR point clouds for each scene, effectively increasing the data scale by approximately 100× compared to the initial Waymo Open Dataset (WOD).

2. Annotation Protocols, Label Types, and Quality Assurance

Annotations in WOMD are generated through a multi-phase, high-accuracy, offboard auto-labeling system. For motion forecasting, agent states—including 3D bounding box (center, heading, length, width, height), velocity vector, and validity flag—are produced for each agent at each time step (Ettinger et al., 2021). These are generated by (1) 3D object proposal via LiDAR detection, (2) multi-agent tracking across frames, and (3) track-centric refinement using full temporal context.

For perception tasks, LiDAR points are annotated with 3D boxes (center coordinates, dimensions, and heading, i.e., 7-DOF), with each track identified by a globally consistent ID across frames (Sun et al., 2019). For images, axis-aligned 2D bounding boxes are annotated (center pixel coordinates, width, and height, i.e., 4-DOF), with consistent temporal identity. All annotations undergo extensive multi-stage manual and automated verification using industrial-strength labeling tools.
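
As an illustration of the label geometry (not the dataset's actual proto schema), the two box types can be written as simple records:

```python
# Illustrative label records mirroring the 7-DOF LiDAR boxes and 4-DOF image
# boxes described above; field names are chosen for readability and are not
# the dataset's proto field names.
from dataclasses import dataclass

@dataclass
class LidarBox3D:        # 7-DOF, metres and radians
    track_id: str        # globally consistent across frames
    center_x: float
    center_y: float
    center_z: float
    length: float
    width: float
    height: float
    heading: float

@dataclass
class ImageBox2D:        # 4-DOF, axis-aligned, pixel coordinates
    track_id: str
    center_x: float
    center_y: float
    width: float         # horizontal extent
    height: float        # vertical extent
```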

Emerging derivatives, such as ROAD–Waymo, further enrich the dataset with spatiotemporal annotations for agents, action classes, and locations (totaling 12.4M labels across 198,000 frames and 54,000 agent tubes), leveraging a combination of manual labeling and automated SAT-solver–based constraint enforcement with more than 250 semantic rules to assure logical consistency (Khan et al., 3 Nov 2024). WOMD-Reasoning expands annotation into the language domain, adding 3 million Q&A language pairs describing scene elements, interactions, and intentions through an automated rule-based and chain-of-thought prompting pipeline (Li et al., 5 Jul 2024).
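
The SAT-based consistency idea can be illustrated with a toy rule. The snippet below uses the z3 solver purely as an example; it is not the ROAD–Waymo pipeline and encodes only one mutual-exclusion constraint of the kind such a rule set contains.

```python
# Toy version of SAT-based label consistency checking: encode the rule that
# a traffic light cannot be labeled red and green at the same time, then
# test candidate label assignments against it.
from z3 import Bool, Solver, Not, And, sat

red, green = Bool("light_red"), Bool("light_green")
rules = [Not(And(red, green))]          # semantic rule: mutually exclusive states

def labels_consistent(assignment):
    """assignment: list of (Bool variable, bool value) pairs."""
    s = Solver()
    s.add(*rules)
    for var, value in assignment:
        s.add(var if value else Not(var))
    return s.check() == sat

print(labels_consistent([(red, True), (green, False)]))   # True
print(labels_consistent([(red, True), (green, True)]))    # False: rule violated
```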

3. Supported Tasks, Baselines, and Evaluation Metrics

WOMD provides extensive benchmarks for key autonomous driving tasks.

  • Perception: 2D/3D object detection and multi-object tracking are primary benchmarks. For 3D detection, the baseline is a PointPillars implementation (voxelizing the point cloud, then processing via a CNN-based region proposal network); for tracking, Hungarian matching on “1 – IoU” is employed with Kalman filter state propagation (tracking 10 state variables: position, dimensions, heading, velocity) (Sun et al., 2019). 2D tracking baselines include adaptations of Faster R-CNN-based Tracktor algorithms. A sketch of the Hungarian matching step follows this list.
  • Motion forecasting: Both single-agent and joint/marginal multi-agent motion prediction are supported. Baselines include constant-velocity predictors and deep learning models, e.g., LSTM encoders (using agent and map features) (Ettinger et al., 2021). State-of-the-art solutions—such as DenseTNT (anchor-free, dense goal probability estimation), transformer-based MTR-A (motion query pairs for intention localization and iterative refinement), and MotionPerceiver (recursive latent state self-attention for occupancy forecasting)—anchor benchmarking and provide open-source reference implementations (Gu et al., 2021, Shi et al., 2022, Ferenczi et al., 2023).
  • Semantic Segmentation: Multi-modal methods such as vFusedSeg3D combine DLA34- and Point Transformer V3–based backbones with geometric and semantic feature fusion modules, achieving state-of-the-art mIoU (72.46%) on validation (Amjad et al., 9 Aug 2024).
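
As referenced in the perception item above, the tracking baseline associates existing tracks with new detections by minimizing total (1 − IoU) cost. The sketch below does this with SciPy's Hungarian solver on illustrative 2D boxes; the gating threshold is hypothetical, and the full baseline additionally propagates a 10-variable state with a Kalman filter.

```python
# Hungarian association on a (1 - IoU) cost matrix between tracked boxes and
# new detections; boxes are axis-aligned (x1, y1, x2, y2) for simplicity.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, max_cost=0.7):
    """Return (track_idx, det_idx) pairs whose matching cost passes the gate."""
    cost = np.array([[1.0 - iou_2d(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]

tracks = [(0, 0, 2, 2), (5, 5, 7, 7)]
detections = [(0.2, 0.1, 2.2, 2.1), (9, 9, 11, 11)]
print(associate(tracks, detections))   # [(0, 0)]; the second track goes unmatched
```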

Key evaluation metrics include:

  • Detection: Average Precision (AP), Average Precision with Heading (APH), multi-object tracking measures (MOTA, MOTP).
  • Forecasting: Minimum Average Displacement Error (minADE), Minimum Final Displacement Error (minFDE), Overlap Rate (OR), Miss Rate (MR), Mean Average Precision (mAP), Soft Intersection-over-Union (Soft IoU). A worked minADE/minFDE example follows below.
  • Language Integration: Task-specific improvements in MR₆ and minFDE₆ when incorporating language cues, e.g.,

MR₆ (without language) = 11.44% → MR₆ (with language) = 10.28%

minFDE₆ (without language) = 1.16 → minFDE₆ (with language) = 1.08

(Li et al., 5 Jul 2024).
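
To make the displacement metrics concrete, the sketch below computes minADE and minFDE over K = 6 candidate trajectories (the number behind the MR₆/minFDE₆ figures above) using synthetic data:

```python
# minADE: lowest mean L2 error across the K predicted trajectories;
# minFDE: lowest endpoint L2 error. Shapes and numbers are synthetic.
import numpy as np

def min_ade_fde(pred, gt):
    """pred: (K, T, 2) candidate trajectories, gt: (T, 2) ground truth."""
    dists = np.linalg.norm(pred - gt[None], axis=-1)   # (K, T) per-step errors
    return dists.mean(axis=1).min(), dists[:, -1].min()

rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(size=(80, 2)), axis=0)        # 8 s horizon at 10 Hz
scales = np.array([0.3, 1.0, 2.0, 0.5, 1.5, 0.8])[:, None, None]
pred = gt[None] + rng.normal(size=(6, 80, 2)) * scales  # K = 6 candidates
min_ade, min_fde = min_ade_fde(pred, gt)
print(f"minADE = {min_ade:.2f}, minFDE = {min_fde:.2f}")
```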

4. Extensions, Derived Datasets, and Data Quality Improvements

WOMD has become the nucleus for a family of extended datasets and quality improvements:

  • WOMD-LiDAR: Integrates well-synchronized, high-quality LiDAR point clouds into WOMD, applying delta encoding to compress over 20TB of raw LiDAR into manageable storage (∼2.3TB). LiDAR embeddings extracted via pre-trained models like SWFormer yield ~2% mAP improvement for motion forecasting (Chen et al., 2023).
  • WOMD-Reasoning: Adds 3 million language Q&A exemplars enabling interaction reasoning, intent description, and rule-based interaction annotation (Li et al., 5 Jul 2024).
  • ROAD–Waymo: Enhances video frames and bounding box tracks with agent, action, and location multi-labels, subject to automated SAT-solver conflict checks against 251 domain-relevant requirements, ensuring that logical impossibilities (e.g., a traffic light labeled “red” and “green” simultaneously) do not appear (Khan et al., 3 Nov 2024).
  • Interaction Dataset: Extracts >37,000 AV–traffic light and >44,000 AV–stop sign interaction trajectories using rule-based segment identification and wavelet-based denoising (db6 DWT), with anomaly rates in acceleration and jerk profiles reduced to near-zero (Li et al., 21 Jan 2025). A denoising sketch follows this list.
  • Traffic Signal Correction: A fully automated procedure for traffic signal state imputation/rectification uses vehicle kinematics (distance to stop line, acceleration/velocity) and map-derived intersection geometry, reducing signal state uncertainty (71.7% imputation rate) and the red-light violation rate from 15.7% to 2.9% (Yan et al., 8 Jun 2025).
  • Risk Filtering and Retrieval: Probabilistic risk models identify high-value/complex “first-order” and “second-order” driving situations to support more robust AV evaluation (Puphal et al., 30 Jun 2025), while open-vocabulary retrieval frameworks (WayMoCo) enable targeted mining of rare VRU scenarios via language-aligned multimodal embeddings (Englmeier et al., 1 Aug 2025).
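
As noted in the Interaction Dataset entry above, trajectories are smoothed with a db6 discrete wavelet transform. The snippet below is a generic soft-thresholding sketch of that idea using PyWavelets; the threshold rule (the universal threshold) is an assumption, not necessarily the setting used in (Li et al., 21 Jan 2025).

```python
# Generic db6 wavelet denoising of a noisy speed profile: decompose,
# soft-threshold the detail coefficients, reconstruct.
import numpy as np
import pywt

def denoise_db6(signal, level=4):
    coeffs = pywt.wavedec(signal, "db6", level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745          # noise estimate
    thresh = sigma * np.sqrt(2.0 * np.log(len(signal)))      # universal threshold
    coeffs[1:] = [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, "db6")[: len(signal)]

t = np.linspace(0.0, 20.0, 200)                              # 20 s at 10 Hz
noisy_speed = 10 + 3 * np.sin(0.4 * t) + np.random.default_rng(1).normal(0, 0.5, t.size)
smoothed = denoise_db6(noisy_speed)
```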

5. Generalization, Domain Shift, and Limitations

WOMD explicitly addresses generalization and domain gap analysis. Benchmarks show that models trained in one geography suffer significant performance drops in another; for example, vehicle APH declines by 8 points when switching from San Francisco to the suburban Phoenix + Mountain View domain (Sun et al., 2019). Such findings motivate continued research on domain adaptation, transfer learning, and robust generalization across varied urban contexts.

However, external validation raises concerns about dataset limitations:

  • Behavioral envelope analyses using an independent, naturalistic Level 4 AV dataset from Phoenix, Arizona (PHX), reveal that WOMD underrepresents short headways and abrupt decelerations in intersection discharges, car-following, and lane-changing events. Dynamic Time Warping (DTW) distances and Kolmogorov–Smirnov statistics consistently place PHX behaviors outside WOMD’s coverage, suggesting that models trained solely on WOMD may systematically underestimate behavioral variability, risk, and complexity (Zhang et al., 3 Sep 2025). A distribution-comparison sketch follows this list.
  • WOMD’s segmentation into discrete 20-second clips and the absence of detailed error quantification or disclosure of perception/post-processing algorithms may obscure rare, high-risk, or discontinuous behaviors important for modeling real-world risk.
  • Studies advise cautious use for behavioral modeling, particularly for safety-critical policy learning, and recommend external validation against broader naturalistic data and application of error correction methods (e.g., SIMEX) (Zhang et al., 3 Sep 2025).
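
The distributional comparison behind the first point can be illustrated with SciPy's two-sample Kolmogorov–Smirnov test; the headway samples below are synthetic stand-ins, not the PHX or WOMD measurements.

```python
# Two-sample KS test between headway distributions from two sources.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
womd_headways = rng.lognormal(mean=0.9, sigma=0.35, size=5000)   # seconds
phx_headways = rng.lognormal(mean=0.7, sigma=0.45, size=5000)    # shorter, more varied

stat, p_value = ks_2samp(womd_headways, phx_headways)
print(f"KS statistic = {stat:.3f}, p = {p_value:.2e}")
# A large statistic with a tiny p-value indicates the headway distributions
# differ, i.e. one dataset's behavioral envelope does not cover the other's.
```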

6. Access, Community Benchmarks, and Ongoing Impact

WOMD and its extensions are generally openly available: data, code, documentation, and benchmarks are provided at www.waymo.com/open and various affiliated GitHub repositories (Sun et al., 2019, Chen et al., 2023, Khan et al., 3 Nov 2024, Yan et al., 8 Jun 2025, Puphal et al., 30 Jun 2025). Leaderboards and standardized splits facilitate global comparison.

Community benchmarks—such as the Waymo Open Dataset Challenges—have spurred advancements in perception and prediction (e.g., top-performing models including DenseTNT, PV-RCNN, VFusedSeg3D, MotionPerceiver) (Gu et al., 2021, Shi et al., 2020, Amjad et al., 9 Aug 2024, Ferenczi et al., 2023). Open-sourcing of derived risk metrics, improved signal data, language annotations, and retrieval indices further enable experimentation, reproducibility, and cross-comparison (Li et al., 5 Jul 2024, Yan et al., 8 Jun 2025, Puphal et al., 30 Jun 2025, Englmeier et al., 1 Aug 2025).

7. Research Significance and Future Directions

WOMD has reshaped the landscape of autonomous driving benchmarks by (1) combining scale, diversity, and high-fidelity multi-modal sensors; (2) adopting rigorous, reproducible, and ever-expanding annotation and evaluation procedures; and (3) facilitating the study of detection, tracking, forecasting, interaction reasoning, and language grounding in traffic scenes.

Future directions include extending end-to-end models to exploit raw sensor streams (bypassing intermediate abstractions), integrating reasoning and language for explainable AVs, supporting open-set perception with unsupervised or self-supervised labels, and systematically benchmarking robustness under domain shift and behavioral validation with richly naturalistic data (Najibi et al., 2022, Chen et al., 2023, Li et al., 5 Jul 2024, Zhang et al., 3 Sep 2025).

The dataset not only supports technical progress in perception and prediction but also fosters the development, evaluation, and critical appraisal of safe, reliable, and generalizable autonomous driving systems in complex, real-world environments.
