Waymo End-to-End Driving Dataset
- The Waymo End-to-End Driving Dataset is a curated benchmark for vision-based autonomous driving that focuses on rare, safety-critical long-tail scenarios from a multi-million-mile corpus.
- Each segment features synchronized 360° imagery from eight cameras along with detailed ego state histories and routing commands, facilitating robust multi-modal evaluation.
- A novel Rater Feedback Score (RFS) metric is introduced to reward expert-preferred, multi-modal trajectory predictions and expose limitations of conventional ADE and FDE metrics.
The Waymo End-to-End Driving Dataset (WOD-E2E) is a curated benchmark targeting vision-based end-to-end (E2E) autonomous driving systems, with particular emphasis on rare, safety-critical long-tail scenarios. Comprising 4,021 segments (approximately 12 hours) selected from a multi-million-mile corpus, WOD-E2E stresses the generalization and robustness of E2E agents by focusing on events with observed frequencies below 0.03% in daily driving. Each segment includes synchronized 360° imagery from eight cameras, high-level routing commands, and detailed ego state histories. A novel open-loop evaluation metric, the Rater Feedback Score (RFS), is introduced to more effectively assess multi-modal future trajectory prediction, using expert rater preferences as the reference standard.
1. Dataset Composition and Scenario Categorization
WOD-E2E consists of 4,021 driving segments, each of 20 s duration, amounting to approximately 12 hours of challenging driving data. Segments are stratified into 11 “long-tail clusters,” which encapsulate infrequent but operationally critical scenarios:
| Cluster | Example Subtypes |
|---|---|
| Construction | Road closures, surfacing, flaggers |
| Intersection | Unprotected turns, traffic violations |
| Pedestrians | Occlusions, erratic movement |
| Cyclists | Loss of control, group interactions |
| Multi-Lane Maneuvers | Freeway merges, overtakes |
| Single-Lane Maneuvers | Narrow-road overtakes, open car doors |
| Cut-ins | Aggressive adjacent/oncoming lane entries |
| Foreign Object Debris | Animals, furniture, hazardous roads |
| Special Vehicles | Emergency pull-over/blocking |
| Spotlight | MLLM-driven mining for novel objects |
| Others | Unclassified critical events |
The train/val/test splits are as follows:
- Training: 2,037 segments (full 5 s future trajectories visible)
- Validation: 479 segments (full 5 s futures + rater labels)
- Test (held out): 1,505 segments (12 s input, 8 s future hidden, rater labels withheld for challenge) (Xu et al., 30 Oct 2025).
2. Sensor Modalities and Data Structure
WOD-E2E provides synchronized multi-modal data for each segment:
- 360° Camera Array: Eight cameras (front, front-left, front-right, side-left, side-right, rear, rear-left, rear-right) providing JPEG imagery at 10 Hz. Each image is stored at native sensor resolution (~1920×1280) and is typically down-sampled to 768×768 for modeling. Per-camera horizontal fields of view of 70°–90° together cover the complete 360° perimeter.
- Calibration: Intrinsics and extrinsics are supplied for accurate 3D projection.
- Ego State History: 4 s of past trajectory at 4 Hz (16 points of x, y in vehicle-centered coordinates), with aligned velocity and acceleration profiles.
- Routing Commands: Enumerated as {GO_STRAIGHT, GO_LEFT, GO_RIGHT}, derived from future log headings.
- Data Organization: Segments are provided in a standardized directory layout, supporting efficient loading per modality. Training and validation splits include ground-truth futures; validation additionally includes rater preference labels; the test split omits both future trajectories and rater labels (Xu et al., 30 Oct 2025).
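For concreteness, the following sketch shows one way a per-segment record with these modalities might be represented in Python, together with an illustrative routing-command derivation from log headings. The `Segment` fields and the `routing_from_headings` helper are hypothetical stand-ins, not the official schema; the released dataset ships its own Python/TensorFlow loading utilities.

```python
import math
from dataclasses import dataclass
import numpy as np

# Hypothetical per-segment record mirroring the modalities listed above;
# field names are illustrative, not the official WOD-E2E schema.
@dataclass
class Segment:
    images: dict[str, np.ndarray]      # camera name -> (T, H, W, 3) frames decoded from JPEG at 10 Hz
    intrinsics: dict[str, np.ndarray]  # camera name -> 3x3 intrinsic matrix
    extrinsics: dict[str, np.ndarray]  # camera name -> 4x4 camera-to-vehicle transform
    past_xy: np.ndarray                # (16, 2) ego positions: 4 s at 4 Hz, vehicle-centered frame
    past_vel: np.ndarray               # (16, 2) aligned velocity profile
    past_acc: np.ndarray               # (16, 2) aligned acceleration profile
    routing: str                       # one of {"GO_STRAIGHT", "GO_LEFT", "GO_RIGHT"}

def routing_from_headings(heading_now: float, heading_future: float,
                          turn_threshold: float = math.pi / 6) -> str:
    """Illustrative derivation of the routing command from future log headings:
    a large enough signed heading change maps to a turn command (the threshold
    value and counterclockwise-positive sign convention are assumptions)."""
    delta = (heading_future - heading_now + math.pi) % (2 * math.pi) - math.pi
    if delta > turn_threshold:
        return "GO_LEFT"
    if delta < -turn_threshold:
        return "GO_RIGHT"
    return "GO_STRAIGHT"
```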
3. Rater Feedback Score (RFS) Metric
The RFS metric is designed to supersede conventional average displacement error (ADE) and final displacement error (FDE) metrics, specifically targeting the multi-modal and preference-weighted nature of rare driving events.
Formalism
Let $i$ index the rater-tagged trajectories, each assigned a rater score $s_i$. At horizons $t \in \{3, 5\}$ seconds:
- $d^{\text{lat}}_{i,t}$, $d^{\text{lng}}_{i,t}$: lateral/longitudinal errors from the model prediction to rater $i$'s trajectory.
- $T^{\text{lat}}_t$, $T^{\text{lng}}_t$: trust-region thresholds, scaled by the initial speed $v_0$.

Per-trajectory score: the prediction earns the full rater score $s_i$ when it stays within the trust region, i.e. $d^{\text{lat}}_{i,t} \le T^{\text{lat}}_t$ and $d^{\text{lng}}_{i,t} \le T^{\text{lng}}_t$ at all horizons; outside the trust region, $s_i$ is discounted according to the normalized exceedance $\max\big(d^{\text{lat}}_{i,t} / T^{\text{lat}}_t,\; d^{\text{lng}}_{i,t} / T^{\text{lng}}_t\big)$.

Aggregate best rater score: $\max_i \text{score}_i$.

Overall RFS, floored at 4 for large errors:

$$\text{RFS} = \max\Big(4,\; \max_i \text{score}_i\Big).$$

Trust-region thresholds scale with $v_0$ as $T_t = \alpha(v_0)\,\bar{T}_t$ for a speed-dependent factor $\alpha(v_0)$, applied to base thresholds $\bar{T}_t$ defined separately (lateral and longitudinal) at $t = 3$ s and $t = 5$ s.
The metric rewards agreement with optimal human-annotated plans and penalizes deviations outside the trust region, reflecting practical safety and legal requirements (Xu et al., 30 Oct 2025).
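Read procedurally, the formalism admits a compact scorer such as the minimal sketch below. The multiplicative `decay` discount outside the trust region is an assumed functional form for illustration; the authoritative metric implementation and base-threshold constants are distributed with the dataset tooling.

```python
def trajectory_rfs(errors, rater_scores, thresholds, decay=0.1):
    """Illustrative RFS-style aggregation (assumed discount form).

    errors[i][t]    = (d_lat, d_lng): prediction errors to rater i's
                      trajectory at horizon t in {3, 5} seconds.
    rater_scores[i] = s_i, the rater-assigned score for trajectory i.
    thresholds[t]   = (T_lat, T_lng): trust-region thresholds, already
                      scaled by the initial speed v0.
    """
    per_traj = []
    for errs, s_i in zip(errors, rater_scores):
        # Worst normalized exceedance over horizons and axes; a value <= 1
        # means the prediction stayed inside the trust region throughout.
        ratio = 0.0
        for t, (d_lat, d_lng) in errs.items():
            t_lat, t_lng = thresholds[t]
            ratio = max(ratio, d_lat / t_lat, d_lng / t_lng)
        # Full rater score inside the region; assumed multiplicative decay
        # per unit of exceedance outside it.
        per_traj.append(s_i if ratio <= 1.0 else s_i * decay ** (ratio - 1.0))
    # Keep the best-matching rater trajectory, floored at 4 for large errors.
    return max(4.0, max(per_traj))
```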
4. Annotation Protocol and Quality Control
Critical moments are selected by expert labelers identifying the earliest significant event. At each key frame, up to 64 candidate 5 s trajectories are auto-generated using a motion-forecasting model (e.g., Wayformer) and bucketed by lateral actions, velocity, and braking. Labelers select three representative trajectories: optimal (Rank 1), plausible alternative (Rank 2), and sub-optimal (Rank 3).
Each trajectory receives a score across five dimensions: Safety, Legality, Reaction Time, Braking Necessity, and Efficiency; major infractions incur a −2 penalty and minor infractions −1, with Rank 1 trajectories guaranteed at least 6 points. Quality is controlled via double-labeling (10% of segments; Cohen’s κ > 0.75) and written rationales for each decision, supporting consistency and reproducibility. Initial mining flags roughly 0.1% of total miles; subsequent filtering narrows this to the final 0.03% long-tail set (Xu et al., 30 Oct 2025).
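As a worked instance of the scoring arithmetic above, the snippet below starts from an assumed 10-point base and subtracts infraction penalties; the base value and the floor at zero are assumptions for illustration, not part of the published protocol.

```python
# Hypothetical rater-scoring helper: the five dimensions and the -2/-1
# penalties follow the protocol above; the 10-point base and the floor
# at 0 are assumptions.
DIMENSIONS = ("Safety", "Legality", "Reaction Time", "Braking Necessity", "Efficiency")

def rater_score(majors: dict[str, int], minors: dict[str, int]) -> int:
    score = 10  # assumed base score
    for dim in DIMENSIONS:
        score -= 2 * majors.get(dim, 0)  # each major infraction: -2
        score -= 1 * minors.get(dim, 0)  # each minor infraction: -1
    return max(score, 0)

# One major plus two minor infractions still meets the >= 6 point
# guarantee required of a Rank 1 (optimal) trajectory.
assert rater_score({"Safety": 1}, {"Efficiency": 2}) == 6
```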
5. Baseline Model Performance and Evaluation Insights
The dataset supports rigorous comparison of E2E architectures. Baselines include:
| Model | RFS ↑ | ADE (m) ↓ | Parameters |
|---|---|---|---|
| Swin-Trajectory | 7.543 | 2.814 | 36M |
| DiffusionLTF | 7.717 | 2.977 | 60M |
| UniPlan | 7.779 | 2.986 | 60M |
| Gemini1 Nano | 7.528 | 3.018 | -- |
| AutoVLA | 7.556 | 2.958 | -- |
| HMVLM | 7.736 | 3.071 | -- |
| Poutine | 7.986 | 2.741 | -- |
Key observations include a weak correlation between ADE and RFS: low displacement error does not guarantee human-preferred behavior in complex scenarios. Reinforcement learning (RL) that targets RFS outperforms ADE-optimized RL; fine-tuning on WOD-E2E yields moderate RFS gains (≈ +0.08), and fusing multi-camera inputs and sampling multiple trajectories at test time further improve results. Qualitative exemplars demonstrate that trust-region adherence yields a perfect RFS, while major errors incur the minimum-score penalty (Xu et al., 30 Oct 2025).
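A toy numeric comparison (hypothetical trajectories and threshold) illustrates the divergence: two predictions with nearly identical ADE can receive very different RFS once the trust region is taken into account.

```python
import numpy as np

# A rater's optimal 5 s trajectory at ~8 m/s, sampled at 1 Hz (meters),
# and two hypothetical predictions against it.
rater = np.stack([np.arange(1, 6) * 8.0, np.zeros(5)], axis=1)
pred_a = rater + np.array([0.0, 0.5])  # constant 0.5 m lateral offset, always in-region
pred_b = rater.copy()
pred_b[-1, 1] = 3.0                    # accurate early, but 3 m lateral drift at t = 5 s

def ade(pred):
    return float(np.linalg.norm(pred - rater, axis=1).mean())

print(f"ADE A = {ade(pred_a):.2f} m, ADE B = {ade(pred_b):.2f} m")  # 0.50 vs 0.60

# Assuming a hypothetical ~1.8 m lateral trust-region threshold at the 5 s
# horizon, prediction A keeps the full rater score while B is heavily
# discounted, even though the two ADE values are nearly equal.
```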
6. Access Patterns, Best Practices, and Future Research Directions
Data and code are accessible at https://waymo.com/open/data/e2e, with Python/TensorFlow API examples provided. For leaderboard participation, predicted 5 s trajectories are uploaded in the prescribed JSON format.
Recommended practices include:
- Pre-training on nominal datasets, then fine-tuning on WOD-E2E to maximize exposure to long-tail distributions.
- Spatial fusion of all eight camera modalities for occlusion-aware context.
- Multi-sample test-time approaches to capture prediction multi-modality (see the sketch after this list).
- Aligning RL reward with RFS rather than ADE for safety-critical outcomes.
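A minimal sketch of the last two practices combined, assuming a stochastic planner head and a learned reward model aligned to rater feedback (`planner.sample` and `rfs_proxy` are hypothetical stand-ins):

```python
import numpy as np

def select_trajectory(planner, observation, rfs_proxy, k: int = 16) -> np.ndarray:
    """Multi-sample test-time selection: draw K candidate futures and keep
    the one preferred by an RFS-aligned proxy reward."""
    candidates = [planner.sample(observation) for _ in range(k)]    # K diverse trajectory samples
    scores = [rfs_proxy(observation, traj) for traj in candidates]  # RFS-aligned scoring, not ADE
    return candidates[int(np.argmax(scores))]
```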
Potential extensions involve integration with simulators (CARLA, NavSim), addition of map-based and LiDAR channels, leveraging large vision-language models for object-centric reasoning, conducting worst-case safety analyses, and curriculum learning for incremental exposure to rarer events (Xu et al., 30 Oct 2025). A plausible implication is that multi-modal evaluation frameworks and rater-driven reference standards, as established in WOD-E2E, will catalyze new directions in reliable, generalizable E2E agent design.