
Waymo End-to-End Driving Dataset

Updated 4 December 2025
  • The Waymo End-to-End Driving Dataset is a curated benchmark for vision-based autonomous driving that focuses on rare, safety-critical long-tail scenarios from a multi-million-mile corpus.
  • Each segment features synchronized 360° imagery from eight cameras along with detailed ego state histories and routing commands, facilitating robust multi-modal evaluation.
  • A novel Rater Feedback Score (RFS) metric is introduced to reward expert-preferred, multi-modal trajectory predictions and expose limitations of conventional ADE and FDE metrics.

The Waymo End-to-End Driving Dataset (WOD-E2E) is a curated benchmark for vision-based end-to-end (E2E) autonomous driving systems, with particular emphasis on rare, safety-critical long-tail scenarios. It comprises 4,021 segments (approximately 12 hours of driving) selected from a multi-million-mile corpus, and stresses the generalization and robustness of E2E agents by focusing on events with observed frequencies below 0.03% in daily driving. Each segment includes synchronized 360-degree imagery from eight cameras, high-level routing commands, and detailed ego state histories. A novel open-loop evaluation metric, the Rater Feedback Score (RFS), is introduced to more effectively assess multi-modal future trajectory prediction, using expert rater preferences as the reference standard.

1. Dataset Composition and Scenario Categorization

WOD-E2E consists of 4,021 driving segments, each of 20 s duration, amounting to approximately 12 hours of challenging driving data. Segments are stratified into 11 “long-tail clusters,” which encapsulate infrequent but operationally critical scenarios:

| Cluster | Example Subtypes |
| --- | --- |
| Construction | Road closures, surfacing, flaggers |
| Intersection | Unprotected turns, traffic violations |
| Pedestrians | Occlusions, erratic movement |
| Cyclists | Loss of control, group interactions |
| Multi-Lane Maneuvers | Freeway merges, overtakes |
| Single-Lane Maneuvers | Narrow-road overtakes, open doors |
| Cut-ins | Aggressive adjacent/oncoming lane entries |
| Foreign Object Debris | Animals, furniture, hazardous roads |
| Special Vehicles | Emergency pull-over/blocking |
| Spotlight | MLLM-driven mining for novel objects |
| Others | Unclassified critical events |

The train/val/test splits are as follows:

  • Training: 2,037 segments (full 5 s future trajectories visible)
  • Validation: 479 segments (full 5 s futures + rater labels)
  • Test (held out): 1,505 segments (12 s input, 8 s future hidden, rater labels withheld for challenge) (Xu et al., 30 Oct 2025).

2. Sensor Modalities and Data Structure

WOD-E2E provides synchronized multi-modal data for each segment:

  • 360° Camera Array: Eight cameras (front, front-left, front-right, side-left, side-right, rear, rear-left, rear-right) capture JPEG imagery at 10 Hz. Each image is stored at native resolution (~1920×1280) and is typically down-sampled to 768×768 for modeling. Per-camera horizontal fields of view span 70°–90°, together covering the full 360° perimeter.
  • Calibration: Intrinsics and extrinsics are supplied for accurate 3D projection.
  • Ego State History: 4 s of past trajectory at 4 Hz (16 points of x, y in vehicle-centered coordinates), with aligned velocity and acceleration profiles.
  • Routing Commands: Enumerated as {GO_STRAIGHT, GO_LEFT, GO_RIGHT}, derived from future log headings.
  • Data Organization: Segments are provided in a standardized directory layout, supporting efficient per-modality loading. Train and validation splits include ground-truth futures; validation additionally includes rater preference labels; test omits both (Xu et al., 30 Oct 2025). A minimal sketch of this per-segment structure is given below.
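
To make the per-segment structure concrete, below is a minimal Python sketch of one way to represent a segment's modalities. All names and shapes here are illustrative assumptions, not the official WOD-E2E schema; consult the released API for the actual layout.

```python
from dataclasses import dataclass
from typing import Dict, Optional

import numpy as np

# Camera names and routing commands as described in Section 2.
CAMERAS = ("front", "front_left", "front_right", "side_left",
           "side_right", "rear", "rear_left", "rear_right")
ROUTING_COMMANDS = ("GO_STRAIGHT", "GO_LEFT", "GO_RIGHT")

@dataclass
class Segment:
    """One WOD-E2E segment (illustrative field names, not the official schema)."""
    images: Dict[str, np.ndarray]       # camera -> (T, H, W, 3) uint8 frames at 10 Hz
    intrinsics: Dict[str, np.ndarray]   # camera -> (3, 3) intrinsic matrix
    extrinsics: Dict[str, np.ndarray]   # camera -> (4, 4) pose for 3D projection
    past_xy: np.ndarray                 # (16, 2): 4 s of ego history at 4 Hz
    velocity: np.ndarray                # aligned velocity profile over the history
    acceleration: np.ndarray            # aligned acceleration profile
    routing: str                        # one of ROUTING_COMMANDS
    future_xy: Optional[np.ndarray] = None  # ground-truth future (train/val only)
```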

3. Rater Feedback Score (RFS) Metric

The RFS metric is designed to supersede the conventional average displacement error (ADE) and final displacement error (FDE) metrics, which fail to capture the multi-modal, preference-weighted nature of rare driving events.

Formalism

Let $i \in \{1,2,3\}$ index the rater-tagged trajectories, each assigned a score $s_i \in [0,10]$. At horizons $t \in \{3, 5\}$ seconds:

  • $\Delta_{\text{lat},i}(t)$, $\Delta_{\text{lng},i}(t)$: lateral and longitudinal errors between the model prediction and rater $i$'s trajectory.
  • $\tau_{\text{lat}}(t)$, $\tau_{\text{lng}}(t)$: trust-region thresholds, scaled by the initial speed $v$.

Per-trajectory score:

$$\text{score}_i(t) = s_i \cdot 0.1^{\max\left(\max\left\{\frac{\Delta_{\text{lng},i}(t)}{\tau_{\text{lng}}(t)},\ \frac{\Delta_{\text{lat},i}(t)}{\tau_{\text{lat}}(t)}\right\} - 1,\ 0\right)}$$

Aggregate best rater score:

$$\text{Score}(t) = \max_{i \in \{1,2,3\}} \text{score}_i(t)$$

Overall RFS (floored at 4 for large errors):

$$\text{RFS} = \max\left(4,\ \left\lfloor \frac{\text{Score}(3) + \text{Score}(5)}{2} \right\rfloor\right)$$

Trust-region thresholds scale with the initial speed $v$ (in m/s) as:

$$\text{scale}(v) = \begin{cases} 0.5, & v < 1.4 \\ 0.5 + 0.5\,\dfrac{v - 1.4}{11 - 1.4}, & 1.4 \le v < 11 \\ 1, & v \ge 11 \end{cases}$$

Base thresholds $\bar{\tau}$ (the effective thresholds are $\tau(t) = \text{scale}(v)\,\bar{\tau}(t)$):

  • At $t = 3$ s: $\bar{\tau}_{\text{lat}} = 1.0\,\text{m}$, $\bar{\tau}_{\text{lng}} = 4.0\,\text{m}$
  • At $t = 5$ s: $\bar{\tau}_{\text{lat}} = 1.8\,\text{m}$, $\bar{\tau}_{\text{lng}} = 7.2\,\text{m}$
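
For example, at $v = 5$ m/s, $\text{scale}(5) = 0.5 + 0.5 \cdot \frac{5 - 1.4}{11 - 1.4} \approx 0.69$, so the effective 3 s thresholds are $\tau_{\text{lat}} \approx 0.69$ m and $\tau_{\text{lng}} \approx 2.75$ m.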

The metric rewards agreement with optimal human-annotated plans and penalizes deviations outside the trust region, reflecting practical safety and legal requirements (Xu et al., 30 Oct 2025).
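
As a concrete reading of the formulas above, the following is a minimal Python sketch of the RFS computation. The dictionary-based inputs are illustrative assumptions; the official challenge evaluation code should be treated as authoritative.

```python
import math

def scale(v: float) -> float:
    """Speed-dependent scaling of the trust-region thresholds."""
    if v < 1.4:
        return 0.5
    if v < 11.0:
        return 0.5 + 0.5 * (v - 1.4) / (11.0 - 1.4)
    return 1.0

# Base (tau_lat, tau_lng) thresholds in meters at horizons t = 3 s and t = 5 s.
BASE_TAU = {3: (1.0, 4.0), 5: (1.8, 7.2)}

def per_trajectory_score(s_i: float, d_lat: float, d_lng: float,
                         tau_lat: float, tau_lng: float) -> float:
    """score_i(t): the full rater score inside the trust region, decayed by a
    factor of 0.1 per unit of worst-case normalized excess outside it."""
    excess = max(max(d_lng / tau_lng, d_lat / tau_lat) - 1.0, 0.0)
    return s_i * 0.1 ** excess

def rfs(rater_scores: dict, errors: dict, v: float) -> float:
    """rater_scores: {i: s_i} for i in {1, 2, 3}, each s_i in [0, 10].
    errors: {(i, t): (d_lat, d_lng)} in meters for horizons t in {3, 5} s.
    v: initial ego speed in m/s."""
    best = {}
    for t, (base_lat, base_lng) in BASE_TAU.items():
        tau_lat, tau_lng = base_lat * scale(v), base_lng * scale(v)
        best[t] = max(
            per_trajectory_score(rater_scores[i], *errors[(i, t)],
                                 tau_lat, tau_lng)
            for i in rater_scores
        )
    # Average the per-horizon best scores, floor per the formula above,
    # and clamp below at 4.
    return max(4, math.floor((best[3] + best[5]) / 2))
```

A prediction lying inside both trust regions at both horizons has zero excess, so the best rater's full score is recovered, matching the qualitative behavior described in Section 5.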

4. Annotation Protocol and Quality Control

Critical moments are selected by expert labelers identifying the earliest significant event. At each key frame, up to 64 candidate 5 s trajectories are auto-generated using a motion-forecasting model (e.g., Wayformer) and bucketed by lateral actions, velocity, and braking. Labelers select three representative trajectories: optimal (Rank 1), plausible alternative (Rank 2), and sub-optimal (Rank 3).

Each trajectory receives a score over five dimensions: Safety, Legality, Reaction Time, Braking Necessity, and Efficiency; major infractions incur a –2 penalty and minor infractions –1, with Rank 1 trajectories guaranteed at least 6 points. Quality is controlled via double-labeling (10% of segments; Cohen’s κ > 0.75) and written rationales for each decision, maintaining consistency and reproducibility. Initial mining flags roughly 0.1% of total miles; subsequent filtering narrows this to the final 0.03% long-tail set (Xu et al., 30 Oct 2025).
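
The rubric arithmetic can be sketched as follows; starting each trajectory at the 10-point maximum and deducting per infraction is an assumption consistent with $s_i \in [0,10]$, not a documented detail of the protocol.

```python
RUBRIC_DIMENSIONS = ("Safety", "Legality", "Reaction Time",
                     "Braking Necessity", "Efficiency")

def rubric_score(infractions: dict, rank: int) -> int:
    """infractions: {dimension: "major" | "minor"} for dimensions with issues.
    Assumes a 10-point starting score with -2 per major and -1 per minor
    infraction (this aggregation rule is an assumption, not from the paper)."""
    deduction = {"major": 2, "minor": 1}
    score = 10 - sum(deduction[severity] for severity in infractions.values())
    score = max(0, score)
    if rank == 1:
        # Rank 1 (optimal) trajectories are guaranteed at least 6 points.
        score = max(6, score)
    return score
```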

5. Baseline Model Performance and Evaluation Insights

The dataset supports rigorous comparison of E2E architectures. Baselines include:

| Model | RFS | ADE (m) | Parameters |
| --- | --- | --- | --- |
| Swin-Trajectory | 7.543 | 2.814 | 36M |
| DiffusionLTF | 7.717 | 2.977 | 60M |
| UniPlan | 7.779 | 2.986 | 60M |
| Gemini1 Nano | 7.528 | 3.018 | -- |
| AutoVLA | 7.556 | 2.958 | -- |
| HMVLM | 7.736 | 3.071 | -- |
| Poutine | 7.986 | 2.741 | -- |

Key observations include a weak correlation between ADE and RFS: low displacement error does not guarantee human-preferred behavior in complex scenarios. Reinforcement learning (RL) with an RFS-aligned reward outperforms ADE-optimized RL; fine-tuning on WOD-E2E yields moderate RFS gains (≈ +0.08), and fusing multi-camera inputs and test-time sampling further improve results. Qualitative exemplars show that trajectories staying within the trust region achieve a perfect RFS, while major errors incur the minimum score of 4 (Xu et al., 30 Oct 2025).

6. Access Patterns, Best Practices, and Future Research Directions

Data and code are accessible at https://waymo.com/open/data/e2e, with Python/TensorFlow API examples provided. For leaderboard participation, predicted 5 s trajectories are uploaded in prescribed JSON format.

Recommended practices include:

  • Pre-training on nominal datasets, then fine-tuning on WOD-E2E to maximize exposure to long-tail distributions.
  • Spatial fusion of all eight camera modalities for occlusion-aware context.
  • Multi-sample test-time approaches to capture prediction multi-modality (a simple consensus heuristic is sketched after this list).
  • Aligning RL reward with RFS rather than ADE for safety-critical outcomes.
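
As one concrete instance of the multi-sample recommendation above, the sketch below samples several candidate futures and returns the medoid, i.e. the candidate closest on average to all others. This consensus heuristic is an illustrative assumption, not a method prescribed by the dataset authors.

```python
import numpy as np

def select_consensus_trajectory(samples: np.ndarray) -> np.ndarray:
    """samples: (K, T, 2) candidate future trajectories from a stochastic
    planner. Returns the medoid sample: the one minimizing the summed
    average displacement error (ADE) to all other samples."""
    diffs = samples[:, None] - samples[None, :]         # (K, K, T, 2)
    ade = np.linalg.norm(diffs, axis=-1).mean(axis=-1)  # (K, K) pairwise ADE
    return samples[ade.sum(axis=1).argmin()]
```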

Potential extensions involve integration with simulators (CARLA, NavSim), addition of map-based and LiDAR channels, leveraging large vision-LLMs for object-centric reasoning, conducting worst-case safety analyses, and curriculum learning for incremental exposure to rarer events (Xu et al., 30 Oct 2025). A plausible implication is that multi-modal evaluation frameworks and rater-driven reference standards, as established in WOD-E2E, will catalyze new directions in reliable, generalizable E2E agent design.
