Waymo End-to-End Driving Dataset (WOD-E2E)

Updated 10 December 2025
  • The Waymo End-to-End Driving Dataset (WOD-E2E) is a large-scale, open benchmark focused on vision-based end-to-end driving in rare, safety-critical scenarios.
  • It provides complete 360° visual context, navigation intents, and detailed vehicle states to capture complex long-tail driving events.
  • A novel Rater Feedback Score (RFS) is introduced to align model performance with human safety and preference criteria, advancing robust autonomous driving research.

The Waymo End-to-End Driving Dataset (WOD-E2E) is a large-scale, open benchmark specifically designed for the evaluation and development of vision-based end-to-end (E2E) driving systems in challenging, safety-critical, and rare “long-tail” road scenarios. Unlike prior benchmarks dominated by routine driving, WOD-E2E explicitly curates data from less than 0.03% of real driving events, capturing the complexity and multi-modality inherent in dangerous or uncommon on-road situations. Each segment in WOD-E2E provides full 360° visual context, navigation intent, detailed vehicle state, and, for validation, explicit human rating of plausible multimodal trajectories. The dataset also introduces a novel evaluation paradigm—the Rater Feedback Score (RFS)—grounded in human judgment and amenable to multimodal assessment, thereby aligning model evaluation more closely with real-world safety and preference criteria (Xu et al., 30 Oct 2025).

1. Motivation and Distinction from Existing Benchmarks

Vision-based E2E autonomous driving has traditionally leveraged datasets such as nuScenes, NAVSIM, WOMD, and CoVLA, which are primarily populated with everyday, nominal driving events. However, in such corpora, safety-critical “long-tail” scenarios (e.g., occluded pedestrian emergences, debris encounters, atypical maneuvers) are exceedingly rare, leading to models insufficiently exposed to or evaluated under true high-risk situations. WOD-E2E was constructed to remedy this deficit by mining 6.4 million miles of Waymo’s real driving logs for challenging segments that satisfy both a manually defined taxonomy of eleven long-tail event types and the constraint of corpus frequency below 0.03%. This ensures coverage of events that pose outsized risk but are nearly invisible in generic data collections (Xu et al., 30 Oct 2025).

2. Dataset Composition and Scenario Taxonomy

Each WOD-E2E segment has a fixed 20 s duration and is drawn from one of eleven pre-defined scenario clusters: Construction zones, Intersections, occluded or erratic Pedestrians, Cyclists in distress, Multi-lane maneuvers, Single-lane maneuvers, aggressive Cut-ins, Foreign Object Debris, Special Vehicles (e.g., emergency, oversized), a manually curated "Spotlight" category, and a residual "Others" bin. Post-filtering, the dataset consists of 4,021 segments (roughly twelve hours of continuous driving), split into 2,037 training, 479 validation, and 1,505 held-out test segments. Over half of the data falls in the Intersections, Foreign Object Debris, and Pedestrian categories. Each segment contains:

  • 8-camera, 360° visual coverage at 10 Hz (JPEG, with calibrated intrinsics/extrinsics, right-handed vehicle and sensor frames).
  • High-level routing commands as enums {GO_STRAIGHT, GO_LEFT, GO_RIGHT}, derived from route-change computation.
  • Ego-vehicle states: 4 s past trajectory (4 Hz), velocity/acceleration, and, for non-test splits, ground-truth 5 s future trajectory.
  • Scenario cluster label and, for validation, human preference scores on candidate trajectories.
| Segment Duration | Cameras | Scenario Clusters | Split (Train / Val / Test) | Total Segments |
|---|---|---|---|---|
| 20 s | 8 | 11 | 2,037 / 479 / 1,505 | 4,021 |
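To make the per-segment contents concrete, a rough sketch of how one segment record might be represented in Python follows; the field names, shapes, and types here are illustrative assumptions rather than the released schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

import numpy as np


class RoutingCommand(Enum):
    # High-level navigation intents provided with each segment.
    GO_STRAIGHT = 0
    GO_LEFT = 1
    GO_RIGHT = 2


@dataclass
class WodE2ESegment:
    """Hypothetical container for one 20 s WOD-E2E segment (not the official schema)."""
    segment_id: str
    scenario_cluster: str                    # one of the 11 long-tail clusters
    images: np.ndarray                       # (T, 8, H, W, 3): 8-camera frames at 10 Hz
    camera_intrinsics: np.ndarray            # (8, 3, 3) calibrated intrinsics
    camera_extrinsics: np.ndarray            # (8, 4, 4) sensor-to-vehicle transforms
    routing_command: RoutingCommand
    past_trajectory: np.ndarray              # (16, 2): 4 s of past x/y waypoints at 4 Hz
    velocity: np.ndarray                     # (T, 2) ego velocity
    acceleration: np.ndarray                 # (T, 2) ego acceleration
    future_trajectory: Optional[np.ndarray]  # 5 s ground-truth future; withheld on the test split
    rater_scores: Optional[np.ndarray]       # human preference scores; validation split only
```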

3. Annotation Protocol and Human Preference Judgments

WOD-E2E introduces a multi-stage annotation protocol:

  1. Critical-Moment Selection: Expert raters review the entire 20 s segment, marking the earliest frame where a planning-critical decision becomes visually apparent (with rationale documentation to anchor consistency).
  2. Trajectory Sampling: Automated forecasting generates up to 64 diverse 5 s forward trajectories. These are bucketed by coarse maneuver (left, straight, right) and reduced to at most 12 prototypes selected for behavioral diversity.
  3. Trajectory Scoring: In a custom visualization environment, raters select three diverse candidate trajectories. One trajectory must demonstrate safe and legal driving (score ≥ 6), while the other two illustrate plausible but sub-optimal choices. Scoring starts at 10, with deductions for major infractions (safety, legality, reaction time: −2 each), minor infractions (braking, efficiency: −1 each), and discretionary penalties.

These scores, recorded on a [0, 10] scale, serve as the target labels for human-aligned model evaluation.
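As a purely illustrative tally of this deduction scheme (not the raters' actual tooling), a single candidate's score could be computed as follows; the infraction counts are hypothetical inputs.

```python
def rater_score(major_infractions: int = 0,
                minor_infractions: int = 0,
                discretionary_penalty: float = 0.0) -> float:
    """Tally a rater score: start at 10, deduct 2 per major infraction
    (safety, legality, reaction time), 1 per minor infraction (braking,
    efficiency), plus any discretionary penalty; clip to [0, 10]."""
    score = 10.0 - 2.0 * major_infractions - 1.0 * minor_infractions - discretionary_penalty
    return max(0.0, min(10.0, score))


# Example: one safety infraction and one efficiency infraction -> 10 - 2 - 1 = 7.
assert rater_score(major_infractions=1, minor_infractions=1) == 7.0
```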

4. The Rater Feedback Score (RFS) Metric

Classical open-loop metrics, such as Average Displacement Error (ADE, L₂ distance), fail to capture the multimodal and preference-driven aspects of decision-making in hazardous or ambiguous contexts. WOD-E2E proposes RFS, a metric based on model alignment with rater-annotated reference trajectories $\{r_i(t)\}$ and associated scores $s_i \in [0, 10]$, evaluated at $t \in \{3, 5\}$ s along the candidate trajectory. A velocity-scaled rectangular trust region is constructed:

  • Lateral/longitudinal base thresholds (metres): $\bar{\tau}_{\text{lat}}(3) = 1.0$, $\bar{\tau}_{\text{lng}}(3) = 4.0$; $\bar{\tau}_{\text{lat}}(5) = 1.8$, $\bar{\tau}_{\text{lng}}(5) = 7.2$
  • Scaled by

$$
\text{scale}(v) = \begin{cases} 0.5, & v < 1.4 \\ 0.5 + 0.5\,\dfrac{v - 1.4}{11 - 1.4}, & 1.4 \leq v < 11 \\ 1.0, & v \geq 11 \end{cases}
$$

so that $\tau_{\text{lat/lng}} = \text{scale}(v) \cdot \bar{\tau}_{\text{lat/lng}}$.

A predicted trajectory $p(t)$ is scored against each $r_i$ as:

$$
\mathrm{score}_i(t) = s_i \times 0.1^{\max\left(\max\left(\Delta_{\text{lng}}/\tau_{\text{lng}},\ \Delta_{\text{lat}}/\tau_{\text{lat}}\right) - 1,\ 0\right)}
$$

where $\Delta_{\text{lat}}$ and $\Delta_{\text{lng}}$ are the lateral and longitudinal errors. The final RFS is the highest such value across the three reference trajectories, averaged over $t = 3, 5$ s and floored at 4:

$$
\mathrm{RFS} = \max\left( \max_i \frac{1}{2} \sum_{t \in \{3,5\}} \mathrm{score}_i(t),\ 4 \right)
$$

This metric directly encodes human-prioritized factors (safety, legality, plausibility), supports multi-modal judgment, and imposes meaningful penalties for deviation from accepted reference behaviors.
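The computation can be summarized in a short sketch. The following is a minimal, illustrative implementation (not the official evaluation code); it assumes the lateral/longitudinal errors against each reference have already been decomposed upstream, and all names are our own.

```python
# Base trust-region thresholds in metres at the 3 s and 5 s horizons: (lateral, longitudinal).
BASE_THRESHOLDS = {3: (1.0, 4.0), 5: (1.8, 7.2)}


def speed_scale(v: float) -> float:
    """Velocity-dependent scaling of the trust region (piecewise formula above)."""
    if v < 1.4:
        return 0.5
    if v < 11.0:
        return 0.5 + 0.5 * (v - 1.4) / (11.0 - 1.4)
    return 1.0


def per_reference_score(delta_lat: float, delta_lng: float, s_i: float, t: int, v: float) -> float:
    """Score a prediction against one rated reference trajectory at horizon t (3 or 5 s)."""
    base_lat, base_lng = BASE_THRESHOLDS[t]
    tau_lat, tau_lng = speed_scale(v) * base_lat, speed_scale(v) * base_lng
    excess = max(max(delta_lng / tau_lng, delta_lat / tau_lat) - 1.0, 0.0)
    return s_i * 0.1 ** excess  # exponential decay once the trust region is exceeded


def rater_feedback_score(errors, scores, v: float) -> float:
    """RFS: best-matching reference, averaged over the two horizons, floored at 4.

    errors: list over references, each a dict mapping t in {3, 5} to (delta_lat, delta_lng) in metres.
    scores: per-reference rater scores s_i in [0, 10].
    v:      ego speed in m/s used to scale the trust region.
    """
    per_ref = [0.5 * sum(per_reference_score(*err[t], s_i, t, v) for t in (3, 5))
               for err, s_i in zip(errors, scores)]
    return max(max(per_ref), 4.0)
```

For intuition: a prediction that stays inside the trust region of a reference rated 8 keeps the full 8 for that reference, while one that overshoots the region by a full threshold width decays to 0.8.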

5. Baselines, Benchmark Results, and Metric Insights

Baseline experiments establish NaiveEMMA, a minimal EMMA variant fine-tuned from Gemini Flash, as the reference (RFS = 7.528, ADE = 3.018 m). Ablations show that RFS improves with (a) the full training data, (b) all 8 cameras, and (c) multi-sample scaling (RFS rising from 7.14 to 7.39). Over 30 external models populate the leaderboard, spanning MLP, diffusion-based, and MLLM architectures:

| Model | Type | RFS | ADE (m) |
|---|---|---|---|
| NaiveEMMA (baseline) | Transformer | 7.528 | 3.018 |
| Swin-Trajectory | MLP | 7.543 | 2.814 |
| DiffusionLTF | DiffusionDrive | 7.717 | 2.977 |
| UniPlan | DiffusionDrive | 7.779 | 2.986 |
| AutoVLA (CoT+ADE-RL) | MLLM | 7.556 | 2.958 |
| HMVLM (CoT only) | MLLM | 7.736 | 3.071 |
| Poutine (CoT+RFS-RL) | MLLM | 7.986 | 2.741 |

Critically, RFS and ADE are only weakly correlated, indicating that geometric proximity does not guarantee human-preferred or safe behavior, especially in rare and ambiguous scenarios. Reinforcement learning using RFS as the direct reward (as in Poutine) leads to further metric improvements, while pure ADE optimization or reliance on nominal datasets results in underperformance on genuine long-tail events.
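For comparison, ADE as reported in the table above is the plain unimodal displacement metric; a minimal sketch (assuming equal-length, time-aligned waypoint arrays in the same frame) is shown below. Because it penalizes any deviation from the single logged trajectory, even when an alternative maneuver would have been equally acceptable, its weak correlation with RFS on long-tail scenes is unsurprising.

```python
import numpy as np


def average_displacement_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """ADE: mean L2 distance between predicted and ground-truth waypoints of shape (N, 2)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```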

6. Data Access, Licensing, and the WOD-E2E Challenge

All training and validation data—camera images, ego-vehicle state, navigation intent, scenario labels, and, for validation, rater preference scores—are available under the Waymo Open Dataset license at https://waymo.com/open. Test set labels remain sequestered for the recurring WOD-E2E Challenge; the 2025 iteration evaluates models on previously unseen long-tail segments and awards prizes based on RFS.

7. Research Impact and Prospects

By focusing on the <0.03% of rare, high-consequence events and introducing a human-aligned multi-modal evaluation protocol, WOD-E2E sets a new standard for open-loop E2E driving benchmarks. Its combination of complete 360° perception, precise routing and kinematic context, and preference-based reference scoring is positioned to facilitate the development of robust, generalizable, and safety-compliant autonomous driving systems, particularly those synergistic with MLLMs and other data-driven policy frameworks (Xu et al., 30 Oct 2025). A plausible implication is that future E2E driving research adopting RFS-aligned optimization is more likely to yield agents capable of robust generalization to the most hazardous on-road contingencies.
