
Waymo Open Sim Agents Challenge (WOSAC)

Updated 27 December 2025
  • WOSAC is a benchmark for evaluating multi-agent driving simulators, emphasizing closed-loop realism, safety, and interaction fidelity.
  • It leverages the Waymo Open Motion Dataset with rigorous protocols and standardized metrics to assess agent behavior over an 8-second horizon.
  • Innovative approaches including transformer models, conditional VAEs, and hybrid RL+IL techniques drive improvements in AV simulation robustness and scalability.

The Waymo Open Sim Agents Challenge (WOSAC) is a public, data-driven benchmark that evaluates the fidelity, interactivity, and robustness of multi-agent driving simulators, with an emphasis on applicability to autonomous vehicle (AV) development and research. Drawing from the Waymo Open Motion Dataset (WOMD), WOSAC defines rigorous agent modeling and evaluation protocols, provides a standardized metric suite, and hosts a continuously updated public leaderboard. The challenge has catalyzed rapid progress in closed-loop, interactive traffic simulation methods—spanning imitation learning, reinforcement learning, hybrid model-based/data-driven systems, and specialized post-processing pipelines—serving as both a technical crucible and a unifying testbed for the sim agent community (Montali et al., 2023).

1. Benchmark Definition, Dataset, and Evaluation Protocol

WOSAC operationalizes the sim agent benchmarking problem as closed-loop, multi-agent generative modeling: participants submit world models $q^{\mathrm{world}}$ that, given map and observation context, must autoregressively "roll out" plausible future behaviors (positions, velocities, headings) for all agents over an 8 s horizon at 10 Hz, starting from 1 s of logged history (Montali et al., 2023). Each scenario typically includes vehicles, pedestrians, and cyclists, spanning urban and highway contexts, intersections, merges, and parking maneuvers (Liang, 20 Dec 2025).

Participants' models are factorized into an ego-policy $\pi$ and an environment dynamics model $q$, supporting direct replacement of the AV "under test" (conditional world modeling). All agent actions must be sampled online at each tick, enforcing closed-loop consistency; off-policy, open-loop prediction is explicitly disallowed.
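
This protocol can be summarized in a few lines of code. The sketch below uses invented placeholder callables (`ego_policy`, `world_model`) standing in for $\pi$ and $q^{\mathrm{world}}$; it is not the WOSAC submission API, and only illustrates the closed-loop contract: 1 s of context, 80 autoregressive ticks at 10 Hz, every action sampled online, with 32 rollouts per scenario.

```python
import numpy as np

# Hypothetical stand-ins for a participant's pi and q^world, NOT the WOSAC API.
HISTORY_STEPS = 11   # 1 s of logged context at 10 Hz (including current step)
ROLLOUT_STEPS = 80   # 8 s simulation horizon at 10 Hz
NUM_ROLLOUTS = 32    # stochastic rollouts aggregated per scenario

def ego_policy(state, rng):
    # Placeholder AV policy: small random acceleration command.
    return rng.normal(scale=0.1, size=2)

def world_model(state, ego_action, rng):
    # Placeholder q^world: every agent drifts with its velocity plus noise.
    pos, vel = state
    vel = vel + rng.normal(scale=0.05, size=vel.shape)
    vel[0] += ego_action          # the ego agent is driven by pi, not q^world
    return pos + 0.1 * vel, vel   # 0.1 s per tick

def simulate(initial_pos, initial_vel, seed):
    rng = np.random.default_rng(seed)
    state = (initial_pos, initial_vel)
    trajectory = []
    for _ in range(ROLLOUT_STEPS):
        action = ego_policy(state, rng)           # sampled online, every tick
        state = world_model(state, action, rng)   # closed loop: output feeds back
        trajectory.append(state[0].copy())
    return np.stack(trajectory)                   # (80, num_agents, 2)

# 32 rollouts of a 4-agent scene; each seed yields a distinct plausible future.
pos0, vel0 = np.zeros((4, 2)), np.ones((4, 2))
rollouts = [simulate(pos0, vel0, seed) for seed in range(NUM_ROLLOUTS)]
```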

Evaluation aggregates model-generated rollouts across 32 stochastic seeds per scenario. A comprehensive suite of per-agent, per-scenario metrics is computed and compared to logged ground-truth, then summarized into a composite "realism meta-metric" as a weighted average of kinematic, interaction, and map-based components such as:

  • Linear and angular speed, acceleration
  • Collision and near-collision rate
  • Distance to nearest object, time-to-collision
  • Off-road and route departure rate
  • Lane and traffic-signal compliance

Metrics are combined using either negative log-likelihood (NLL) or direct similarity (e.g., Kolmogorov–Smirnov) normalization. Collisions and road departures are typically double-weighted to emphasize safety (Montali et al., 2023, Liang, 20 Dec 2025).
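
As a deliberately simplified illustration of the likelihood-style normalization, the sketch below builds a histogram over the simulated samples of one statistic (e.g., linear speed across the 32 rollouts) and scores the logged ground-truth value under it. The bin count, value range, and clipping floor are illustrative assumptions, not the official WOSAC constants.

```python
import numpy as np

def likelihood_score(simulated_values, logged_value, bins=20, value_range=(0.0, 30.0)):
    """Probability mass the simulated distribution assigns to the logged value."""
    hist, edges = np.histogram(simulated_values, bins=bins,
                               range=value_range, density=True)
    idx = np.clip(np.digitize(logged_value, edges) - 1, 0, bins - 1)
    width = edges[1] - edges[0]
    p = hist[idx] * width          # mass of the bin containing the logged value
    return float(max(p, 1e-6))     # clipped to avoid -inf NLL; higher = more realistic

# Example: 32 rollouts' speeds for one agent at one timestep vs. the log.
sim_speeds = np.random.default_rng(0).normal(12.0, 1.5, size=32)
print(likelihood_score(sim_speeds, logged_value=11.4))
```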

2. Modeling Approaches and Baseline Agents

WOSAC has tracked the evolution from simple rule-based and motion-forecasting models to sophisticated transformer architectures, conditional VAEs, autoregressive token generators, and reinforcement learning fine-tuning.

  • Classical Baselines: Constant velocity, constant-velocity-plus-noise, and "Oracle" log-replay are consistently outperformed by data-driven models on realism meta-metrics (Montali et al., 2023).
  • Motion-Transformer Models: MultiVerse Transformer (MVTA/MVTE), MTR+++, and CAD apply transformer encoders/decoders to agent context and motion history, leveraging GMM heads or anchor-based classification, variable history aggregation, and scheduled top-$K$ sampling (Wang et al., 2023, Qian et al., 2023, Chiu et al., 2023).
  • Conditional VAEs and Polyline Transformers: TrafficBots V1.5 combines a CVAE per-agent policy with a heterogeneous polyline transformer featuring relative pose encoding and scheduled teacher-forcing, offering a blend of multimodal diversity and structural map/neighbor attention (Zhang et al., 16 Jun 2024).
  • Tokenization and NTP Models: Discrete next-token prediction (NTP) models such as SMART, enhanced with TrajTok tokenization and spatial-aware label smoothing, yield powerful closed-loop agents that directly optimize over tokenized trajectory sequences; see the tokenization sketch after this list (Zhang et al., 23 Jun 2025).
  • Reinforcement Learning Agents: PPO-based self-play has demonstrated scalable, reliability-focused sim agents with near-perfect goal achievement and low infraction rates. Post-hoc RL fine-tuning, such as SMART-R1's R1-style alternating SFT-RFT-SFT pipeline, achieves state-of-the-art realism and safety (Cornelisse et al., 20 Feb 2025, Pei et al., 28 Sep 2025, Peng et al., 26 Sep 2024).
  • Hybrid RL+IL Anchoring: SPACeR uses a KL-regularized RL framework anchored to a centralized reference policy (e.g., SMART), providing a bridge between human-likeness, multi-agent interaction, and efficient self-play simulation (Chang et al., 20 Oct 2025).
  • Model-Based Comparison: Model-based simulators like SUMO, ported to WOMD via Waymo2SUMO, deliver strong long-horizon (60 s) stability and interpretable control but slightly lower short-term fidelity versus learned agents (Liang, 20 Dec 2025).
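
To make the NTP bullet above concrete, here is a minimal sketch of grid-based displacement tokenization with spatial-aware label smoothing in the spirit of TrajTok/SMART. The grid size, spatial extent, and smoothing mass are invented parameters, not the published configuration.

```python
import numpy as np

GRID = 33      # tokens per axis; vocabulary size is GRID * GRID (assumption)
EXTENT = 5.0   # metres covered per axis per motion step (assumption)

def tokenize(dx, dy):
    """Map a 2-D displacement to a discrete token id on a uniform grid."""
    ix = np.clip(int((dx + EXTENT) / (2 * EXTENT) * GRID), 0, GRID - 1)
    iy = np.clip(int((dy + EXTENT) / (2 * EXTENT) * GRID), 0, GRID - 1)
    return ix * GRID + iy

def smoothed_target(token, eps=0.1):
    """Spread (1 - eps) on the true token and eps over its grid neighbours,
    so spatial near-misses are penalized less than distant tokens."""
    target = np.zeros(GRID * GRID)
    ix, iy = divmod(token, GRID)
    neighbours = [(ix + a, iy + b) for a in (-1, 0, 1) for b in (-1, 0, 1)
                  if (a, b) != (0, 0)
                  and 0 <= ix + a < GRID and 0 <= iy + b < GRID]
    target[token] = 1.0 - eps
    for nx, ny in neighbours:
        target[nx * GRID + ny] = eps / len(neighbours)
    return target

tok = tokenize(1.2, -0.4)    # displacement -> vocabulary index
y = smoothed_target(tok)     # soft cross-entropy target for NTP training
```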

3. Key Evaluation Metrics and Leaderboard Comparison

The realism meta-metric $\mathcal{M}$ aggregates dozens of per-agent, per-scenario metrics into group-wise (kinematic, interactive, map-based) and overall scenario-level scores:

  • Realism Meta-Metric: $\mathcal{M} = \sum_j w_j \left( \frac{1}{N} \sum_i m_{i,j} \right)$, where $j$ indexes metrics and $i$ indexes scenarios (Montali et al., 2023).
  • Component Examples: Speed likelihood, collision indicator, time-to-collision, distance to road edge, offroad likelihood.
  • Submetric Scores: Each group score lies in $[0,1]$; higher is better for realism components, while lower is better for minADE and infraction rates.
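
A direct transcription of $\mathcal{M}$ into code is straightforward, as the sketch below shows. The weights are illustrative placeholders (with collision and offroad double-weighted, per Section 1), not the official WOSAC values.

```python
import numpy as np

WEIGHTS = {   # illustrative placeholder weights, not the official ones
    "linear_speed": 1.0, "angular_speed": 1.0,
    "time_to_collision": 1.0, "distance_to_object": 1.0,
    "collision": 2.0, "offroad": 2.0,   # safety metrics double-weighted
}

def realism_meta_metric(per_scenario_scores):
    """per_scenario_scores: dict metric_name -> array of N scenario scores m_{i,j}.

    Computes M = sum_j w_j * (1/N) * sum_i m_{i,j}, normalized so that the
    weights sum to one (an assumption about the normalization convention).
    """
    total_w = sum(WEIGHTS.values())
    return sum(w * np.mean(per_scenario_scores[name])
               for name, w in WEIGHTS.items()) / total_w

# Example with random per-scenario scores for 100 scenarios.
rng = np.random.default_rng(0)
scores = {name: rng.uniform(0.5, 1.0, size=100) for name in WEIGHTS}
print(realism_meta_metric(scores))
```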

Recent top leaderboard scores (2025) include:

| Method      | Realism Meta | Kinematic | Interactive | Map    | minADE |
|-------------|--------------|-----------|-------------|--------|--------|
| SMART-R1    | 0.7858       | 0.4944    | 0.8110      | 0.9201 | 1.2885 |
| TrajTok     | 0.7852       | 0.4887    | 0.8116      | 0.9207 | 1.3179 |
| TrafficBots | 0.6988       | 0.4304    | 0.7114      | 0.8360 | 1.8825 |
| SUMO        | 0.653        | 0.3294    | 0.7153      | 0.7585 | —      |

In all cases, higher composite and group scores—and lower minADE/collision/offroad—signal stronger closed-loop behavioral realism and safety (Pei et al., 28 Sep 2025, Zhang et al., 23 Jun 2025, Zhang et al., 16 Jun 2024, Liang, 20 Dec 2025).

4. Algorithmic Innovations and Training Strategies

Multiple lines of technical advancement have been observed:

  • Closed-Loop Autoregression: All top methods implement strict autoregressive sampling at each tick, with the agent's own rollouts dictating subsequent context, addressing covariate shift and compounding error (Montali et al., 2023, Wang et al., 2023).
  • Receding Horizon and Hybrid Sampling: MVTA, CAD, and MTR+++ employ predict-then-execute-one-step strategies (sketched after this list), variable history aggregation, and hybrid deterministic/stochastic sampling for stability and diversity (Wang et al., 2023, Qian et al., 2023, Chiu et al., 2023).
  • Spatial-Aware and Rule-Based Tokenization: TrajTok's hybrid rule-plus-data grid-based tokenization, noise filtering, and spatial label smoothing directly improve coverage and robustness in NTP architectures (Zhang et al., 23 Jun 2025).
  • Teacher-Forcing and Collision Filtering: TrafficBots V1.5 introduces scheduled teacher-forcing and inference-time scenario selection (collision-biased filtering) (Zhang et al., 16 Jun 2024).
  • Reinforcement Fine-Tuning: SMART-R1's metric-oriented policy optimization (MPO) leverages the true WOSAC meta-metric as a reward within an iterative SFT–RFT–SFT pipeline, surpassing prior PPO/DPO/GRPO approaches (Pei et al., 28 Sep 2025). RL fine-tuning methods consistently yield nontrivial gains on critical submetrics such as collision avoidance and offroad adherence (Peng et al., 26 Sep 2024).
  • KL-Anchored RL: SPACeR's central anchored RL framework tightly aligns decentralized self-play policies with strong IL reference distributions, improving human-likeness and generalization (Chang et al., 20 Oct 2025).
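
The receding-horizon pattern from the second bullet reduces to a short loop: predict a multi-step plan at every tick, but execute only its first step before re-predicting from the updated state. In the sketch below, `predict_horizon` is a random-walk stand-in for a learned motion transformer, not any published model's API.

```python
import numpy as np

def predict_horizon(state, horizon, rng):
    # Placeholder predictor: returns `horizon` future positions per agent
    # as a noisy random walk starting from the current state.
    return state[None] + np.cumsum(
        rng.normal(scale=0.1, size=(horizon,) + state.shape), axis=0)

def receding_horizon_rollout(state, steps=80, horizon=8, seed=0):
    rng = np.random.default_rng(seed)
    executed = []
    for _ in range(steps):
        plan = predict_horizon(state, horizon, rng)  # predict the full horizon...
        state = plan[0]                              # ...but execute only one step
        executed.append(state)
    return np.stack(executed)

traj = receding_horizon_rollout(np.zeros((4, 2)))    # (80, 4, 2)
```

Re-predicting every tick keeps the agent reactive to the evolving scene, while the unused tail of each plan regularizes the prediction toward kinematically consistent futures.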

5. Robustness, Generalization, and Policy Evaluation

Generalization beyond train/test splits and robustness to intervention/replay policies have emerged as focal points:

  • OOD Stress Testing: Self-play agents have demonstrated rapid adaptation to rare layouts (e.g., OOD U-turns and reverse driving) through targeted fine-tuning with only a handful of new scenarios (Cornelisse et al., 20 Feb 2025).
  • Delta/Confusion Metrics: To assess the sensitivity of world models to partial control (e.g., ego replay), delta meta-metrics and confusion rates (simulation and policy) have been introduced. These highlight that standard realism scores alone are insufficient predictors of robustness in closed-loop policy training environments (Schofield et al., 3 Aug 2025).
  • Causality Annotation: Evaluation domains have been extended to include agents causal to the ego, avoiding metric overfitting on non-causal actors and exposing critical interaction failures (Schofield et al., 3 Aug 2025).
  • Planner Policy Evaluation: Recent work benchmarks the capacity of sim agents to reliably evaluate AV planner quality, using rank-correlation and mean evaluation error metrics relative to ground-truth log replay (Peng et al., 26 Sep 2024, Chang et al., 20 Oct 2025).
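
The planner-evaluation protocol from the last bullet can be illustrated with made-up scores: rank a set of candidate AV planners by some quality score under log-replay agents and under the sim agents being tested, then compare the two via rank correlation and mean evaluation error. All numbers below are fabricated for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

# Quality scores for five hypothetical planners, evaluated two ways.
log_replay_scores = np.array([0.81, 0.74, 0.69, 0.55, 0.52])  # "ground truth"
sim_agent_scores = np.array([0.79, 0.70, 0.71, 0.58, 0.50])   # sim agents under test

rho, _ = spearmanr(log_replay_scores, sim_agent_scores)        # rank agreement
mean_err = np.mean(np.abs(log_replay_scores - sim_agent_scores))
print(f"rank correlation={rho:.3f}, mean evaluation error={mean_err:.3f}")
```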

6. Practical Scaling: Simulators and Infrastructure

The challenge requires high-throughput, parallelizable simulation for both training and leaderboard evaluation:

  • GPUDrive and Waymax: Purpose-built JAX/GPU simulators (GPUDrive, Waymax) batch hundreds of scenarios in parallel, exposing RL/evaluation APIs and efficient functional MDP state updates (illustrated after this list). Waymax integrates seamlessly with learned and IDM agents, supporting joint training and evaluation at ≥1000 Hz simulation throughput (Gulino et al., 2023, Cornelisse et al., 20 Feb 2025).
  • SUMO Integration: Automated Waymo2SUMO pipelines translate WOMD scenarios for direct benchmarking of classical microscopic simulation. SUMO offers near-unbounded rollout stability (>60s), strong long-horizon safety, and minimal parameterization (<100 tunable parameters) (Liang, 20 Dec 2025).
  • Compute Footprint: State-of-the-art RL/NTP entries typically require on the order of one day of A100-scale compute for training (≈2 B steps), with top models spanning parameter counts from 65 K (SPACeR) to 7 M (SMART-R1) (Pei et al., 28 Sep 2025, Chang et al., 20 Oct 2025).
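
The throughput figures above rest on the functional-MDP design mentioned in the first bullet. The sketch below is not the Waymax or GPUDrive API; it only illustrates the underlying pattern: represent scenario state as arrays, write one pure per-scenario step function, and batch hundreds of scenarios with `jax.vmap` under `jax.jit`.

```python
import jax
import jax.numpy as jnp

def step(state, action):
    """Pure functional MDP update for a single scenario (positions, velocities)."""
    pos, vel = state
    vel = vel + 0.1 * action             # 10 Hz tick (illustrative dynamics)
    return (pos + 0.1 * vel, vel)

# Batch over 256 scenarios of 32 agents each; jit fuses the whole tick into
# a single compiled GPU program.
batched_step = jax.jit(jax.vmap(step))

states = (jnp.zeros((256, 32, 2)), jnp.ones((256, 32, 2)))
actions = jnp.zeros((256, 32, 2))
states = batched_step(states, actions)   # one parallel tick across all scenarios
```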

7. Open Problems and Future Directions

WOSAC and its recent literature highlight several challenging research axes:

  • Modeling rare events: improved upsampling of collisions, occlusions, object insertion/deletion, and time-varying object geometry (Montali et al., 2023).
  • Stable, scalable VRU (pedestrian, cyclist) simulation: current entries often fix VRUs to log-replay, limiting interaction realism for non-vehicle agents (Chang et al., 20 Oct 2025).
  • Closed-loop scene-centric models: while agent-centric factorization is prevalent, joint, scene-wise diffusion and interaction-aware training are open for further progress (Montali et al., 2023, Wang et al., 2023).
  • Hybrid Model-Based/Data-Driven Systems: unifying the long-horizon stability of SUMO with short-horizon behavioral richness of deep networks is an emerging pathway (Liang, 20 Dec 2025).
  • Metric Alignment and Causal Correctness: refining metrics, evaluation domains, and regularizers to better align with deploy-time safety, controllability, and planner evaluation fidelity (Schofield et al., 3 Aug 2025, Jang et al., 24 Oct 2024).

WOSAC remains a central benchmark, continuously stimulating innovation in the simulation agent ecosystem for autonomous driving research.
