
Learning to Drive via Asymmetric Self-Play (2409.18218v1)

Published 26 Sep 2024 in cs.RO, cs.CV, and cs.LG

Abstract: Large-scale data is crucial for learning realistic and capable driving policies. However, it can be impractical to rely on scaling datasets with real data alone. The majority of driving data is uninteresting, and deliberately collecting new long-tail scenarios is expensive and unsafe. We propose asymmetric self-play to scale beyond real data with additional challenging, solvable, and realistic synthetic scenarios. Our approach pairs a teacher that learns to generate scenarios it can solve but the student cannot, with a student that learns to solve them. When applied to traffic simulation, we learn realistic policies with significantly fewer collisions in both nominal and long-tail scenarios. Our policies further zero-shot transfer to generate training data for end-to-end autonomy, significantly outperforming state-of-the-art adversarial approaches, or using real data alone. For more information, visit https://waabi.ai/selfplay .

Summary

  • The paper introduces a novel teacher–student framework that uses asymmetric self-play to generate safe yet challenging driving scenarios.
  • It employs a transformer-based architecture and extensive simulations, demonstrating significant collision reduction compared to baselines.
  • The method scales data generation for autonomous driving while enhancing policy generalization to various safety-critical scenarios.

An Analytical Overview of "Learning to Drive via Asymmetric Self-Play"

Introduction

The paper "Learning to Drive via Asymmetric Self-Play" by Zhang et al. addresses the critical issue of acquiring vast, diverse driving data to train autonomous driving policies effectively. Collecting real-world driving data is both expensive and dangerous, especially when it involves edge cases. This paper introduces a novel solution through asymmetric self-play, pairing a teacher policy (which generates scenarios solvable by itself but not the student) with a student policy that learns to navigate these challenging scenarios. This approach is proposed to scale training beyond the limitations of real-world data, resulting in more robust and realistic driving policies.

Problem Formulation and Approach

Traffic Modeling

The authors formulate traffic modeling as a multiagent problem over a sequence of time steps. Each traffic scenario consists of a high-definition (HD) map together with the states and actions of multiple actors. Actor motion follows a kinematic bicycle model, and the policy controlling the actors focuses on interactions among the actors and with their environment.
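This summary does not reproduce the paper's exact dynamics equations, but a standard kinematic bicycle update, sketched below in Python under assumed state and action conventions, illustrates how each actor's pose evolves from its actions; the time step and axle distances are illustrative values, not the paper's.

```python
import numpy as np

def bicycle_step(state, action, dt=0.1, lf=1.4, lr=1.4):
    """One kinematic-bicycle update for a single actor.

    state  = (x, y, heading, speed); action = (acceleration, steering_angle).
    dt, lf, and lr (distances from the center of mass to the front/rear axles)
    are illustrative assumptions, not values from the paper.
    """
    x, y, theta, v = state
    accel, steer = action
    beta = np.arctan(lr / (lf + lr) * np.tan(steer))  # slip angle at the center of mass
    x += v * np.cos(theta + beta) * dt
    y += v * np.sin(theta + beta) * dt
    theta += (v / lr) * np.sin(beta) * dt
    v += accel * dt
    return np.array([x, y, theta, v])

# Example: an actor at 10 m/s accelerating gently while steering slightly left.
next_state = bicycle_step(np.array([0.0, 0.0, 0.0, 10.0]), (0.5, 0.05))
```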

Asymmetric Self-Play Learning

In this framework, a teacher policy (π_T) generates challenging scenarios that are realistically solvable, aiming to expose weaker aspects of the student. The student policy (π_S) aims to solve these scenarios. The training alternates control between the teacher and student in a manner ensuring fairness and consistency, refining both policies over numerous iterations.
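A high-level sketch of one such iteration is given below. Every interface used here (sim.rollout, policy.update, and so on) is a placeholder invented for illustration; the paper's actual training procedure, control hand-off, and update rules are more involved.

```python
import random

def self_play_iteration(teacher, student, sim, real_dataset):
    """One asymmetric self-play iteration, sketched at a high level.

    The teacher rolls out from a logged scenario to shape a challenging scene,
    the student then attempts to solve it, and both policies are updated from
    the resulting trajectories. All interfaces (sim.rollout, policy.update, ...)
    are placeholders, not the paper's actual API.
    """
    seed = random.choice(real_dataset)                 # anchor to real data
    scenario = sim.rollout(policy=teacher, init=seed)  # teacher shapes the scene
    student_traj = sim.rollout(policy=student, init=scenario)

    # The teacher is rewarded for scenarios it can solve but the student cannot,
    # subject to realism; the student is trained to avoid collisions while
    # staying close to the data distribution (see the objective sketch below).
    teacher.update(scenario, student_traj)
    student.update(student_traj)
```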

The primary objective for the teacher policy includes terms for generating solvable (low collision) and realistic (close to the data distribution) scenarios. The student policy, meanwhile, emphasizes avoiding collisions and maintaining realism during interactions. Theoretical guarantees are provided to show that student policies achieve α-β optimality, ensuring they can solve reasonably realistic scenarios.
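The exact reward terms and weights are not given in this summary; the sketch below is a simplified scalar version that mirrors the stated structure (solvability, challenge, and realism for the teacher; collision avoidance and realism for the student). The weights and the realism score are illustrative placeholders, not the paper's formulation.

```python
def teacher_objective(teacher_collided, student_collided, realism_score,
                      w_solve=1.0, w_challenge=1.0, w_real=1.0):
    """Illustrative teacher objective: reward scenarios that the teacher can
    complete without collision, that the student fails, and that remain close
    to the real-data distribution. Weights are assumptions, not the paper's."""
    solvable = 0.0 if teacher_collided else 1.0
    challenging = 1.0 if student_collided else 0.0
    return w_solve * solvable + w_challenge * challenging + w_real * realism_score


def student_objective(student_collided, realism_score, w_col=1.0, w_real=1.0):
    """Illustrative student objective: avoid collisions while staying close to
    the data distribution."""
    collision_penalty = 1.0 if student_collided else 0.0
    return -w_col * collision_penalty + w_real * realism_score
```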

Implementation and Experiments

Neural Architecture

The authors implement these policies with a transformer-based architecture. The model encodes map features, actor states, and interactions between actors using attention mechanisms. The policy can then deterministically predict actions for the actors based on these encoded features.
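A minimal PyTorch sketch of this kind of architecture is shown below; the feature dimensions, layer counts, token layout, and two-dimensional (acceleration, steering) action head are assumptions made for illustration rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class SceneEncoderPolicy(nn.Module):
    """Minimal sketch of a transformer policy over map and actor tokens.

    Feature sizes, layer counts, and the action parameterization are
    illustrative assumptions, not the paper's exact architecture.
    """

    def __init__(self, map_dim=16, actor_dim=8, d_model=128, n_layers=4, n_heads=8):
        super().__init__()
        self.map_proj = nn.Linear(map_dim, d_model)      # embed map features
        self.actor_proj = nn.Linear(actor_dim, d_model)  # embed actor states
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, 2)         # (acceleration, steering)

    def forward(self, map_feats, actor_feats):
        # map_feats: (B, M, map_dim), actor_feats: (B, N, actor_dim)
        tokens = torch.cat([self.map_proj(map_feats),
                            self.actor_proj(actor_feats)], dim=1)
        encoded = self.encoder(tokens)                    # attention over all tokens
        n_actors = actor_feats.shape[1]
        return self.action_head(encoded[:, -n_actors:])   # one action per actor
```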

Traffic Simulation Experiments

The experiments span three datasets: Argoverse2 Motion, Highway, and a specially curated Safety dataset. Various baselines, including Closed-loop Imitation Learning (IL), TrafficSim, multiagent reinforcement learning (MARL), and adversarial methods like KING, are used for comparison. The proposed asymmetric self-play method consistently outperformed these baselines, particularly in reducing collision rates while maintaining other realism metrics such as final displacement error (FDE) and similarities in driving behavior distributions.
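For reference, the two headline metrics mentioned above can be computed roughly as follows; these are common definitions with simplifications (per-scenario collision flags, single-trajectory FDE), not necessarily the paper's exact evaluation protocol.

```python
import numpy as np

def final_displacement_error(pred_traj, gt_traj):
    """FDE: Euclidean distance between predicted and ground-truth positions at
    the final time step. Both inputs are (T, 2) arrays of (x, y) positions."""
    return float(np.linalg.norm(pred_traj[-1] - gt_traj[-1]))

def collision_rate(collision_flags):
    """Fraction of simulated scenarios (or actors) flagged as having collided.
    `collision_flags` is a sequence of booleans, one per scenario or actor."""
    return float(np.mean(np.asarray(collision_flags, dtype=float)))
```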

Zero-shot Scenario Generation for Autonomy Systems

Further, the teacher policy's ability to generalize and generate useful training scenarios for unseen end-to-end autonomy systems was tested. Training with these scenarios improved the performance of both object-based and object-free autonomy systems on safety-critical scenarios, whereas traditional supervision and adversarial methods were less effective in comparison.

Analysis and Ablation Studies

Extensive ablations confirmed the necessity of incorporating both solvability and realism objectives within the teacher policy for optimal student performance. The paper also demonstrated that design choices such as the three-player game formulation and action replay during training substantially benefited the policy's performance.

Implications and Future Directions

This work presents several significant theoretical and practical implications:

  • Scalability: The approach circumvents the practical barriers of real-world data collection, providing a scalable method for generating diverse and challenging driving scenarios.
  • Robustness: Training policies in adversarial settings using self-play enhances their robustness to rare and challenging scenarios, which is crucial for real-world deployment of autonomous driving systems.
  • Generalization: The teacher's generalization capacity to create effective training data for different autonomy systems underscores the method's broad applicability.

Future research directions could explore increasing the diversity of scenarios beyond collisions, addressing other failure modalities such as perception errors and off-road events. Additionally, incorporating controllable scenario generation could allow for more targeted policy training.

Conclusion

The paper "Learning to Drive via Asymmetric Self-Play" introduces a robust framework for training driving policies via generated synthetic scenarios, addressing the challenges of real-data dependency. This work constitutes an essential step towards more scalable and comprehensive solutions for autonomous driving, yielding improvements in both practical deployment and theoretical understanding of self-play mechanisms in multiagent systems. The presented experiments and theoretical guarantees offer a solid foundation for further innovations and applications in the autonomous driving domain.