GenAD: Generative End-to-End Autonomous Driving (2402.11502v3)
Abstract: Directly producing planning results from raw sensors has been a long-desired solution for autonomous driving and has attracted increasing attention recently. Most existing end-to-end autonomous driving methods factorize this problem into perception, motion prediction, and planning. However, we argue that this conventional progressive pipeline cannot comprehensively model the entire traffic evolution process, e.g., the future interactions between the ego car and other traffic participants and the structural trajectory prior. In this paper, we explore a new paradigm for end-to-end autonomous driving, where the key is to predict how the ego car and its surroundings evolve given past scenes. We propose GenAD, a framework that casts autonomous driving as a generative modeling problem. We design an instance-centric scene tokenizer that first transforms the surrounding scenes into map-aware instance tokens. We then employ a variational autoencoder to learn the future trajectory distribution in a structural latent space, which serves as a trajectory prior. We further adopt a temporal model that captures agent and ego movements in this latent space to generate more effective future trajectories. Finally, GenAD performs motion prediction and planning simultaneously by sampling from the learned structural latent space conditioned on the instance tokens and unrolling the learned temporal model to generate futures. Extensive experiments on the widely used nuScenes benchmark show that the proposed GenAD achieves state-of-the-art performance on vision-centric end-to-end autonomous driving with high efficiency. Code: https://github.com/wzzheng/GenAD.
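The generative pipeline the abstract describes (encode an instance token to a latent Gaussian, sample via the VAE reparameterization trick, unroll a temporal model in latent space, decode future waypoints) can be illustrated with a toy sketch. All shapes, weights, and function names here are illustrative assumptions, not the paper's actual architecture; the real model uses learned transformer, VAE, and recurrent components.

```python
# Toy sketch of a generative trajectory pipeline in the spirit of GenAD.
# All components below are stand-ins with assumed shapes, not the paper's model.
import numpy as np

rng = np.random.default_rng(0)

def encode_to_latent(instance_token, latent_dim=4):
    """Toy 'VAE encoder': map an instance token to Gaussian parameters (mu, logvar)."""
    mu = instance_token[:latent_dim]       # a learned encoder in the real model
    logvar = np.zeros(latent_dim)          # unit variance for this sketch
    return mu, logvar

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps (the standard VAE reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def temporal_step(z, W):
    """Toy recurrent update in latent space (stands in for the temporal model)."""
    return np.tanh(W @ z)

def decode_waypoint(z, D):
    """Decode a latent state into a 2-D (x, y) waypoint."""
    return D @ z

latent_dim, horizon = 4, 6
token = rng.standard_normal(8)                        # one map-aware instance token
W = 0.5 * rng.standard_normal((latent_dim, latent_dim))
D = rng.standard_normal((2, latent_dim))

mu, logvar = encode_to_latent(token, latent_dim)
z = reparameterize(mu, logvar)                        # sample from the trajectory prior
trajectory = []
for _ in range(horizon):                              # unroll the temporal model
    z = temporal_step(z, W)
    trajectory.append(decode_waypoint(z, D))
trajectory = np.stack(trajectory)
print(trajectory.shape)                               # (6, 2): six future (x, y) waypoints
```

Because futures are obtained by sampling the latent prior, drawing multiple `z` samples for the same token yields a distribution over trajectories, which is how a generative formulation covers both motion prediction (other agents) and planning (the ego car) in one pass.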