OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving (2311.16038v1)
Abstract: Understanding how the 3D scene evolves is vital for making decisions in autonomous driving. Most existing methods achieve this by predicting the movements of object boxes, which cannot capture more fine-grained scene information. In this paper, we explore a new framework, OccWorld, that learns a world model in the 3D occupancy space to simultaneously predict the movement of the ego car and the evolution of the surrounding scene. We propose to learn a world model based on 3D occupancy rather than 3D bounding boxes and segmentation maps for three reasons: 1) expressiveness: 3D occupancy describes the fine-grained 3D structure of the scene; 2) efficiency: 3D occupancy is more economical to obtain (e.g., from sparse LiDAR points); 3) versatility: 3D occupancy can adapt to both vision and LiDAR. To facilitate modeling of the world's evolution, we learn a reconstruction-based scene tokenizer on the 3D occupancy to obtain discrete scene tokens that describe the surrounding scene. We then adopt a GPT-like spatial-temporal generative transformer to generate subsequent scene and ego tokens, which are decoded into future occupancy and the ego trajectory. Extensive experiments on the widely used nuScenes benchmark demonstrate the ability of OccWorld to effectively model the evolution of driving scenes. OccWorld also produces competitive planning results without using instance and map supervision. Code: https://github.com/wzzheng/OccWorld.
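The abstract outlines a two-stage pipeline: a reconstruction-based tokenizer turns each occupancy frame into discrete scene tokens, and a GPT-like spatial-temporal transformer then autoregressively generates future scene and ego tokens. The PyTorch sketch below illustrates only this structure; the module shapes, VQ-style quantization, codebook size, and single-waypoint ego head are assumptions made for illustration, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

# Minimal sketch of the two-stage design described above, assuming a
# VQ-style tokenizer and a causal transformer. Module names, grid and
# codebook sizes, and the single-waypoint ego head are illustrative
# assumptions, not the released OccWorld code.


class SceneTokenizer(nn.Module):
    """Discretizes a semantic 3D occupancy grid into scene tokens."""

    def __init__(self, num_classes=18, height=16, embed_dim=128, codebook_size=512):
        super().__init__()
        # Fold the height axis into channels and downsample the BEV plane.
        self.encoder = nn.Sequential(
            nn.Conv2d(num_classes * height, embed_dim, 3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(embed_dim, embed_dim, 3, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, embed_dim)

    def forward(self, occ_onehot):
        # occ_onehot: (B, num_classes, height, X, Y) one-hot occupancy grid.
        b, c, h, x, y = occ_onehot.shape
        feat = self.encoder(occ_onehot.reshape(b, c * h, x, y))  # (B, D, X', Y')
        flat = feat.flatten(2).transpose(1, 2)                   # (B, N, D)
        # Nearest-codebook-entry quantization -> discrete scene token ids.
        dist = (flat.pow(2).sum(-1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(-1))
        return dist.argmin(-1)                                   # (B, N)


class WorldTransformer(nn.Module):
    """GPT-like decoder predicting next-frame scene tokens and an ego waypoint."""

    def __init__(self, codebook_size=512, embed_dim=128, n_layers=4, n_heads=8):
        super().__init__()
        self.token_embed = nn.Embedding(codebook_size, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.scene_head = nn.Linear(embed_dim, codebook_size)  # logits over next tokens
        self.ego_head = nn.Linear(embed_dim, 2)                 # next ego (x, y) offset

    def forward(self, token_ids):
        # token_ids: (B, T * N) history of scene tokens, flattened over time.
        x = self.token_embed(token_ids)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        x = self.blocks(x, mask=causal)
        return self.scene_head(x), self.ego_head(x.mean(dim=1))


if __name__ == "__main__":
    # Toy-sized occupancy grid (the real benchmark grid is larger).
    occ = torch.zeros(1, 18, 16, 100, 100)
    occ[:, 0] = 1.0  # mark every voxel as the "empty" class for this demo
    tokens = SceneTokenizer()(occ)                  # (1, 625) scene token ids
    scene_logits, ego_xy = WorldTransformer()(tokens)
    print(tokens.shape, scene_logits.shape, ego_xy.shape)
```

In the full method, the decoder half of the tokenizer maps predicted scene tokens back to 3D occupancy and the predicted ego tokens are decoded into a planned trajectory; this sketch stops at the token level.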
Authors: Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, Jiwen Lu