OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving (2311.16038v1)

Published 27 Nov 2023 in cs.CV, cs.AI, and cs.LG

Abstract: Understanding how the 3D scene evolves is vital for making decisions in autonomous driving. Most existing methods achieve this by predicting the movements of object boxes, which cannot capture more fine-grained scene information. In this paper, we explore a new framework of learning a world model, OccWorld, in the 3D Occupancy space to simultaneously predict the movement of the ego car and the evolution of the surrounding scenes. We propose to learn a world model based on 3D occupancy rather than 3D bounding boxes and segmentation maps for three reasons: 1) expressiveness. 3D occupancy can describe the more fine-grained 3D structure of the scene; 2) efficiency. 3D occupancy is more economical to obtain (e.g., from sparse LiDAR points). 3) versatility. 3D occupancy can adapt to both vision and LiDAR. To facilitate the modeling of the world evolution, we learn a reconstruction-based scene tokenizer on the 3D occupancy to obtain discrete scene tokens to describe the surrounding scenes. We then adopt a GPT-like spatial-temporal generative transformer to generate subsequent scene and ego tokens to decode the future occupancy and ego trajectory. Extensive experiments on the widely used nuScenes benchmark demonstrate the ability of OccWorld to effectively model the evolution of the driving scenes. OccWorld also produces competitive planning results without using instance and map supervision. Code: https://github.com/wzzheng/OccWorld.

Authors (6)
  1. Wenzhao Zheng (64 papers)
  2. Weiliang Chen (14 papers)
  3. Yuanhui Huang (14 papers)
  4. Borui Zhang (15 papers)
  5. Yueqi Duan (47 papers)
  6. Jiwen Lu (192 papers)
Citations (45)

Summary

  • The paper presents OccWorld, which leverages 3D occupancy grids to model dynamic driving scenes with enhanced detail over traditional methods.
  • It uses a VQVAE-based scene tokenizer and a generative transformer to forecast future occupancy and vehicle trajectories, achieving competitive IoU and L2 metrics.
  • This integrated approach minimizes reliance on extensive annotations, paving the way for scalable, self-supervised systems in autonomous driving.

OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving

The paper "OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving" by Wenzhao Zheng et al. presents a novel approach to modeling 3D environments for autonomous driving using occupancy grids. This paper introduces a framework called OccWorld, which leverages 3D occupancy space to predict both the future states of dynamic environments and the trajectory of autonomous vehicles, thus offering a comprehensive model of scene evolution.

Core Contributions

The authors identify several challenges in current autonomous driving systems that rely heavily on the prediction of bounding box movements and semantic maps. They propose 3D occupancy grids as a superior alternative, citing their expressiveness, efficiency, and versatility. Specifically, they note that 3D occupancy can capture finer details of the environment, is easier to acquire from sparse LiDAR points, and adapts to both vision and LiDAR inputs.

A key component of OccWorld is its 3D occupancy scene tokenizer, which leverages a vector-quantized variational autoencoder (VQVAE). The tokenizer produces discrete scene tokens, yielding an efficient and compact representation suited to generative transformer-based spatial-temporal modeling. A GPT-like architecture then autoregressively forecasts future occupancy and the ego trajectory, handling both the dynamic and static elements of driving environments.
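
To make the tokenization step concrete, the sketch below shows the core vector-quantization operation such a tokenizer relies on. It is a minimal PyTorch illustration under assumed shapes and sizes, not the paper's implementation; the codebook size, feature dimension, and tensor layout are all placeholders.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQ layer: maps continuous scene features to discrete tokens.

    Codebook size, feature dimension, and tensor layout are illustrative
    assumptions, not the paper's exact configuration.
    """
    def __init__(self, num_codes: int = 512, code_dim: int = 128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z: torch.Tensor):
        # z: (B, H, W, C) continuous features from the occupancy encoder
        flat = z.reshape(-1, z.shape[-1])                     # (B*H*W, C)
        # Squared L2 distance from each feature to every codebook entry
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2.0 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))         # (B*H*W, K)
        idx = dist.argmin(dim=1)                              # discrete scene tokens
        z_q = self.codebook(idx).view_as(z)                   # quantized features
        # Straight-through estimator so gradients flow back to the encoder
        z_q = z + (z_q - z).detach()
        return z_q, idx.view(z.shape[:-1])
```

The discrete token indices are what the downstream transformer consumes, exactly as word tokens are consumed in language modeling.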

Numerical Results

OccWorld is evaluated on the nuScenes benchmark, demonstrating its ability to model the evolution of driving scenes without requiring instance or map supervision. The reported average Intersection over Union (IoU) is 26.63% for 3-second future occupancy prediction given a 2-second history, indicating a meaningful capacity to anticipate scene dynamics across space and time. The authors also report an average L2 error of 1.16 m for planned trajectories, demonstrating competitive performance in motion planning tasks.
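
As a reference for how such numbers are computed, the snippet below sketches a voxel-level IoU between a predicted and a ground-truth occupancy grid. It is a simplified illustration of the general metric, not the benchmark's exact evaluation code, which typically also handles semantic classes and visibility masks.

```python
import numpy as np

def occupancy_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU over occupied voxels for boolean (X, Y, Z) occupancy grids."""
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection) / max(float(union), 1.0)

# Toy usage on random grids (real evaluation uses benchmark ground truth)
pred = np.random.rand(200, 200, 16) > 0.9
gt = np.random.rand(200, 200, 16) > 0.9
print(f"IoU: {occupancy_iou(pred, gt):.4f}")
```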

Implications and Future Directions

OccWorld proposes a paradigm shift from traditional sequential processing in autonomous driving (i.e., perception, prediction, and planning) to an integrated model leveraging occupancy data. The paper underlines the potential of employing generative models and self-supervised learning techniques for efficient forecasting of 3D occupancy without extensive labeled data requirements. Furthermore, the successful implementation of OccWorld could substantially reduce the complexity and computational demands associated with high-definition map annotations and 3D bounding box tracking.
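
The integrated formulation can be pictured as a single autoregressive rollout in which scene and ego tokens are predicted together rather than by separate perception, prediction, and planning modules. The sketch below illustrates that loop; `tokenizer`, `world_model.step`, and the six-step horizon are hypothetical stand-ins chosen only for illustration, not the paper's actual interfaces.

```python
def rollout(history_occ, history_ego, tokenizer, world_model, horizon=6):
    """Autoregressive rollout: jointly predict future scenes and ego motion.

    `tokenizer` and `world_model` are hypothetical stand-ins for the
    paper's components; their interfaces are assumptions for illustration.
    """
    scene_tokens = [tokenizer.encode(occ) for occ in history_occ]
    ego_tokens = list(history_ego)
    futures = []
    for _ in range(horizon):
        # One GPT-like step: predict the next scene tokens and ego token
        next_scene, next_ego = world_model.step(scene_tokens, ego_tokens)
        futures.append((tokenizer.decode(next_scene), next_ego))
        # Feed predictions back in so later steps condition on earlier ones
        scene_tokens.append(next_scene)
        ego_tokens.append(next_ego)
    return futures  # list of (future occupancy grid, planned ego waypoint)
```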

The main limitation, as identified by the authors, is the inability to forecast the appearance of new agents that are absent from past observations. Future work could explore hybrid models incorporating additional sensor modalities or improved latent-space representations to mitigate this. The applicability of OccWorld also extends beyond autonomous driving, suggesting utility in robotic navigation and scene-understanding applications, and inspiring further advances in 3D scene modeling and prediction.

In conclusion, OccWorld presents a promising advance in autonomous driving by utilizing a comprehensive 3D occupancy world model, charting a path toward more interpretable and self-supervised systems. The framework's results indicate that it effectively models the evolution of driving environments, establishing a foundation for more robust and scalable autonomous vehicle technology.
