OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving (2311.16038v1)
Abstract: Understanding how the 3D scene evolves is vital for making decisions in autonomous driving. Most existing methods achieve this by predicting the movements of object boxes, which cannot capture more fine-grained scene information. In this paper, we explore a new framework, OccWorld, that learns a world model in the 3D occupancy space to simultaneously predict the movement of the ego car and the evolution of the surrounding scene. We propose to learn a world model based on 3D occupancy rather than 3D bounding boxes and segmentation maps for three reasons: 1) expressiveness: 3D occupancy describes the fine-grained 3D structure of the scene; 2) efficiency: 3D occupancy is more economical to obtain (e.g., from sparse LiDAR points); 3) versatility: 3D occupancy can adapt to both vision and LiDAR. To facilitate modeling of the world's evolution, we learn a reconstruction-based scene tokenizer on the 3D occupancy to obtain discrete scene tokens that describe the surrounding scene. We then adopt a GPT-like spatial-temporal generative transformer to generate subsequent scene and ego tokens, which are decoded into future occupancy and the ego trajectory. Extensive experiments on the widely used nuScenes benchmark demonstrate the ability of OccWorld to effectively model the evolution of driving scenes. OccWorld also produces competitive planning results without using instance and map supervision. Code: https://github.com/wzzheng/OccWorld.
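The abstract outlines a two-stage pipeline: a reconstruction-based tokenizer turns each occupancy frame into discrete scene tokens, and a GPT-like spatial-temporal transformer then autoregressively generates future scene and ego tokens. The PyTorch sketch below illustrates only this structure; the module shapes, VQ-style quantization, codebook size, and single-waypoint ego head are assumptions made for illustration, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

# Minimal sketch of the two-stage design described above, assuming a
# VQ-style tokenizer and a causal transformer. Module names, grid and
# codebook sizes, and the single-waypoint ego head are illustrative
# assumptions, not the released OccWorld code.


class SceneTokenizer(nn.Module):
    """Discretizes a semantic 3D occupancy grid into scene tokens."""

    def __init__(self, num_classes=18, height=16, embed_dim=128, codebook_size=512):
        super().__init__()
        # Fold the height axis into channels and downsample the BEV plane.
        self.encoder = nn.Sequential(
            nn.Conv2d(num_classes * height, embed_dim, 3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(embed_dim, embed_dim, 3, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, embed_dim)

    def forward(self, occ_onehot):
        # occ_onehot: (B, num_classes, height, X, Y) one-hot occupancy grid.
        b, c, h, x, y = occ_onehot.shape
        feat = self.encoder(occ_onehot.reshape(b, c * h, x, y))  # (B, D, X', Y')
        flat = feat.flatten(2).transpose(1, 2)                   # (B, N, D)
        # Nearest-codebook-entry quantization -> discrete scene token ids.
        dist = (flat.pow(2).sum(-1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(-1))
        return dist.argmin(-1)                                   # (B, N)


class WorldTransformer(nn.Module):
    """GPT-like decoder predicting next-frame scene tokens and an ego waypoint."""

    def __init__(self, codebook_size=512, embed_dim=128, n_layers=4, n_heads=8):
        super().__init__()
        self.token_embed = nn.Embedding(codebook_size, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.scene_head = nn.Linear(embed_dim, codebook_size)  # logits over next tokens
        self.ego_head = nn.Linear(embed_dim, 2)                 # next ego (x, y) offset

    def forward(self, token_ids):
        # token_ids: (B, T * N) history of scene tokens, flattened over time.
        x = self.token_embed(token_ids)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        x = self.blocks(x, mask=causal)
        return self.scene_head(x), self.ego_head(x.mean(dim=1))


if __name__ == "__main__":
    # Toy-sized occupancy grid (the real benchmark grid is larger).
    occ = torch.zeros(1, 18, 16, 100, 100)
    occ[:, 0] = 1.0  # mark every voxel as the "empty" class for this demo
    tokens = SceneTokenizer()(occ)                  # (1, 625) scene token ids
    scene_logits, ego_xy = WorldTransformer()(tokens)
    print(tokens.shape, scene_logits.shape, ego_xy.shape)
```

In the full method, the decoder half of the tokenizer maps predicted scene tokens back to 3D occupancy and the predicted ego tokens are decoded into a planned trajectory; this sketch stops at the token level.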
Authors: Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, Jiwen Lu