Overview of "Doe-1: Closed-Loop Autonomous Driving with Large World Model"
The paper presents "Doe-1," a closed-loop autonomous driving framework that unifies perception, prediction, and planning with a large Driving wOrld modEl (hence the name). Its central novelty is an autoregressive generative formulation that casts autonomous driving as a next-token generation problem and handles all three tasks within a single model. The authors present Doe-1 as among the first frameworks to achieve closed-loop end-to-end autonomous driving.
Key Advances and Methodology
The authors introduce a generative autoregressive world model that formulates the evolution of the driving scene as a sequence of transitions: observation→description, description→action, and action→observation. The model processes these tasks as one interleaved stream of multi-modal tokens, which, according to the authors, overcomes limitations of traditional open-loop frameworks such as weak scalability and a lack of high-order interactions. The multi-modal formulation also lends itself to large-scale modeling by aligning vision-centric inputs (RGB images) with textual descriptions and tokenized actions, as sketched below.
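A minimal sketch of this interleaved, closed-loop rollout follows. All names here (the `model.generate` call, the tokenizer objects, the `env` interface, and the stop tokens) are hypothetical stand-ins for illustration, not the paper's actual API; the point is the single token stream alternating observation → description → action.

```python
from typing import List

def drive_closed_loop(model, image_tokenizer, text_tokenizer,
                      action_tokenizer, env, num_steps: int) -> None:
    """Hypothetical closed-loop rollout over one interleaved token stream."""
    tokens: List[int] = []
    obs = env.reset()                                   # current front-view RGB frame
    for _ in range(num_steps):
        tokens += image_tokenizer.encode(obs)           # observation tokens (VQ codes)
        desc = model.generate(tokens, stop="<eod>")     # description tokens (perception/QA)
        tokens += desc
        print(text_tokenizer.decode(desc))              # human-readable scene description
        act = model.generate(tokens, stop="<eoa>")      # discrete action tokens (planning)
        tokens += act
        obs = env.step(action_tokenizer.decode(act))    # execute the action, observe next frame
        # World-model mode: instead of observing, predict the next frame's tokens:
        #   tokens += model.generate(tokens, stop="<eoo>")
```

Because every stage reads and extends the same sequence, the action is conditioned on the generated description and the predicted next observation is conditioned on the action, which is what makes the loop closed.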
The model employs a position-aware tokenizer to convert the continuous action space into discrete tokens, tokenizes images with a vector-quantized variational autoencoder, and encodes scene descriptions with a standard text tokenizer. Self-attention over the shared sequence manages these transitions, modeling the driving environment directly in observation space without intermediate scene representations.
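To make the action tokenization concrete, here is a minimal sketch under the assumption that an action is a 2D ego displacement discretized into uniform per-axis bins; the ranges and bin count are illustrative choices, not the paper's values.

```python
# Illustrative position-aware action tokenizer: quantize a continuous 2D
# ego displacement (meters) into uniform bins, one token id per axis.
# The ranges and bin count below are assumptions, not the paper's values.

X_RANGE, Y_RANGE, NUM_BINS = (-10.0, 10.0), (0.0, 20.0), 128

def encode_action(dx: float, dy: float) -> tuple[int, int]:
    """Map a continuous displacement to a pair of discrete token ids."""
    def to_bin(v: float, lo: float, hi: float) -> int:
        v = min(max(v, lo), hi)                         # clamp to the valid range
        return int((v - lo) / (hi - lo) * (NUM_BINS - 1))
    return to_bin(dx, *X_RANGE), to_bin(dy, *Y_RANGE)

def decode_action(tx: int, ty: int) -> tuple[float, float]:
    """Invert the binning back to an approximate continuous displacement."""
    def to_val(t: int, lo: float, hi: float) -> float:
        return lo + t / (NUM_BINS - 1) * (hi - lo)
    return to_val(tx, *X_RANGE), to_val(ty, *Y_RANGE)

# Round trip: a 0.3 m lateral, 2.0 m forward move survives quantization
# up to bin resolution (~0.16 m per bin here).
print(decode_action(*encode_action(0.3, 2.0)))
```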
Experimental Results
Doe-1’s capabilities were evaluated on the widely used nuScenes dataset, demonstrating robust performance across driving tasks. It performs particularly well in visual question-answering and action-conditioned video generation, showing that it can infer context and generate future observations conditioned on actions. Together, these results support the efficacy of the next-token prediction formulation for autonomous driving.
Quantitatively, Doe-1 shows competitive results. In visual question-answering on benchmarks such as OmniDrive, it achieves promising METEOR and CIDEr scores, indicating solid scene understanding and response accuracy. It also delivers competitive end-to-end motion planning, with collision rates and L2 errors on par with existing models, despite relying solely on front-view camera input.
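For reference, the planning numbers on nuScenes are conventionally computed as the average L2 distance between predicted and ground-truth ego waypoints at 1 s/2 s/3 s horizons. The sketch below follows the common ST-P3-style protocol (2 Hz waypoints, error averaged up to each horizon), which is an assumption about, not a quotation of, the paper's exact evaluation code.

```python
import numpy as np

def l2_errors(pred: np.ndarray, gt: np.ndarray) -> dict[str, float]:
    """Average L2 planning error (meters) at 1 s / 2 s / 3 s horizons.

    pred, gt: (N, 6, 2) arrays of BEV ego waypoints sampled at 2 Hz,
    i.e. six future positions covering a 3-second horizon.
    """
    dists = np.linalg.norm(pred - gt, axis=-1)            # (N, 6) per-waypoint error
    return {f"L2@{k}s": float(dists[:, : 2 * k].mean())   # mean over waypoints up to k s
            for k in (1, 2, 3)}

# Toy usage: a single trajectory that is off by 0.5 m laterally everywhere.
gt = np.zeros((1, 6, 2))
pred = gt + np.array([0.5, 0.0])
print(l2_errors(pred, gt))   # ~0.5 m at every horizon
```

Note that some works instead report the error at the horizon waypoint only, which yields different absolute numbers; comparisons are only meaningful under a single convention.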
Practical and Theoretical Implications
The shift toward closed-loop modeling advocated by Doe-1 marks a departure from the conventional practice of decomposing driving into separate modules. The unified approach not only predicts trajectories but also anticipates future scene evolution conditioned on candidate actions. This bears directly on the practical scalability and robustness of autonomous systems, helping them adapt to the dynamic nature of real-world driving.
Theoretically, Doe-1 illustrates a paradigm in which the boundaries between perception, decision-making, and action blur, suggesting a new direction for autonomous driving research. It reinforces the potential of generative models to handle multifaceted tasks by sharing learned representations across modalities.
Future Directions
Future work may augment Doe-1 with richer sensor inputs such as LiDAR or surround-view camera feeds to bolster situational awareness. Another avenue is improving the interpretability of generative models, which remains a critical challenge for safety in autonomous driving. Training on more diverse datasets would also help ensure generalization across varied driving conditions.
In summary, the paper makes a compelling case for a paradigm shift in autonomous driving research, advocating an interconnected, closed-loop architecture exemplified by Doe-1. By integrating perception, decision-making, and action execution within a single autoregressive framework, it takes a concrete step toward unlocking new capabilities for autonomous systems.