DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT (2412.19505v2)

Published 27 Dec 2024 in cs.CV

Abstract: Recent successes in autoregressive (AR) generation models, such as the GPT series in natural language processing, have motivated efforts to replicate this success in visual tasks. Some works attempt to extend this approach to autonomous driving by building video-based world models capable of generating realistic future video sequences and predicting ego states. However, prior works tend to produce unsatisfactory results, as the classic GPT framework is designed to handle 1D contextual information, such as text, and lacks the inherent ability to model the spatial and temporal dynamics essential for video generation. In this paper, we present DrivingWorld, a GPT-style world model for autonomous driving, featuring several spatial-temporal fusion mechanisms. This design enables effective modeling of both spatial and temporal dynamics, facilitating high-fidelity, long-duration video generation. Specifically, we propose a next-state prediction strategy to model temporal coherence between consecutive frames and apply a next-token prediction strategy to capture spatial information within each frame. To further enhance generalization ability, we propose a novel masking strategy and reweighting strategy for token prediction to mitigate long-term drifting issues and enable precise control. Our work demonstrates the ability to produce high-fidelity and consistent video clips of over 40 seconds in duration, which is over 2 times longer than state-of-the-art driving world models. Experiments show that, in contrast to prior works, our method achieves superior visual quality and significantly more accurate controllable future video generation. Our code is available at https://github.com/YvanYin/DrivingWorld.

Summary

  • The paper presents a novel GPT-style framework that generates long-duration video sequences for autonomous driving using temporally aware tokenization.
  • It employs a hybrid token prediction strategy with masking and reweighting to mitigate drift and preserve coherence over sequences exceeding 40 seconds.
  • The model demonstrates improved predictive accuracy and video quality, doubling sequence lengths and enabling more robust autonomous driving decisions.

DrivingWorld: A Video Generation Model for Autonomous Driving

The paper "DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT" introduces a novel framework for generating high-fidelity, long-duration video sequences pertinent to autonomous driving. Drawing inspiration from autoregressive models like GPT, well-established in NLP, the authors aim to bridge a substantial gap in applying similar methodologies to video-based tasks, specifically targeting the domain of autonomous driving.

The autonomous driving domain faces a critical challenge in predicting future states from past observations. Existing approaches often produce unsatisfactory outputs because they fail to capture the intricate spatial and temporal dynamics of video data. Prior efforts such as GAIA-1 carried autoregressive modeling over from text to video but fell short on fidelity and long-horizon coherence. DrivingWorld addresses these limitations with dedicated spatial-temporal fusion mechanisms.

Contributions and Methodology

The core of DrivingWorld lies in its adoption of a GPT-style framework, enhanced to handle video data, whose spatial and temporal dynamics go beyond the 1D sequential structure that classic GPT models were designed for (e.g., text). Several strategies form the bedrock of this model:

  1. Temporal-Aware Tokenization: The model encodes each frame as a grid of discrete tokens that remain temporally coherent across frames. This facilitates accurate next-frame prediction by focusing on temporally linked features.
  2. Hybrid Token Prediction Strategy: DrivingWorld couples next-state prediction with next-token prediction. By first establishing temporal coherence across frames and then decoding spatial detail within each frame, it models transitions between states more effectively (see the first sketch after this list).
  3. Enhanced Control through Masking and Reweighting: To address long-term drift, a common failure mode where video fidelity deteriorates over extended sequences, DrivingWorld employs a masking strategy and a reweighting strategy for token prediction (see the second sketch after this list). This mitigates drift during long-horizon rollout and keeps frame outputs consistent over sequences extending beyond 40 seconds.
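To make the hybrid prediction scheme concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: the module and method names (`tokenizer.encode`, `model.temporal_context`, `model.spatial_step`, `tokenizer.tokens_per_frame`, etc.) are assumptions chosen for illustration. It shows an outer frame-level loop (next-state prediction over time) wrapping an inner autoregressive loop that emits the tokens of each new frame (next-token prediction over space).

```python
import torch

@torch.no_grad()
def generate_future_frames(model, tokenizer, past_frames, ego_states, n_future):
    """Roll out future frames autoregressively (illustrative sketch only)."""
    # Encode the observed history into per-frame grids of discrete token ids.
    token_history = [tokenizer.encode(frame) for frame in past_frames]

    generated = []
    for _ in range(n_future):
        # Next-state prediction: summarize the temporal context so far
        # (all previous frame tokens plus ego-state conditioning).
        temporal_ctx = model.temporal_context(token_history, ego_states)

        # Next-token prediction: fill in the new frame token by token,
        # attending to the temporal context and the tokens emitted so far.
        new_frame_tokens = []
        for pos in range(tokenizer.tokens_per_frame):
            logits = model.spatial_step(temporal_ctx, new_frame_tokens, pos)
            next_tok = torch.distributions.Categorical(logits=logits).sample()
            new_frame_tokens.append(next_tok)

        new_frame_tokens = torch.stack(new_frame_tokens)
        token_history.append(new_frame_tokens)                # extend temporal context
        generated.append(tokenizer.decode(new_frame_tokens))  # tokens back to pixels

    return generated
```

The key design point the sketch captures is the separation of concerns: temporal coherence is handled once per frame at the state level, while spatial detail is decoded token by token within that frame.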
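The masking and reweighting idea can likewise be sketched as a training-time objective, again under assumed names (`model.mask_token_id`, the shapes, and the weighting scheme are hypothetical): a fraction of context tokens is randomly masked so the model learns to predict well from corrupted context, which is what it faces once its own predictions start to drift, and per-token cross-entropy terms are reweighted rather than averaged uniformly.

```python
import torch
import torch.nn.functional as F

def masked_reweighted_loss(model, context_tokens, target_tokens,
                           mask_prob=0.1, token_weights=None):
    """Illustrative training objective combining random context masking
    with reweighted next-token cross-entropy; not the paper's exact loss."""
    # Randomly replace a fraction of context tokens with a MASK id so the
    # model learns to cope with corrupted (drifted) context at inference time.
    mask = torch.rand(context_tokens.shape, device=context_tokens.device) < mask_prob
    corrupted = context_tokens.masked_fill(mask, model.mask_token_id)

    logits = model(corrupted)                      # (B, N, vocab_size)
    ce = F.cross_entropy(logits.flatten(0, 1),     # per-token loss, no averaging yet
                         target_tokens.flatten(),
                         reduction="none")

    if token_weights is None:
        token_weights = torch.ones_like(ce)        # fall back to uniform weighting
    # Reweight before averaging, e.g. to emphasize tokens tied to ego control.
    return (ce * token_weights.flatten()).mean()
```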

Results and Implications

Empirical evaluations underscore the strength of DrivingWorld, showcasing its ability to generate sequences over twice as long as the current state of the art, with superior visual quality and predictive accuracy. This is particularly relevant for applications requiring nuanced prediction and decision-making, such as driving through varied and challenging environments.

The implications of this research extend beyond video generation, pointing to practical avenues for improved planning and decision-making in autonomous systems. Because the model is trained on a hybrid dataset that combines multiple sources of real-world driving data, DrivingWorld remains robust across a wide spectrum of scenarios, including out-of-distribution cases that conventional training sets do not cover.

Future Directions

While DrivingWorld marks significant progress, future research could explore multi-modal integration, fusing data from other sensor types (e.g., LiDAR, radar) with video to enable more comprehensive scene understanding. Integrating more advanced prediction frameworks could further improve robustness to highly dynamic and rare scenarios, positioning DrivingWorld as a foundation for next-generation autonomous driving solutions.

In summary, the DrivingWorld model represents a significant advancement in video-based model prediction for autonomous driving, adeptly pushing the envelope toward longer, more coherent video sequence generation. Through its novel spatial-temporal architecture, it opens promising avenues for richer predictive models critical for autonomous system advancements.
