
GenAD: Generalized Predictive Model for Autonomous Driving (2403.09630v2)

Published 14 Mar 2024 in cs.CV

Abstract: In this paper, we introduce the first large-scale video prediction model in the autonomous driving discipline. To eliminate the restriction of high-cost data collection and empower the generalization ability of our model, we acquire massive data from the web and pair it with diverse and high-quality text descriptions. The resultant dataset accumulates over 2000 hours of driving videos, spanning areas all over the world with diverse weather conditions and traffic scenarios. Inheriting the merits from recent latent diffusion models, our model, dubbed GenAD, handles the challenging dynamics in driving scenes with novel temporal reasoning blocks. We showcase that it can generalize to various unseen driving datasets in a zero-shot manner, surpassing general or driving-specific video prediction counterparts. Furthermore, GenAD can be adapted into an action-conditioned prediction model or a motion planner, holding great potential for real-world driving applications.

Exploring the Frontier of Large-Scale Video Prediction for Autonomous Driving with OpenDV-2K and GenAD

Introduction to OpenDV-2K and GenAD

In the field of autonomous driving, the capability to predict future driving scenarios is paramount for the development of intelligent systems that can make informed decisions. Recognizing this, the paper titled "Generalized Predictive Model for Autonomous Driving" introduces OpenDV-2K, an extensive multimodal driving video dataset, and GenAD, a generative model designed for autonomous driving applications. OpenDV-2K stands out as the largest dataset to date in this domain, featuring over 2000 hours of driving videos with broad diversity across geographic locations, weather conditions, and traffic dynamics. Building on this rich dataset, GenAD leverages the power of latent diffusion models to predict future driving scenes with strong accuracy and generalizability.

Data Diversity and Collection

OpenDV-2K curates a wide array of driving videos from YouTube and licensed datasets, aiming to capture the global spectrum of driving conditions. The collection process ensures diversity in geographic distribution, traffic scenarios, weather conditions, and more. The dataset is meticulously cleaned and annotated with descriptive texts to enrich data quality, making it well suited to model training and evaluation. A comprehensive analysis shows that OpenDV-2K covers a broad range of driving scenarios, making it a strong benchmark for advancing autonomous driving research.

GenAD: Structure and Learning Process

GenAD introduces a novel temporal generative model that operates in two stages: image domain transfer and video prediction pre-training. Initially, it fine-tunes an image diffusion model on driving images from OpenDV-2K to grasp the domain-specific visual details. Subsequently, the model incorporates temporal reasoning blocks into the learning process, allowing it to capture the dynamic nature of driving scenes efficiently. This innovative approach facilitates strong generalization across diverse driving datasets and enables zero-shot domain transfer capabilities.
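The paper does not spell out the internals of its temporal reasoning blocks here, but a common realization of temporal reasoning in video diffusion models is causal self-attention over the time axis, applied independently at each spatial location of the latent features. The sketch below illustrates that general idea in NumPy; it is an assumption about the block's shape, not GenAD's actual implementation, and the weight matrices `w_q`, `w_k`, `w_v` are hypothetical placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(frames, w_q, w_k, w_v):
    """Causal self-attention across the time axis, per spatial location.

    frames: (T, HW, C) latent features for T frames, HW spatial positions,
            C channels. w_q/w_k/w_v: hypothetical (C, C) projection weights.
    """
    T, HW, C = frames.shape
    x = frames.transpose(1, 0, 2)                    # (HW, T, C): time becomes the sequence axis
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(C)   # (HW, T, T) attention logits
    # Causal mask: a frame may attend only to itself and earlier frames,
    # so predictions never leak information from the future.
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[:, mask] = -1e9
    out = softmax(scores) @ v                        # (HW, T, C)
    return out.transpose(1, 0, 2)                    # back to (T, HW, C)
```

Because spatial attention in the underlying image diffusion model already mixes information within a frame, a time-axis block like this only needs to propagate information between frames, which keeps the added cost linear in the number of spatial positions.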

Practical Implications and Theoretical Contributions

The GenAD model's exceptional performance on video prediction has profound implications for autonomous driving. Besides surpassing contemporary models in zero-shot generalization tasks, it showcases versatility by adapting to action-conditioned prediction models and motion planning tasks. This adaptability indicates the model's potential to serve as a foundational component in developing more advanced autonomous driving systems. Theoretically, GenAD contributes to the understanding of how predictive modeling of dynamic and complex real-world scenarios can be optimized through large-scale, diverse datasets and specially tailored generative models.
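The paper does not detail how the action-conditioned variant injects ego actions. One standard way to adapt a pretrained generative backbone to an extra conditioning signal is FiLM-style feature modulation, sketched below as an illustration only; the action encoding and the weights `w_embed`, `w_scale`, `w_shift` are hypothetical, not GenAD's actual mechanism.

```python
import numpy as np

def condition_on_action(frame_feat, action, w_embed, w_scale, w_shift):
    """FiLM-style conditioning: scale and shift frame features by an
    embedding of the ego action (hypothetical adaptation scheme).

    frame_feat: (HW, C) latent features of one frame.
    action:     (A,) ego control/trajectory vector (assumed encoding).
    """
    h = np.tanh(action @ w_embed)      # (C,) action embedding
    scale = 1.0 + h @ w_scale          # (C,) per-channel scale, identity at h = 0
    shift = h @ w_shift                # (C,) per-channel shift
    return frame_feat * scale + shift
```

A convenient property of this parameterization is that a zero action embedding leaves the features untouched, so the pretrained video-prediction behavior is preserved at initialization and the action pathway can be learned with a small amount of labeled driving data.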

Future Directions and Conclusion

Looking forward, the GenAD framework opens up avenues for further refining the model's efficiency and deployment capabilities. Moreover, the OpenDV-2K dataset presents opportunities for extensive research outside video prediction, such as perception tasks and policy learning in autonomous driving. This paper not only sets a new benchmark in autonomous driving research with OpenDV-2K and GenAD but also paves the way for future explorations into scalable predictive modeling and its practical applications in the field.

Authors (14)
  1. Jiazhi Yang (8 papers)
  2. Shenyuan Gao (9 papers)
  3. Yihang Qiu (4 papers)
  4. Li Chen (590 papers)
  5. Tianyu Li (101 papers)
  6. Bo Dai (245 papers)
  7. Kashyap Chitta (30 papers)
  8. Penghao Wu (17 papers)
  9. Jia Zeng (45 papers)
  10. Ping Luo (340 papers)
  11. Jun Zhang (1008 papers)
  12. Andreas Geiger (136 papers)
  13. Yu Qiao (563 papers)
  14. Hongyang Li (99 papers)
Citations (29)