- The paper introduces VaViM, an auto-regressive video model, and VaVAM, its companion video-action model, demonstrating the use of video generative pre-training for end-to-end autonomous driving.
- VaViM learns spatio-temporal dynamics by modeling tokenized video sequences, while VaVAM integrates an action expert trained via imitation learning on driving datasets to generate trajectories.
- Evaluations show VaVAM achieves commendable performance in photorealistic simulations, particularly in safety-critical scenarios, highlighting the potential of generative models despite challenges in trajectory adherence and adaptive decision-making.
An Overview of VaViM and VaVAM: Autonomous Driving through Video Generative Modeling
The integration of generative video models into autonomous driving marks a significant step for the field, as exemplified by the work on VaViM and VaVAM. This research investigates how large-scale video generation and action modeling can be put to use for autonomous driving, presenting a complete pipeline from perception to action. VaViM, an auto-regressive video model, and VaVAM, its companion video-action model, together demonstrate how video generative pre-training can be leveraged for navigating real-world driving environments.
Model Architecture and Training
VaViM is formulated as an auto-regressive model that predicts video frames through spatio-temporal token sequences, capturing the nuanced semantics and dynamics of driving scenarios. The model uses a pre-trained vector-quantization tokenizer (VQ-VAE) to map high-fidelity video frames into a discrete token space, enabling efficient and scalable learning of video generation. A GPT-2-inspired architecture then models these token sequences to capture temporal dependencies.
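To make the token-level formulation concrete, here is a minimal sketch of next-token training over discrete video tokens. It assumes the VQ tokenizer has already turned frames into index sequences; the class names (`TinyVideoGPT`, `next_token_loss`), layer sizes, and masking details are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVideoGPT(nn.Module):
    """Minimal GPT-style model over discrete video tokens (illustrative only)."""
    def __init__(self, vocab_size=8192, d_model=512, n_layers=6, n_heads=8, max_len=4096):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):  # tokens: (B, T) indices produced by the VQ tokenizer
        B, T = tokens.shape
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # causal mask so each position only attends to the past
        mask = torch.triu(torch.ones(T, T, device=tokens.device, dtype=torch.bool), diagonal=1)
        x = self.blocks(x, mask=mask)
        return self.head(x)  # (B, T, vocab_size) next-token logits

def next_token_loss(model, tokens):
    """Standard auto-regressive cross-entropy: predict token t+1 from tokens <= t."""
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
```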
To extend this framework into actionable outputs for autonomous driving, VaVAM incorporates an action expert module based on flow matching, trained through imitation learning on real-world driving datasets. This module specializes in generating driving trajectories by incrementally refining random noise into a coherent sequence of driving maneuvers using learned video features.
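The flow-matching idea behind the action expert can be sketched in a few lines: train a small network to predict the velocity that transports noise toward an expert trajectory along a straight-line path, then integrate that velocity at inference to turn noise into waypoints. The network shape, conditioning dimensions, and Euler solver below are assumptions for illustration, not the paper's exact action expert.

```python
import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    """Toy flow-matching head: predicts a velocity field over waypoint trajectories,
    conditioned on video features (names and sizes are illustrative)."""
    def __init__(self, horizon=6, feat_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(horizon * 2 + feat_dim + 1, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, horizon * 2),
        )
        self.horizon = horizon

    def forward(self, noisy_traj, t, video_feat):
        # noisy_traj: (B, horizon, 2) interpolated waypoints, t: (B, 1) flow time
        inp = torch.cat([noisy_traj.flatten(1), video_feat, t], dim=-1)
        return self.net(inp).view(-1, self.horizon, 2)

def flow_matching_loss(expert, traj, video_feat):
    """Conditional flow matching: regress the velocity (traj - noise) along the
    straight-line path x_t = (1 - t) * noise + t * traj."""
    noise = torch.randn_like(traj)
    t = torch.rand(traj.size(0), 1, device=traj.device)
    x_t = (1 - t).unsqueeze(-1) * noise + t.unsqueeze(-1) * traj
    target_v = traj - noise
    pred_v = expert(x_t, t, video_feat)
    return ((pred_v - target_v) ** 2).mean()

@torch.no_grad()
def sample_trajectory(expert, video_feat, steps=10):
    """Euler integration from random noise to a driving trajectory."""
    x = torch.randn(video_feat.size(0), expert.horizon, 2, device=video_feat.device)
    for i in range(steps):
        t = torch.full((video_feat.size(0), 1), i / steps, device=video_feat.device)
        x = x + expert(x, t, video_feat) / steps
    return x
```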
Data and Methodology
Data handling for VaViM proceeds in two distinct stages. First, VaViM is pre-trained on a diverse mix of data sources, predominantly OpenDV, which comprises over 1,700 hours of unannotated driving videos from various regions. Subsequent fine-tuning on datasets such as nuPlan and nuScenes allows VaViM to adapt to task-specific nuances, such as the lane keeping and obstacle avoidance typically encountered in autonomous driving.
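A hedged sketch of this two-stage schedule, reusing the `next_token_loss` sketch above; the step counts and loader setup are placeholders, not the paper's values.

```python
from itertools import cycle

def train_vavim_stages(model, opendv_loader, finetune_loader, optimizer, steps=(1000, 200)):
    """Illustrative two-stage schedule: large-scale pre-training on unlabeled OpenDV
    token sequences, then fine-tuning on nuPlan/nuScenes token sequences."""
    for loader, n_steps in zip([opendv_loader, finetune_loader], steps):
        stream = cycle(loader)  # cycle through the dataset for the allotted steps
        for _ in range(n_steps):
            tokens = next(stream)                  # (B, T) discrete video tokens
            loss = next_token_loss(model, tokens)  # defined in the sketch above
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```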
VaVAM, in turn, undergoes imitation learning on ground-truth expert trajectories derived from nuPlan and nuScenes, guided by high-level navigational commands. This training paradigm bridges the gap between video understanding and decision-making while respecting the operational safety requirements of autonomous driving.
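As a rough illustration of the imitation stage, the snippet below conditions the flow-matching sketch above on video features fused with a one-hot navigation command; the command vocabulary and fusion scheme are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

COMMANDS = ["left", "straight", "right"]  # hypothetical high-level command set

def imitation_step(expert, video_feat, gt_traj, command_idx, optimizer):
    """One imitation-learning update: condition the action expert on video features
    plus a one-hot navigation command and regress toward the expert trajectory.
    Note: the ActionExpert must be built with feat_dim = video feature dim + len(COMMANDS)."""
    cmd = F.one_hot(command_idx, num_classes=len(COMMANDS)).float()  # (B, 3)
    cond = torch.cat([video_feat, cmd], dim=-1)                      # fused conditioning
    loss = flow_matching_loss(expert, gt_traj, cond)                 # defined above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```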
Evaluation and Results
The results are contextualized through both open- and closed-loop evaluation frameworks. Open-loop evaluations focus on the accuracy of predicted trajectories relative to expert demonstrations, and show that performance improves as compute and training data are scaled up. Although trajectory diversity decreases with scale, the larger models consistently outperform smaller architectures.
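Open-loop trajectory accuracy is typically summarized with displacement errors against the expert trajectory. The following is a minimal sketch of two common metrics (minADE/minFDE); whether the paper reports exactly these variants is not confirmed here.

```python
import numpy as np

def min_ade(pred_trajs, gt_traj):
    """Minimum Average Displacement Error over K candidate trajectories.

    pred_trajs: (K, T, 2) predicted future waypoints in meters
    gt_traj:    (T, 2) expert (ground-truth) future waypoints
    """
    dists = np.linalg.norm(pred_trajs - gt_traj[None], axis=-1)  # (K, T) per-step errors
    return dists.mean(axis=-1).min()                             # best candidate's mean error

def min_fde(pred_trajs, gt_traj):
    """Minimum Final Displacement Error: error at the last waypoint of the best candidate."""
    return np.linalg.norm(pred_trajs[:, -1] - gt_traj[-1], axis=-1).min()
```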
Closed-loop evaluations using NeuroNCAP provide a more comprehensive analysis, probing the models' behavior in controlled, photorealistic simulations that include dynamic vehicular interactions and adversarial scenarios. VaVAM achieves commendable scores, particularly in frontal collision scenarios, underscoring its efficacy in safety-critical evaluations. However, remaining challenges in trajectory adherence and safety metrics point to an ongoing trade-off between fidelity to the training data and the adaptive decision-making required for safe navigation.
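For intuition, a closed-loop safety score in the spirit of NeuroNCAP can be sketched as full credit for avoiding a collision and partial credit for reducing impact speed relative to an unmitigated reference run; the exact constants and weighting below are assumptions, not NeuroNCAP's published formula.

```python
def neuroncap_style_score(collided: bool, impact_speed: float, reference_speed: float) -> float:
    """Illustrative closed-loop safety score: full marks for avoiding the collision,
    partial credit proportional to how much the impact speed was reduced relative to
    the reference (unmitigated) speed. Constants here are placeholders."""
    if not collided:
        return 5.0
    if reference_speed <= 0:
        return 0.0
    return 4.0 * max(0.0, 1.0 - impact_speed / reference_speed)
```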
Implications and Future Directions
This analysis of VaViM and VaVAM underscores the role of generative video models as foundational components for advancing autonomous vehicle systems. From a theoretical standpoint, the models open new pathways to understanding spatio-temporal dynamics and interactions in a driving context. Practically, the work informs future strategies in model scaling, dataset curation, and the application of self-supervised learning to push capabilities beyond their current limits.
Promising future avenues include integrating comprehensive reward signals and enhancing model architectures to develop robust world models. Such improvements would let video generative models not only anticipate possible future states but also reason about actions under uncertainty, delivering reliable autonomous driving solutions. Additionally, exploring richer input modalities and advanced post-processing techniques could improve the fidelity and applicability of these models in real-world environments.