
Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability (2405.17398v5)

Published 27 May 2024 in cs.CV and cs.AI

Abstract: World models can foresee the outcomes of different actions, which is of paramount importance for autonomous driving. Nevertheless, existing driving world models still have limitations in generalization to unseen environments, prediction fidelity of critical details, and action controllability for flexible application. In this paper, we present Vista, a generalizable driving world model with high fidelity and versatile controllability. Based on a systematic diagnosis of existing methods, we introduce several key ingredients to address these limitations. To accurately predict real-world dynamics at high resolution, we propose two novel losses to promote the learning of moving instances and structural information. We also devise an effective latent replacement approach to inject historical frames as priors for coherent long-horizon rollouts. For action controllability, we incorporate a versatile set of controls from high-level intentions (command, goal point) to low-level maneuvers (trajectory, angle, and speed) through an efficient learning strategy. After large-scale training, the capabilities of Vista can seamlessly generalize to different scenarios. Extensive experiments on multiple datasets show that Vista outperforms the most advanced general-purpose video generator in over 70% of comparisons and surpasses the best-performing driving world model by 55% in FID and 27% in FVD. Moreover, for the first time, we utilize the capacity of Vista itself to establish a generalizable reward for real-world action evaluation without accessing the ground truth actions.

Overview of Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability

The paper presents Vista, a driving world model developed to address limitations of existing models in generalization to unseen environments, prediction fidelity of critical details, and action controllability. The model is designed to foresee the outcomes of different actions, a capability that is central to safe and efficient autonomous driving in real-world scenarios.

Key Contributions

  1. Enhanced Generalization Capability:
    • Vista is trained on a large corpus of worldwide driving videos, which underpins its ability to generalize to unseen environments. By injecting historical frames as dynamic priors (carrying position, velocity, and acceleration cues) through latent replacement, the model maintains coherent long-horizon rollouts and predicts real-world dynamics across varied scenarios.
  2. High-Fidelity Prediction:
    • Two novel loss functions are introduced: a dynamics enhancement loss and a structure preservation loss. The former up-weights dynamic regions of the video, such as moving vehicles, while the latter preserves structural detail by emphasizing high-frequency components of the prediction. These additions significantly improve the visual accuracy and realism of future predictions at high resolution (576×1024 pixels); a minimal sketch of both ideas follows this list.
  3. Versatile Action Controllability:
    • Vista supports a diverse set of action formats, including high-level intentions (commands, goal points) and low-level maneuvers (trajectory, angle, and speed), through a unified conditioning interface and an efficient training strategy. This versatility extends the model's applicability to various autonomous driving tasks, from evaluating high-level policies to executing precise maneuvers.
  4. Evaluation of Real-World Actions:
    • Leveraging its own predictive capability, Vista serves as a generalizable reward function that evaluates real-world driving actions without access to ground-truth actions, using prediction uncertainty to assess action reliability. This broadens the model's utility in real-world settings where ground-truth data is often unavailable; a schematic sketch of the idea also appears after this list.
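
The paper does not include reference code for these losses, so the following is a minimal PyTorch sketch of the two ideas behind the high-fidelity objective: up-weighting the reconstruction error inside dynamic regions, and penalizing mismatches in high-frequency content. The mask source, the Laplacian high-pass filter, and all weighting factors are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# 3x3 Laplacian kernel used as a simple high-pass filter (illustrative choice).
_LAPLACIAN = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]]).view(1, 1, 3, 3)


def high_pass(x: torch.Tensor) -> torch.Tensor:
    """Extract per-channel high-frequency components of (B, C, H, W) images."""
    c = x.shape[1]
    kernel = _LAPLACIAN.to(x.device, x.dtype).repeat(c, 1, 1, 1)
    return F.conv2d(x, kernel, padding=1, groups=c)


def dynamics_enhancement_loss(pred, target, dynamic_mask, dyn_weight=5.0):
    """Reconstruction loss that up-weights dynamic regions (e.g. moving vehicles).

    `dynamic_mask` is assumed to be a (B, 1, H, W) soft mask over moving
    instances, obtained e.g. from optical-flow magnitude or tracking labels.
    """
    per_pixel = (pred - target) ** 2
    weights = 1.0 + (dyn_weight - 1.0) * dynamic_mask
    return (weights * per_pixel).mean()


def structure_preservation_loss(pred, target):
    """Penalize mismatched high-frequency components to keep edges and structure sharp."""
    return F.l1_loss(high_pass(pred), high_pass(target))


def total_loss(pred, target, dynamic_mask, lambda_dyn=1.0, lambda_struct=0.1):
    base = F.mse_loss(pred, target)  # standard reconstruction/denoising term
    dyn = dynamics_enhancement_loss(pred, target, dynamic_mask)
    struct = structure_preservation_loss(pred, target)
    return base + lambda_dyn * dyn + lambda_struct * struct
```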
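
The paper's reward is built from Vista's own prediction uncertainty; the snippet below is only a plausible, hedged realization of that idea: condition the model on a candidate action, draw several stochastic rollouts, and score the action by how consistent the predicted futures are. The `world_model.predict` interface and the variance-to-reward mapping are hypothetical.

```python
import torch


@torch.no_grad()
def action_reward(world_model, context_frames, action, num_samples=4):
    """Score a candidate action by the consistency of the model's predicted futures.

    Assumed (hypothetical) interface: world_model.predict(context, action) returns a
    tensor of future frames (T, C, H, W) and is stochastic across calls, e.g. a
    diffusion sampler drawing fresh noise each time.
    """
    rollouts = torch.stack(
        [world_model.predict(context_frames, action) for _ in range(num_samples)]
    )  # (num_samples, T, C, H, W)
    uncertainty = rollouts.var(dim=0).mean()  # per-pixel variance, averaged over the rollout
    return (-uncertainty).item()  # lower uncertainty -> higher reward (illustrative mapping)


def rank_actions(world_model, context_frames, candidate_actions):
    """Rank candidate actions (e.g. trajectories) without ground-truth labels."""
    scores = [action_reward(world_model, context_frames, a) for a in candidate_actions]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores
```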

Experimental Validation

A comprehensive set of experiments demonstrates Vista's superiority over existing driving world models. Key results include:

  • Quantitative Performance:
    • On the nuScenes validation set, Vista surpasses the best-performing driving world model with a 55% improvement in FID and a 27% improvement in FVD (a minimal sketch of frame-level FID computation follows this list).
  • Generalization Across Datasets:
    • Vista's predictions were consistently preferred by human evaluators over those of state-of-the-art video generation models across diverse datasets, including OpenDV-YouTube-val, nuScenes, Waymo, and CODA.
  • Long-Horizon Prediction:
    • Unlike previous models, Vista sustains realistic long-horizon prediction, maintaining high fidelity over 15-second rollouts, a capability critical for long-term planning in autonomous driving (a schematic rollout loop is sketched after this list).
  • Effective Action Control:
    • Evaluations revealed that applying action controls via high-level intentions or low-level maneuvers resulted in predictions closely mirroring true driving behaviors, evidenced by significant reductions in FVD scores.
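
For context on the quantitative metrics, the sketch below shows how a frame-level FID between real and predicted clips can be computed with the torchmetrics package (which relies on torch-fidelity); FVD follows the same recipe with a video feature extractor such as I3D. This is a generic illustration and may differ from the paper's exact evaluation protocol.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance


def frame_fid(real_videos: torch.Tensor, pred_videos: torch.Tensor) -> float:
    """Frame-level FID between real and predicted videos.

    Both inputs are assumed to be float tensors in [0, 1] with shape
    (N, T, 3, H, W); frames are pooled across the time dimension.
    """
    fid = FrechetInceptionDistance(feature=2048, normalize=True)
    fid.update(real_videos.flatten(0, 1), real=True)   # (N*T, 3, H, W)
    fid.update(pred_videos.flatten(0, 1), real=False)
    return float(fid.compute())
```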
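
Long-horizon prediction in this family of models is typically produced by rolling the model forward clip by clip and re-injecting the latest predictions as conditioning; Vista's latent replacement performs this injection in latent space. The loop below is a schematic sketch under assumed interfaces (`encode`, `sample_clip`, and `decode` are hypothetical names), not the released implementation.

```python
import torch


@torch.no_grad()
def long_horizon_rollout(model, initial_frames, action_seq, num_clips=6):
    """Roll the world model forward clip by clip, replacing the conditioning
    latents with the most recent predictions so consecutive clips stay coherent.

    Assumed (hypothetical) interface:
      model.encode(frames)                -> latents, shape (B, T, ...)
      model.sample_clip(latents, action)  -> predicted latents for the next clip
      model.decode(latents)               -> RGB frames, shape (B, T, C, H, W)
    """
    cond_latents = model.encode(initial_frames)  # history frames as dynamic priors
    num_cond = cond_latents.shape[1]
    clips = []
    for i in range(num_clips):
        action = action_seq[i]  # command / goal point / trajectory for this clip
        pred_latents = model.sample_clip(cond_latents, action)
        clips.append(model.decode(pred_latents))
        # Latent replacement: the tail of this clip becomes the history for the next one.
        cond_latents = pred_latents[:, -num_cond:]
    return torch.cat(clips, dim=1)  # (B, num_clips * clip_len, C, H, W)
```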

Implications and Future Directions

The implications of this research are multifaceted. Practically, Vista's ability to generalize and predict driving dynamics with high fidelity makes it a valuable tool for developing and testing autonomous driving systems. The versatility in action control also means it can be integrated into various stages of autonomous driving pipelines, from high-level planning to low-level motion control.

Theoretically, the paper introduces novel techniques that can be leveraged beyond autonomous driving. The dynamics enhancement and structure preservation loss functions can be adopted in other domains requiring high-fidelity video generation with complex dynamics.

Future research could explore scaling Vista to even larger datasets and integrating it with more scalable architectures to improve computational efficiency. Extending Vista's framework to other domains, such as robotics and simulation environments, could also prove beneficial.

Conclusion

Vista represents a significant step forward in the development of generalizable driving world models. Its enhanced fidelity, versatile controllability, and robust evaluation mechanism highlight its potential in pushing the boundaries of autonomous driving technologies. Future advancements based on this work could open new avenues for the broader application of AI-driven world models.

Authors (8)
  1. Shenyuan Gao
  2. Jiazhi Yang
  3. Li Chen
  4. Kashyap Chitta
  5. Yihang Qiu
  6. Andreas Geiger
  7. Jun Zhang
  8. Hongyang Li