Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving (2311.17918v1)

Published 29 Nov 2023 in cs.CV

Abstract: In autonomous driving, predicting future events and evaluating the foreseeable risks empower autonomous vehicles to better plan their actions, enhancing safety and efficiency on the road. To this end, we propose Drive-WM, the first driving world model compatible with existing end-to-end planning models. Through joint spatial-temporal modeling facilitated by view factorization, our model generates high-fidelity multiview videos of driving scenes. Building on this powerful generation ability, we showcase, for the first time, the potential of applying a world model to safe driving planning. In particular, Drive-WM enables driving into multiple futures based on distinct driving maneuvers and determines the optimal trajectory according to image-based rewards. Evaluation on real-world driving datasets verifies that our method can generate high-quality, consistent, and controllable multiview videos, opening up possibilities for real-world simulation and safe planning.
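The planning loop the abstract describes, imagining one multiview future per candidate maneuver and keeping the one with the highest image-based reward, can be illustrated with a minimal sketch. This is an assumption-laden illustration: `world_model.rollout`, `reward_fn`, and `candidate_maneuvers` are hypothetical names standing in for Drive-WM's video generator and its reward, not the authors' actual API.

```python
import numpy as np

def select_maneuver(world_model, reward_fn, state, candidate_maneuvers):
    """Return the candidate maneuver whose imagined future scores highest.

    Hypothetical interfaces (not the paper's API):
      world_model.rollout(state, maneuver) -> multiview video frames
      reward_fn(frames) -> scalar image-based reward, e.g. derived from
      perception cues such as staying on the drivable area.
    """
    best_maneuver, best_reward = None, -np.inf
    for maneuver in candidate_maneuvers:
        # Imagine a multiview rollout conditioned on this maneuver.
        frames = world_model.rollout(state, maneuver)
        # Score the imagined future with the image-based reward.
        reward = reward_fn(frames)
        if reward > best_reward:
            best_maneuver, best_reward = maneuver, reward
    return best_maneuver, best_reward
```

Repeating this selection at each planning step yields the "driving into multiple futures" behavior the abstract describes, with the world model acting as a learned simulator whose rollouts are scored rather than executed.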

Authors (6)
  1. Yuqi Wang (62 papers)
  2. Jiawei He (41 papers)
  3. Lue Fan (26 papers)
  4. Hongxin Li (8 papers)
  5. Yuntao Chen (37 papers)
  6. Zhaoxiang Zhang (162 papers)
Citations (67)
