
DiVE: DiT-based Video Generation with Enhanced Control (2409.01595v1)

Published 3 Sep 2024 in cs.CV

Abstract: Generating high-fidelity, temporally consistent videos in autonomous driving scenarios faces significant challenges, e.g. problematic maneuvers in corner cases. Although recent video generation works, notably models built on top of Diffusion Transformers (DiT), have been proposed to tackle this problem, their potential for multi-view video generation remains unexplored. We propose the first DiT-based framework specifically designed for generating temporally and multi-view consistent videos that precisely match a given bird's-eye view layout control. Specifically, the proposed framework leverages a parameter-free spatial view-inflated attention mechanism to guarantee cross-view consistency, and integrates joint cross-attention modules and ControlNet-Transformer to further improve the precision of control. To demonstrate these advantages, we conduct extensive qualitative comparisons on the nuScenes dataset, particularly in some of the most challenging corner cases. Experiments show that the proposed method produces long, controllable, and highly consistent videos under difficult conditions.
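The "parameter-free spatial view-inflated attention" described in the abstract follows a general pattern seen in multi-view generation: instead of attending within each camera view separately, the per-view token sequences are folded into one long sequence so that ordinary self-attention spans all views at once, introducing no new weights. The sketch below illustrates that reshaping idea with plain scaled dot-product attention; the tensor layout, function names, and sizes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # standard scaled dot-product attention (no learned projections,
    # hence "parameter-free")
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores) @ v

def view_inflated_attention(x):
    # x: (batch, views, tokens_per_view, dim). Folding the view axis
    # into the token axis lets one attention pass mix information
    # across all camera views, encouraging cross-view consistency.
    b, v, n, d = x.shape
    flat = x.reshape(b, v * n, d)      # (B, V*N, D): views joined
    out = attention(flat, flat, flat)  # tokens attend across views
    return out.reshape(b, v, n, d)     # restore per-view layout

# toy example: 2 samples, 6 camera views, 4 tokens each, dim 8
x = np.random.default_rng(0).normal(size=(2, 6, 4, 8))
y = view_inflated_attention(x)
print(y.shape)  # (2, 6, 4, 8) — same layout, now cross-view mixed
```

Because the mechanism only reshapes tensors around an existing attention operation, it can reuse the weights of a pretrained single-view DiT block unchanged, which is presumably why the paper calls it parameter-free.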

