DiVE: DiT-based Video Generation with Enhanced Control (2409.01595v1)
Abstract: Generating high-fidelity, temporally consistent videos in autonomous driving scenarios is challenging, especially for problematic maneuvers in corner cases. Although recent video generation works built on Diffusion Transformers (DiT) have been proposed to tackle this problem, their potential for multi-view video generation remains unexplored. We propose the first DiT-based framework specifically designed for generating temporally and multi-view consistent videos that precisely match a given bird's-eye-view layout control. The framework leverages a parameter-free spatial view-inflated attention mechanism to guarantee cross-view consistency, and integrates joint cross-attention modules and ControlNet-Transformer to further improve control precision. To demonstrate these advantages, we conduct extensive qualitative comparisons on the nuScenes dataset, particularly in some of the most challenging corner cases. Our method proves effective in producing long, controllable, and highly consistent videos under difficult conditions.
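The "parameter-free" aspect of view-inflated attention can be illustrated with a minimal sketch: spatial tokens from all camera views are merged into one token sequence, so an existing spatial attention layer attends across views without any new weights. The shapes, the `toy_attention` stand-in, and the function names here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def toy_attention(x):
    # Stand-in for a pretrained spatial self-attention layer
    # (identity Q/K/V projections, just to exercise the reshaping).
    w = softmax(x @ x.swapaxes(-2, -1) / np.sqrt(x.shape[-1]))
    return w @ x

def view_inflated_attention(tokens, attn):
    # tokens: (B, V, N, C) = batch, camera views, spatial tokens, channels
    B, V, N, C = tokens.shape
    # Inflate: merge the view axis into the token axis -> (B, V*N, C),
    # so the unmodified attention layer spans all views jointly.
    x = attn(tokens.reshape(B, V * N, C))
    # Restore the per-view layout; no new parameters were introduced.
    return x.reshape(B, V, N, C)

out = view_inflated_attention(np.random.randn(2, 6, 16, 8), toy_attention)
print(out.shape)  # (2, 6, 16, 8)
```

Because the reshape is the only change, any pretrained spatial attention block can be reused as-is, which is what makes the mechanism parameter-free.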