
Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation (2405.15881v1)

Published 24 May 2024 in cs.CV, cs.AI, and cs.LG

Abstract: In recent developments, the Mamba architecture, known for its selective state space approach, has shown potential in the efficient modeling of long sequences. However, its application in image generation remains underexplored. Traditional diffusion transformers (DiT), which utilize self-attention blocks, are effective but their computational complexity scales quadratically with the input length, limiting their use for high-resolution images. To address this challenge, we introduce a novel diffusion architecture, Diffusion Mamba (DiM), which foregoes traditional attention mechanisms in favor of a scalable alternative. By harnessing the inherent efficiency of the Mamba architecture, DiM achieves rapid inference times and reduced computational load, maintaining linear complexity with respect to sequence length. Our architecture not only scales effectively but also outperforms existing diffusion transformers in both image and video generation tasks. The results affirm the scalability and efficiency of DiM, establishing a new benchmark for image and video generation techniques. This work advances the field of generative models and paves the way for further applications of scalable architectures.
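The abstract's central claim is that replacing self-attention with a state-space recurrence keeps cost linear in sequence length, and that scanning in both directions restores the global (non-causal) context image patches need. The sketch below illustrates that idea only; it is not the paper's implementation. The diagonal recurrence, the function names (`ssm_scan`, `bidirectional_ssm_block`), and the per-channel parameters `a`, `b`, `c` are all illustrative assumptions, not details from the paper.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Linear-time scan of a toy diagonal state-space model:
        h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
    x: (T, D) sequence of D-dim tokens; a, b, c: (D,) channel params.
    One pass over the sequence, so cost is O(T) rather than the
    O(T^2) of self-attention.
    """
    T, D = x.shape
    h = np.zeros(D)
    y = np.empty_like(x)
    for t in range(T):
        h = a * h + b * x[t]   # carry compressed history forward
        y[t] = c * h           # read out the state
    return y

def bidirectional_ssm_block(x, a, b, c):
    """Sum a forward and a backward scan so every output position
    can depend on the whole sequence (a causal scan alone cannot
    give early tokens access to later ones). Two linear passes,
    so the block is still O(T) overall."""
    fwd = ssm_scan(x, a, b, c)
    bwd = ssm_scan(x[::-1], a, b, c)[::-1]
    return fwd + bwd
```

A quick way to see why the backward pass matters: with the forward scan alone, perturbing the last token cannot change the first output, whereas the bidirectional block propagates that perturbation to every position.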

Authors (2)
  1. Shentong Mo (56 papers)
  2. Yapeng Tian (80 papers)
Citations (8)