
DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention (2405.18428v2)

Published 28 May 2024 in cs.CV and cs.AI

Abstract: Diffusion models with large-scale pre-training have achieved significant success in the field of visual content generation, particularly exemplified by Diffusion Transformers (DiT). However, DiT models have faced challenges with quadratic complexity efficiency, especially when handling long sequences. In this paper, we aim to incorporate the sub-quadratic modeling capability of Gated Linear Attention (GLA) into the 2D diffusion backbone. Specifically, we introduce Diffusion Gated Linear Attention Transformers (DiG), a simple, adoptable solution with minimal parameter overhead. We offer two variants, i.e., a plain and a U-shape architecture, showing superior efficiency and competitive effectiveness. In addition to superior performance to DiT and other sub-quadratic-time diffusion models at $256 \times 256$ resolution, DiG demonstrates greater efficiency than these methods starting from a $512$ resolution. Specifically, DiG-S/2 is $2.5\times$ faster and saves $75.7\%$ GPU memory compared to DiT-S/2 at a $1792$ resolution. Additionally, DiG-XL/2 is $4.2\times$ faster than the Mamba-based model at a $1024$ resolution and $1.8\times$ faster than DiT with FlashAttention-2 at a $2048$ resolution. Code is released at https://github.com/hustvl/DiG.

An Overview of Diffusion Gated Linear Attention Transformers (DiG)

The paper presents a notable advancement in the field of visual content generation through diffusion models. At its core, the paper introduces Diffusion Gated Linear Attention Transformers (DiG), a novel architecture designed to overcome the scalability and efficiency limitations often encountered with traditional Diffusion Transformers (DiT).

Core Contributions

The principal objective of the research is to enhance the scalability and computational efficiency of diffusion models by integrating the long sequence modeling capabilities of Gated Linear Attention (GLA) Transformers into the diffusion framework. The resultant DiG model is positioned as a more efficient alternative to the generic DiT, demonstrating significant improvements in both processing speed and resource consumption.

Key contributions of this work include:

  1. Introduction of DiG Model: The DiG model is conceptualized by leveraging GLA Transformers, addressing the quadratic complexity challenge in traditional diffusion models.
  2. Efficiency Gains: DiG-S/2 achieves a $2.5\times$ increase in training speed compared to DiT-S/2 and exhibits a $75.7\%$ reduction in GPU memory usage for high-resolution images ($1792 \times 1792$).
  3. Scalability Analysis: The paper methodically analyzes the scalability of DiG across various computational complexities, demonstrating consistent performance improvements (decreasing FID) with increased model depth/width and input tokens.
  4. Comparative Efficiency: DiG-XL/2 outperforms the Mamba-based diffusion model by being $4.2\times$ faster at $1024$ resolution and is $1.8\times$ faster than a CUDA-optimized DiT utilizing FlashAttention-2 at $2048$ resolution.
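
These resolution-dependent gaps follow directly from the scaling behavior: for a fixed patch size, doubling an image's side length quadruples the token count, so a quadratic-complexity attention cost grows roughly 16x while a linear-time mechanism grows only 4x. A rough FLOP-proxy sketch (illustrative only; constant factors and the latent/patch pipeline are omitted, and the function name is hypothetical):

```python
def attention_flops(tokens: int, dim: int) -> tuple[int, int]:
    """Rough FLOP proxies with constant factors dropped:
    softmax attention scales as O(L^2 * d),
    gated linear attention as O(L * d^2)."""
    return tokens ** 2 * dim, tokens * dim ** 2

# Doubling the image side length (fixed patch size) gives 4x tokens.
base_soft, base_gla = attention_flops(1024, 64)
big_soft, big_gla = attention_flops(4 * 1024, 64)
# The quadratic term grows 16x while the linear term grows only 4x.
```

This is why the paper reports efficiency advantages that widen as resolution increases rather than a fixed constant-factor speedup.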

Methodological Advances

The methodology integrates the linear complexity benefits of GLA Transformers into the diffusion model paradigm, thereby constructing a more efficient architecture without significantly altering the underlying design of DiT. This alteration results in minimal parameter overhead while achieving notable improvements in performance and computational efficiency. DiG's architectural adjustments ensure that it remains highly adoptable and effective, particularly for applications requiring high-resolution image synthesis.
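
The linear-complexity behavior of GLA can be summarized by a recurrence: a matrix-valued state is decayed by data-dependent gates at each step and updated with a rank-1 outer product, so cost grows linearly in sequence length rather than quadratically. The sketch below is a minimal NumPy illustration of this recurrent form, not the paper's implementation (which relies on hardware-efficient chunk-parallel kernels); the function name and shapes are assumptions for exposition.

```python
import numpy as np

def gated_linear_attention(q, k, v, alpha):
    """Recurrent sketch of gated linear attention.

    q, k: (L, d_k); v: (L, d_v); alpha: (L, d_k) per-step decay gates in (0, 1).
    The state S has shape (d_k, d_v), so total cost is O(L * d_k * d_v),
    linear in sequence length L.
    """
    L, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((L, d_v))
    for t in range(L):
        # Data-dependent decay of the running state, then a rank-1 update.
        S = alpha[t][:, None] * S + np.outer(k[t], v[t])
        out[t] = q[t] @ S
    return out
```

With all gates fixed at 1, this reduces to plain (ungated) linear attention accumulating a running sum of key-value outer products.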

Experimental Validation

Extensive experimentation validates the performance claims of the proposed model. Key experimental results include:

  • DiG-S/2 not only improved training speeds but also significantly reduced GPU memory consumption compared to baseline models.
  • The scalability tests confirmed that increasing the model's depth or width, alongside augmenting input tokens, consistently yielded better performance metrics, specifically lower FID scores.
  • Comparative tests positioned DiG as markedly more efficient than contemporary sub-quadratic-time diffusion models, solidifying its practical utility in high-resolution visual content generation tasks.

Practical and Theoretical Implications

From a practical perspective, the development of DiG holds substantial implications for large-scale visual content generation. The reduced computational overhead makes it feasible to generate higher quality visuals without proportional resource scaling, potentially democratizing high-resolution visual content creation across diverse application domains.

Theoretically, the integration of GLA within diffusion models opens new avenues for exploring other low-complexity attention mechanisms within advanced machine learning frameworks. This direction fosters an ongoing exploration into combining different architectural efficiencies without compromising model effectiveness.

Future Directions

Looking ahead, further enhancements to DiG could involve:

  • Incorporating additional optimization techniques specific to GLA mechanisms.
  • Exploring hybrid architectures that combine the strengths of DiG with other emerging efficient transformer models.
  • Investigating the application of DiG in broader domains beyond visual content generation, such as natural language processing or complex pattern recognition tasks.

This paper lays a foundation for the future development of efficient, scalable diffusion models, enhancing the computational feasibility and broadening the accessibility of high-quality visual content generation.

Authors (7)
  1. Lianghui Zhu
  2. Zilong Huang
  3. Bencheng Liao
  4. Jun Hao Liew
  5. Hanshu Yan
  6. Jiashi Feng
  7. Xinggang Wang

GitHub

  1. hustvl/DiG: https://github.com/hustvl/DiG