Diffusion Models Without Attention (2311.18257v1)

Published 30 Nov 2023 in cs.CV and cs.LG

Abstract: In recent advancements in high-fidelity image generation, Denoising Diffusion Probabilistic Models (DDPMs) have emerged as a key player. However, their application at high resolutions presents significant computational challenges. Current methods, such as patchifying, expedite processes in UNet and Transformer architectures but at the expense of representational capacity. Addressing this, we introduce the Diffusion State Space Model (DiffuSSM), an architecture that supplants attention mechanisms with a more scalable state space model backbone. This approach effectively handles higher resolutions without resorting to global compression, thus preserving detailed image representation throughout the diffusion process. Our focus on FLOP-efficient architectures in diffusion training marks a significant step forward. Comprehensive evaluations on both ImageNet and LSUN datasets at two resolutions demonstrate that DiffuSSMs are on par or even outperform existing diffusion models with attention modules in FID and Inception Score metrics while significantly reducing total FLOP usage.

Generative models have made remarkable progress in producing high-quality images, and the Denoising Diffusion Probabilistic Model (DDPM) has become one of the most prominent approaches. DDPMs transform simple noise into intricate images by iteratively refining a noisy sample. Despite their success, DDPMs demand vast computational resources, especially when generating high-resolution images, largely because of their self-attention mechanisms, whose cost grows quadratically with the length of the flattened image sequence.
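
To make the iterative refinement concrete, here is a minimal sketch of DDPM ancestral sampling. It is illustrative only, not the paper's code: the linear noise schedule is a common default, and `model` stands in for any noise-prediction network.

```python
import torch

# Minimal DDPM reverse-process sketch (illustrative, not the paper's implementation).
# `model` is a placeholder for any noise-prediction network eps_theta(x_t, t).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule (common choice)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def ddpm_sample(model, shape):
    x = torch.randn(shape)                       # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = model(x, torch.full((shape[0],), t))      # predict the noise added at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])  # posterior mean of x_{t-1}
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise  # one refinement step toward an image
    return x
```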

In light of this computational bottleneck, the Diffusion State Space Model (DiffuSSM) is a notable development in diffusion-based image generation. DiffuSSM operates without the attention mechanisms that have been a staple of most high-performing DDPMs. Instead, it uses a gated state space model as its backbone, which processes the full image sequence efficiently. What sets DiffuSSM apart from its predecessors is that it avoids both attention and global representation compression, such as patchification or multi-scale layers, which often sacrifice spatial detail and structural integrity in the generated images.
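
As a rough picture of what a gated state-space block looks like, the sketch below pairs a multiplicative gate with a linear-time recurrence over the sequence. It is written in the spirit of gated SSM layers, not as the authors' implementation: the parameterization, initialization, and the naive sequential scan are all simplifying assumptions (real implementations use structured S4/H3-style kernels and bidirectional scans).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGatedSSMBlock(nn.Module):
    """Illustrative gated state-space block (a sketch, not DiffuSSM's exact layer).
    The SSM core is a naive diagonal linear recurrence over the flattened image
    sequence; no attention is used anywhere."""
    def __init__(self, d_model, d_state=16):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)   # value branch + gate branch
        self.out_proj = nn.Linear(d_model, d_model)
        # Assumed parameterization: negative log-decays keep the recurrence stable.
        self.A = nn.Parameter(-torch.rand(d_model, d_state))
        self.B = nn.Parameter(torch.randn(d_model, d_state) * 0.1)
        self.C = nn.Parameter(torch.randn(d_model, d_state) * 0.1)

    def forward(self, x):                                # x: (batch, length, d_model)
        v, g = self.in_proj(x).chunk(2, dim=-1)
        h = torch.zeros(x.shape[0], x.shape[-1], self.A.shape[-1], device=x.device)
        decay = torch.exp(self.A)                        # per-channel, per-state decay in (0, 1]
        ys = []
        for t in range(x.shape[1]):                      # linear-time scan over the sequence
            h = decay * h + self.B * v[:, t].unsqueeze(-1)
            ys.append((h * self.C).sum(-1))
        y = torch.stack(ys, dim=1)
        return self.out_proj(F.silu(g) * y)              # multiplicative gating, no attention
```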

DiffuSSM's approach is both efficient and scalable: it generates high-resolution, photorealistic images while preserving finer details throughout the diffusion process. This is achieved by alternating long-range state space model cores with strategically designed feed-forward networks arranged in an hourglass shape. The design targets both the asymptotic complexity in sequence length and the practical efficiency of the network.
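
One way such an hourglass-shaped feed-forward stage could look is sketched below: the sequence is shortened before the expensive dense layers and restored afterwards, so FLOPs drop while the surrounding SSM cores still see the full-resolution sequence. The shortening ratio, layer sizes, and residual placement are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HourglassFeedForward(nn.Module):
    """Sketch of an hourglass position-wise network (assumed layout, not the paper's)."""
    def __init__(self, d_model, shorten=4, expand=2):
        super().__init__()
        self.shorten = shorten
        self.down = nn.Linear(shorten * d_model, d_model)    # merge `shorten` positions into one
        self.mlp = nn.Sequential(
            nn.Linear(d_model, expand * d_model),
            nn.GELU(),
            nn.Linear(expand * d_model, d_model),
        )
        self.up = nn.Linear(d_model, shorten * d_model)       # restore the original length

    def forward(self, x):                                     # x: (batch, length, d_model)
        b, l, d = x.shape
        assert l % self.shorten == 0, "sequence length must be divisible by the shortening ratio"
        z = x.reshape(b, l // self.shorten, self.shorten * d) # group neighbouring positions
        z = self.mlp(self.down(z))                            # dense work on the short sequence
        z = self.up(z).reshape(b, l, d)                       # expand back to full length
        return x + z                                          # residual keeps the full-detail path
```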

To put DiffuSSM to the test, comprehensive evaluations were conducted on the well-known ImageNet and LSUN datasets. On both, DiffuSSM matches or surpasses existing diffusion models at multiple resolutions as measured by the Fréchet Inception Distance (FID) and Inception Score, while significantly reducing total floating-point operations (FLOPs).
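
For reference, FID reduces to a closed-form distance between two Gaussians fitted to Inception features of real and generated images. A minimal sketch of that formula follows (feature extraction is omitted; only the means and covariances are assumed as inputs):

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(mu1, sigma1, mu2, sigma2):
    """FID between Gaussians N(mu1, sigma1) and N(mu2, sigma2) fitted to
    Inception features of real and generated images (standard formula)."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)   # matrix square root
    if np.iscomplexobj(covmean):                             # numerical noise can leave
        covmean = covmean.real                               # tiny imaginary parts
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```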

Moreover, the proposed architecture invites exploration of longer-range and higher-fidelity applications beyond image generation, such as audio, video, or 3D modeling, where efficient handling of long sequences is crucial. By removing the self-attention bottleneck in diffusion models, DiffuSSM points toward a range of future possibilities for generative modeling.

In summary, DiffuSSM is an innovative step toward efficient, scalable generation of high-resolution images without attention mechanisms. It improves computational efficiency while maintaining, if not improving, the quality of the generative process.

Authors (3)
  1. Jing Nathan Yan
  2. Jiatao Gu
  3. Alexander M. Rush