Diffusion Models Without Attention (2311.18257v1)

Published 30 Nov 2023 in cs.CV and cs.LG

Abstract: In recent advancements in high-fidelity image generation, Denoising Diffusion Probabilistic Models (DDPMs) have emerged as a key player. However, their application at high resolutions presents significant computational challenges. Current methods, such as patchifying, expedite processes in UNet and Transformer architectures but at the expense of representational capacity. Addressing this, we introduce the Diffusion State Space Model (DiffuSSM), an architecture that supplants attention mechanisms with a more scalable state space model backbone. This approach effectively handles higher resolutions without resorting to global compression, thus preserving detailed image representation throughout the diffusion process. Our focus on FLOP-efficient architectures in diffusion training marks a significant step forward. Comprehensive evaluations on both ImageNet and LSUN datasets at two resolutions demonstrate that DiffuSSMs are on par or even outperform existing diffusion models with attention modules in FID and Inception Score metrics while significantly reducing total FLOP usage.

Generative models have made remarkable progress in producing high-quality images, and the Denoising Diffusion Probabilistic Model (DDPM) has become one of the most prominent approaches. DDPMs transform simple noise into intricate images by iteratively refining a noisy sample. Despite their success, DDPMs demand vast computational resources, especially when generating high-resolution images, largely because of their self-attention mechanisms, whose cost grows quadratically with the length of the flattened image sequence.
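
To make the iterative refinement concrete, here is a minimal sketch of DDPM ancestral sampling. It is illustrative only, not the paper's code: the linear noise schedule is a common default, and `model` stands in for any noise-prediction network.

```python
import torch

# Minimal DDPM reverse-process sketch (illustrative, not the paper's implementation).
# `model` is a placeholder for any noise-prediction network eps_theta(x_t, t).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule (common choice)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def ddpm_sample(model, shape):
    x = torch.randn(shape)                       # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = model(x, torch.full((shape[0],), t))      # predict the noise added at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])  # posterior mean of x_{t-1}
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise  # one refinement step toward an image
    return x
```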

In light of this computational bottleneck, the Diffusion State Space Model (DiffuSSM) is a notable development in diffusion-based image generation. DiffuSSM operates without the attention mechanisms that have been a staple of most high-performing DDPMs. Instead, it uses a gated state space model as its backbone, which processes the full image sequence efficiently. What sets DiffuSSM apart from its predecessors is that it avoids both attention and global representation compression, such as patchification or multi-scale layers, which often sacrifice spatial detail and structural integrity in the generated images.
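
As a rough picture of what a gated state-space block looks like, the sketch below pairs a multiplicative gate with a linear-time recurrence over the sequence. It is written in the spirit of gated SSM layers, not as the authors' implementation: the parameterization, initialization, and the naive sequential scan are all simplifying assumptions (real implementations use structured S4/H3-style kernels and bidirectional scans).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGatedSSMBlock(nn.Module):
    """Illustrative gated state-space block (a sketch, not DiffuSSM's exact layer).
    The SSM core is a naive diagonal linear recurrence over the flattened image
    sequence; no attention is used anywhere."""
    def __init__(self, d_model, d_state=16):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)   # value branch + gate branch
        self.out_proj = nn.Linear(d_model, d_model)
        # Assumed parameterization: negative log-decays keep the recurrence stable.
        self.A = nn.Parameter(-torch.rand(d_model, d_state))
        self.B = nn.Parameter(torch.randn(d_model, d_state) * 0.1)
        self.C = nn.Parameter(torch.randn(d_model, d_state) * 0.1)

    def forward(self, x):                                # x: (batch, length, d_model)
        v, g = self.in_proj(x).chunk(2, dim=-1)
        h = torch.zeros(x.shape[0], x.shape[-1], self.A.shape[-1], device=x.device)
        decay = torch.exp(self.A)                        # per-channel, per-state decay in (0, 1]
        ys = []
        for t in range(x.shape[1]):                      # linear-time scan over the sequence
            h = decay * h + self.B * v[:, t].unsqueeze(-1)
            ys.append((h * self.C).sum(-1))
        y = torch.stack(ys, dim=1)
        return self.out_proj(F.silu(g) * y)              # multiplicative gating, no attention
```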

DiffuSSM's approach is both efficient and scalable: it generates high-resolution, photorealistic images while preserving finer details throughout the diffusion process. This is achieved by alternating long-range state space model cores with strategically designed feed-forward networks arranged in an hourglass shape. The design targets both the asymptotic complexity in sequence length and the practical efficiency of the network.
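
One way such an hourglass-shaped feed-forward stage could look is sketched below: the sequence is shortened before the expensive dense layers and restored afterwards, so FLOPs drop while the surrounding SSM cores still see the full-resolution sequence. The shortening ratio, layer sizes, and residual placement are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HourglassFeedForward(nn.Module):
    """Sketch of an hourglass position-wise network (assumed layout, not the paper's)."""
    def __init__(self, d_model, shorten=4, expand=2):
        super().__init__()
        self.shorten = shorten
        self.down = nn.Linear(shorten * d_model, d_model)    # merge `shorten` positions into one
        self.mlp = nn.Sequential(
            nn.Linear(d_model, expand * d_model),
            nn.GELU(),
            nn.Linear(expand * d_model, d_model),
        )
        self.up = nn.Linear(d_model, shorten * d_model)       # restore the original length

    def forward(self, x):                                     # x: (batch, length, d_model)
        b, l, d = x.shape
        assert l % self.shorten == 0, "sequence length must be divisible by the shortening ratio"
        z = x.reshape(b, l // self.shorten, self.shorten * d) # group neighbouring positions
        z = self.mlp(self.down(z))                            # dense work on the short sequence
        z = self.up(z).reshape(b, l, d)                       # expand back to full length
        return x + z                                          # residual keeps the full-detail path
```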

To put DiffuSSM to the test, comprehensive evaluations were conducted on the well-known ImageNet and LSUN datasets. On both, DiffuSSM matches or surpasses existing diffusion models at multiple resolutions as measured by the Fréchet Inception Distance (FID) and Inception Score, while significantly reducing total floating-point operations (FLOPs).
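
For reference, FID reduces to a closed-form distance between two Gaussians fitted to Inception features of real and generated images. A minimal sketch of that formula follows (feature extraction is omitted; only the means and covariances are assumed as inputs):

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(mu1, sigma1, mu2, sigma2):
    """FID between Gaussians N(mu1, sigma1) and N(mu2, sigma2) fitted to
    Inception features of real and generated images (standard formula)."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)   # matrix square root
    if np.iscomplexobj(covmean):                             # numerical noise can leave
        covmean = covmean.real                               # tiny imaginary parts
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```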

Moreover, the proposed architecture invites exploration of longer-range and higher-fidelity applications beyond image generation, such as audio, video, or 3D modeling, where efficient handling of long sequences is crucial. By removing the self-attention bottleneck in diffusion models, DiffuSSM points toward a range of future possibilities for generative modeling.

In summary, DiffuSSM is an innovative step toward efficient, scalable generation of high-resolution images without attention mechanisms. It improves computational efficiency while maintaining, if not improving, the quality of the generative process.

Authors (3)
  1. Jing Nathan Yan
  2. Jiatao Gu
  3. Alexander M. Rush