
Scalable Diffusion Models with State Space Backbone (2402.05608v3)

Published 8 Feb 2024 in cs.CV and cs.MM

Abstract: This paper presents a new exploration into a category of diffusion models built upon state space architecture. We endeavor to train diffusion models for image data, wherein the traditional U-Net backbone is supplanted by a state space backbone, functioning on raw patches or latent space. Given its notable efficacy in accommodating long-range dependencies, Diffusion State Space Models (DiS) are distinguished by treating all inputs including time, condition, and noisy image patches as tokens. Our assessment of DiS encompasses both unconditional and class-conditional image generation scenarios, revealing that DiS exhibits comparable, if not superior, performance to CNN-based or Transformer-based U-Net architectures of commensurate size. Furthermore, we analyze the scalability of DiS, gauged by the forward pass complexity quantified in Gflops. DiS models with higher Gflops, achieved through augmentation of depth/width or augmentation of input tokens, consistently demonstrate lower FID. In addition to demonstrating commendable scalability characteristics, DiS-H/2 models in latent space achieve performance levels akin to prior diffusion models on class-conditional ImageNet benchmarks at the resolution of 256$\times$256 and 512$\times$512, while significantly reducing the computational burden. The code and models are available at: https://github.com/feizc/DiS.

Authors (4)
  1. Zhengcong Fei (27 papers)
  2. Mingyuan Fan (35 papers)
  3. Changqian Yu (28 papers)
  4. Junshi Huang (24 papers)
Citations (30)

Summary

An Overview of Scalable Diffusion Models with State Space Backbone

The paper presents Diffusion State Space Models (DiS), a novel approach to designing diffusion models that leverages a state space architecture. It replaces the traditional U-Net backbone with a state space backbone for image generation tasks, highlighting its efficacy in capturing long-range dependencies. This positions DiS as a competitive alternative to CNN-based and Transformer-based architectures while delivering significant computational savings.

Methodological Advancements

The core innovation of this paper lies in employing state space models (SSMs) as the backbone for diffusion models. SSMs, rooted in control theory, have gained recognition for their proficiency in sequence modeling, due in part to advances such as Structured State Space Models (S4) and Mamba. Departing from the convolutional and self-attention paradigms, DiS employs both forward and backward directional SSM blocks, handling long-range dependencies more efficiently, as evidenced by their linear scaling in sequence length.
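As an illustrative sketch only (not the paper's implementation), the linear-time recurrence at the heart of an SSM, and the forward-plus-backward combination described above, can be written in a few lines of NumPy. The matrices `A`, `B`, `C` and the way the two directions are merged (summation here) are placeholder assumptions:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Discrete SSM recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    Runs in time linear in the sequence length T.
    x: (T, d_in); A: (d_state, d_state); B: (d_state, d_in); C: (d_out, d_state)."""
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = A @ h + B @ x[t]   # state update
        ys.append(C @ h)       # readout
    return np.stack(ys)

def bidirectional_ssm(x, A, B, C):
    """A forward scan plus a backward scan over the reversed sequence,
    summed so each token sees context from both directions."""
    fwd = ssm_scan(x, A, B, C)
    bwd = ssm_scan(x[::-1], A, B, C)[::-1]
    return fwd + bwd
```

Because each token is processed once per direction, the cost grows linearly with the number of tokens, in contrast to the quadratic cost of full self-attention.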

DiS models adopt patchification for data representation, converting input images into discrete tokens, a step that aligns naturally with the sequential nature of SSMs. The paper also investigates adaptive designs for condition incorporation, handling both timestep and class-label conditioning without altering the core SSM structure. This design retains the SSM's advantage in capturing long-range dependencies while avoiding the quadratic cost that self-attention imposes on Transformer models.
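A minimal sketch of the tokenization step, under simplifying assumptions: `patchify` splits an image into non-overlapping flattened patches, and `tokenize_inputs` illustrates the abstract's idea of treating time and condition as tokens alongside the image patches. The helper names and the assumption that `t_emb`/`y_emb` are already projected to the token width are ours, not the paper's:

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches,
    flattened into an (H*W/p^2, p*p*C) token sequence."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0
    return (img.reshape(H // p, p, W // p, p, C)
               .transpose(0, 2, 1, 3, 4)   # group the two patch axes together
               .reshape(-1, p * p * C))

def tokenize_inputs(img, p, t_emb, y_emb):
    """Prepend time and class-condition embeddings to the patch tokens,
    so all inputs enter the backbone as one token sequence.
    t_emb, y_emb: (p*p*C,) vectors, assumed pre-projected to the token width."""
    patches = patchify(img, p)
    return np.concatenate([t_emb[None], y_emb[None], patches], axis=0)
```

With a patch size of 2 on an 8x8x3 image, this yields 16 image tokens of width 12, plus the two condition tokens.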

Key Results and Implications

Extensive experiments show that DiS achieves competitive performance across a range of image generation tasks. The models are robust in both unconditional and class-conditional setups, rivaling established architectures in the domain; in particular, DiS outperforms several state-of-the-art models on benchmarks such as CIFAR-10 and ImageNet. The paper also demonstrates a clear scaling advantage, showing a strong correlation between increased model complexity (measured in Gflops) and improved sample quality, underscoring the scalability of DiS models.

The implications of this research are multifaceted. Practically, deploying DiS models can significantly reduce computational costs relative to Transformers, which is especially beneficial in applications with large inputs or high resolutions, such as gigapixel image analysis. Theoretically, the results advocate SSMs as viable and scalable backbones for neural generative models, inspiring further exploration of architectural designs beyond traditional CNNs and Transformers.

Future Directions

The paper sets a foundational precedent for deploying state space backbones within generative diffusion models. Future research is poised to scale DiS further, enlarging both model size and token count, which could push the boundaries of what state space-based designs can achieve in the diversity and quality of generated images. Additionally, the scalability demonstrated by DiS models invites adaptation to larger-scale multimodal datasets, potentially broadening the applicability of diffusion models in AI-driven creative industries and scientific visualization.

In conclusion, "Scalable Diffusion Models with State Space Backbone" contributes an impactful perspective to generative modeling, blending state space dynamics with diffusion processes to pair reduced computational complexity with high-quality output. The promising results pave the way for future developments and invite the research community to reevaluate longstanding paradigms governing deep generative models.
