SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers (2401.08740v2)

Published 16 Jan 2024 in cs.CV and cs.LG

Abstract: We present Scalable Interpolant Transformers (SiT), a family of generative models built on the backbone of Diffusion Transformers (DiT). The interpolant framework, which allows for connecting two distributions in a more flexible way than standard diffusion models, makes possible a modular study of various design choices impacting generative models built on dynamical transport: learning in discrete or continuous time, the objective function, the interpolant that connects the distributions, and deterministic or stochastic sampling. By carefully introducing the above ingredients, SiT surpasses DiT uniformly across model sizes on the conditional ImageNet 256x256 and 512x512 benchmark using the exact same model structure, number of parameters, and GFLOPs. By exploring various diffusion coefficients, which can be tuned separately from learning, SiT achieves an FID-50K score of 2.06 and 2.62, respectively.

Summary

  • The paper presents SiT, a family of generative models that applies the stochastic interpolant framework to the DiT backbone and outperforms DiT on the class-conditional ImageNet 256x256 and 512x512 benchmarks.
  • It adopts continuous-time learning and a velocity-prediction objective, with a linear interpolant connecting the data and noise distributions.
  • The diffusion coefficient of the stochastic sampler can be tuned after training, and classifier-free guidance further improves generative performance, enabling post-training adaptability across imaging tasks.

Introduction

The development of generative models is a crucial aspect of AI research, particularly in the field of image generation. One promising avenue involves diffusion models, which have achieved impressive results by transforming data into noise and learning to reverse this transformation to generate new samples. This paper presents Scalable Interpolant Transformers (SiT), built on the backbone of Diffusion Transformers (DiT), which use the stochastic interpolant framework to connect data and noise distributions through dynamical transport.

Model Overview

SiT adopts an interpolant framework that enables a more flexible connection between the source and target distributions than standard diffusion models. The authors systematically examine several design elements: discrete- versus continuous-time learning, the model objective, the choice of interpolant connecting the distributions, and deterministic versus stochastic samplers (a minimal training sketch follows below). With the same model structure, parameter count, and GFLOPs as DiT, SiT consistently outperforms its counterpart across model sizes on the class-conditional ImageNet 256x256 and 512x512 benchmarks.
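
To make the interpolant and objective concrete, here is a minimal, hedged sketch of a continuous-time velocity-matching loss under a linear interpolant x_t = (1 - t) * x0 + t * eps, whose target velocity is eps - x0. The `model` name and its (x, t, label) call signature are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def velocity_loss(model, x0, labels):
    """Continuous-time velocity-matching loss for one minibatch (sketch)."""
    t = torch.rand(x0.shape[0], device=x0.device)      # t ~ Uniform(0, 1)
    eps = torch.randn_like(x0)                          # Gaussian noise sample
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))            # broadcast t over image dims
    x_t = (1.0 - t_) * x0 + t_ * eps                    # linear interpolant between data and noise
    target_v = eps - x0                                 # d x_t / d t for this interpolant
    pred_v = model(x_t, t, labels)                      # transformer predicts the velocity field
    return ((pred_v - target_v) ** 2).mean()
```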

Performance Analysis

A central insight from SiT is that performance can be substantially improved through judicious choices of how time is modeled, the learning objective, and the specific interpolant connecting the data and noise distributions. In addition, sampling after training can be either deterministic or stochastic, and this decision is made independently of the learning process. The study finds that the best performance on large-scale image generation comes from continuous-time learning, a velocity-prediction model, a linear interpolant connecting the distributions, and a well-chosen diffusion coefficient for the stochastic sampler (a simplified sampler sketch follows below).
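
As an illustration of how sampling is decoupled from training, the following is a simplified, deterministic Euler sampler that integrates the probability-flow ODE dx/dt = v(x, t) from noise (t = 1) back to data (t = 0). The stochastic sampler discussed in the paper additionally adds a score-based drift correction and injected noise scaled by the chosen diffusion coefficient, which this sketch omits; `model`, its signature, and the time convention are assumptions for illustration.

```python
import torch

@torch.no_grad()
def sample_ode(model, labels, shape, num_steps=250, device="cuda"):
    """Deterministic Euler integration of the learned velocity field (sketch)."""
    x = torch.randn(shape, device=device)                        # start from pure noise at t = 1
    ts = torch.linspace(1.0, 0.0, num_steps + 1, device=device)  # integrate backward to t = 0
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        t_batch = torch.full((shape[0],), float(t), device=device)
        v = model(x, t_batch, labels)                             # predicted velocity dx/dt
        x = x + (t_next - t) * v                                  # Euler step backward in time
    return x
```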

Further Explorations

Notably, the framework allows substantial customization after training. Because the diffusion coefficient used during stochastic sampling is not tied to the forward noising process, it can be tuned post hoc to improve results. Moreover, SiT benefits from classifier-free guidance (a minimal sketch follows below), a technique that enhances generative performance and indicates the model's broader applicability and potential for adaptation across imaging tasks. This work opens avenues for future research into SiT's application in domains and tasks beyond image generation.
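
For completeness, here is a hedged sketch of classifier-free guidance applied to a velocity model at sampling time, assuming the model was trained with label dropout so it can be queried with a null class token. The names (`model`, `null_label`) and the guidance scale are illustrative assumptions, not the paper's exact API.

```python
import torch

def guided_velocity(model, x, t, labels, null_label, cfg_scale=1.5):
    """Blend conditional and unconditional velocity predictions (sketch)."""
    v_cond = model(x, t, labels)                                  # conditional prediction
    v_uncond = model(x, t, torch.full_like(labels, null_label))   # unconditional (null-label) prediction
    return v_uncond + cfg_scale * (v_cond - v_uncond)             # guided velocity field
```

As with diffusion models in general, larger guidance scales tend to trade sample diversity for fidelity.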
