SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers (2401.08740v2)

Published 16 Jan 2024 in cs.CV and cs.LG

Abstract: We present Scalable Interpolant Transformers (SiT), a family of generative models built on the backbone of Diffusion Transformers (DiT). The interpolant framework, which allows for connecting two distributions in a more flexible way than standard diffusion models, makes possible a modular study of various design choices impacting generative models built on dynamical transport: learning in discrete or continuous time, the objective function, the interpolant that connects the distributions, and deterministic or stochastic sampling. By carefully introducing the above ingredients, SiT surpasses DiT uniformly across model sizes on the conditional ImageNet 256x256 and 512x512 benchmark using the exact same model structure, number of parameters, and GFLOPs. By exploring various diffusion coefficients, which can be tuned separately from learning, SiT achieves an FID-50K score of 2.06 and 2.62, respectively.

Summary

  • The paper presents SiT, a family of generative models that applies the stochastic interpolant framework to the DiT backbone and outperforms DiT on the class-conditional ImageNet 256x256 and 512x512 benchmarks.
  • It adopts continuous-time learning and a velocity-prediction objective, with a linear interpolant connecting the data and noise distributions.
  • The diffusion coefficient of the stochastic sampler can be tuned after training, and classifier-free guidance further improves generative performance, enabling post-training adaptability across imaging tasks.

Introduction

The development of generative models is a crucial aspect of AI research, particularly in the field of image generation. One promising avenue involves diffusion models, which have achieved impressive results by transforming data into noise and learning to reverse this transformation to generate new samples. This paper presents Scalable Interpolant Transformers (SiT), built on the backbone of Diffusion Transformers (DiT), which use the stochastic interpolant framework to connect data and noise distributions through dynamical transport.

Model Overview

SiT adopts an interpolant framework that enables a more flexible connection between the source and target distributions than standard diffusion models. The authors systematically examine several design elements: discrete- versus continuous-time learning, the model objective, the choice of interpolant connecting the distributions, and deterministic versus stochastic samplers (a minimal training sketch follows below). With the same model structure, parameter count, and GFLOPs as DiT, SiT consistently outperforms its counterpart across model sizes on the class-conditional ImageNet 256x256 and 512x512 benchmarks.
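
To make the interpolant and objective concrete, here is a minimal, hedged sketch of a continuous-time velocity-matching loss under a linear interpolant x_t = (1 - t) * x0 + t * eps, whose target velocity is eps - x0. The `model` name and its (x, t, label) call signature are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def velocity_loss(model, x0, labels):
    """Continuous-time velocity-matching loss for one minibatch (sketch)."""
    t = torch.rand(x0.shape[0], device=x0.device)      # t ~ Uniform(0, 1)
    eps = torch.randn_like(x0)                          # Gaussian noise sample
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))            # broadcast t over image dims
    x_t = (1.0 - t_) * x0 + t_ * eps                    # linear interpolant between data and noise
    target_v = eps - x0                                 # d x_t / d t for this interpolant
    pred_v = model(x_t, t, labels)                      # transformer predicts the velocity field
    return ((pred_v - target_v) ** 2).mean()
```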

Performance Analysis

A central insight from SiT is that performance can be substantially improved through judicious choices of how time is modeled, the learning objective, and the specific interpolant connecting the data and noise distributions. In addition, sampling after training can be either deterministic or stochastic, and this decision is made independently of the learning process. The study finds that the best performance on large-scale image generation comes from continuous-time learning, a velocity-prediction model, a linear interpolant connecting the distributions, and a well-chosen diffusion coefficient for the stochastic sampler (a simplified sampler sketch follows below).
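
As an illustration of how sampling is decoupled from training, the following is a simplified, deterministic Euler sampler that integrates the probability-flow ODE dx/dt = v(x, t) from noise (t = 1) back to data (t = 0). The stochastic sampler discussed in the paper additionally adds a score-based drift correction and injected noise scaled by the chosen diffusion coefficient, which this sketch omits; `model`, its signature, and the time convention are assumptions for illustration.

```python
import torch

@torch.no_grad()
def sample_ode(model, labels, shape, num_steps=250, device="cuda"):
    """Deterministic Euler integration of the learned velocity field (sketch)."""
    x = torch.randn(shape, device=device)                        # start from pure noise at t = 1
    ts = torch.linspace(1.0, 0.0, num_steps + 1, device=device)  # integrate backward to t = 0
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        t_batch = torch.full((shape[0],), float(t), device=device)
        v = model(x, t_batch, labels)                             # predicted velocity dx/dt
        x = x + (t_next - t) * v                                  # Euler step backward in time
    return x
```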

Further Explorations

Notably, the framework allows substantial customization after training. Because the diffusion coefficient used during stochastic sampling is not tied to the forward noising process, it can be tuned post hoc to improve results. Moreover, SiT benefits from classifier-free guidance (a minimal sketch follows below), a technique that enhances generative performance and indicates the model's broader applicability and potential for adaptation across imaging tasks. This work opens avenues for future research into SiT's application in domains and tasks beyond image generation.
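
For completeness, here is a hedged sketch of classifier-free guidance applied to a velocity model at sampling time, assuming the model was trained with label dropout so it can be queried with a null class token. The names (`model`, `null_label`) and the guidance scale are illustrative assumptions, not the paper's exact API.

```python
import torch

def guided_velocity(model, x, t, labels, null_label, cfg_scale=1.5):
    """Blend conditional and unconditional velocity predictions (sketch)."""
    v_cond = model(x, t, labels)                                  # conditional prediction
    v_uncond = model(x, t, torch.full_like(labels, null_label))   # unconditional (null-label) prediction
    return v_uncond + cfg_scale * (v_cond - v_uncond)             # guided velocity field
```

As with diffusion models in general, larger guidance scales tend to trade sample diversity for fidelity.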
