
Fast Training of Diffusion Models with Masked Transformers (2306.09305v2)

Published 15 Jun 2023 in cs.CV, cs.AI, and cs.LG

Abstract: We propose an efficient approach to train large diffusion models with masked transformers. While masked transformers have been extensively explored for representation learning, their application to generative learning is less explored in the vision domain. Our work is the first to exploit masked training to reduce the training cost of diffusion models significantly. Specifically, we randomly mask out a high proportion (e.g., 50%) of patches in diffused input images during training. For masked training, we introduce an asymmetric encoder-decoder architecture consisting of a transformer encoder that operates only on unmasked patches and a lightweight transformer decoder on full patches. To promote a long-range understanding of full patches, we add an auxiliary task of reconstructing masked patches to the denoising score matching objective that learns the score of unmasked patches. Experiments on ImageNet-256x256 and ImageNet-512x512 show that our approach achieves competitive and even better generative performance than the state-of-the-art Diffusion Transformer (DiT) model, using only around 30% of its original training time. Thus, our method shows a promising way of efficiently training large transformer-based diffusion models without sacrificing the generative performance.
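
To make the training recipe in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of masked diffusion training: patchify a noised image, randomly drop 50% of the patches, run a transformer encoder on the visible patches only, run a lightweight decoder over the full token set (masked slots filled with a learned mask token), and combine a denoising loss on visible patches with an auxiliary reconstruction loss on masked patches. This is not the authors' released code; the forward process, loss targets, module sizes, and names such as `MaskedDenoiser` and `masked_training_step` are illustrative assumptions, and positional embeddings and timestep conditioning are omitted for brevity.

```python
import torch
import torch.nn as nn


def patchify(x, p=16):
    # (B, C, H, W) -> (B, N, C*p*p) with N = (H/p) * (W/p)
    B, C, H, W = x.shape
    x = x.unfold(2, p, p).unfold(3, p, p)              # B, C, H/p, W/p, p, p
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)


class MaskedDenoiser(nn.Module):
    """Asymmetric encoder-decoder sketch: a heavy encoder sees only visible
    patches, a lightweight decoder runs on the full token set."""

    def __init__(self, patch_dim, dim=384, enc_layers=6, dec_layers=2, heads=6):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True), enc_layers)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True), dec_layers)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, patch_dim)          # per-patch prediction
        # positional embeddings and timestep conditioning omitted for brevity

    def forward(self, patches, keep_idx):
        B, N, _ = patches.shape
        tokens = self.embed(patches)
        d = tokens.size(-1)
        kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
        enc = self.encoder(kept)                       # encoder on visible patches only
        # scatter encoded tokens back; masked slots receive the learned mask token
        full = self.mask_token.expand(B, N, d).clone()
        full = full.scatter(1, keep_idx.unsqueeze(-1).expand(-1, -1, d), enc)
        return self.head(self.decoder(full))           # light decoder on all patches


def masked_training_step(model, x0, mask_ratio=0.5, p=16):
    B = x0.size(0)
    t = torch.rand(B, 1, 1, 1)
    noise = torch.randn_like(x0)
    xt = (1 - t) * x0 + t * noise                      # placeholder forward process
    noisy = patchify(xt, p)
    target = patchify(noise, p)                        # predict the added noise (illustrative)
    N = noisy.size(1)
    n_keep = int(N * (1 - mask_ratio))
    perm = torch.rand(B, N).argsort(dim=1)             # random mask per image
    keep_idx, mask_idx = perm[:, :n_keep], perm[:, n_keep:]
    pred = model(noisy, keep_idx)
    d = pred.size(-1)

    def take(x, idx):
        return torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, d))

    # denoising loss on visible patches ...
    loss_dsm = (take(pred, keep_idx) - take(target, keep_idx)).pow(2).mean()
    # ... plus auxiliary reconstruction of the masked patches of the noised image
    loss_rec = (take(pred, mask_idx) - take(noisy, mask_idx)).pow(2).mean()
    return loss_dsm + loss_rec


if __name__ == "__main__":
    model = MaskedDenoiser(patch_dim=3 * 16 * 16)
    x0 = torch.randn(2, 3, 64, 64)                     # toy batch
    print(masked_training_step(model, x0).item())
```

Because the encoder processes only half of the patch tokens, each training step costs roughly half the encoder compute of full-token training, which is the source of the reported savings; the lightweight decoder and auxiliary reconstruction term keep the model aware of the full patch layout.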

