You Only Sample Once: Taming One-Step Text-to-Image Synthesis by Self-Cooperative Diffusion GANs

Published 19 Mar 2024 in cs.CV | (2403.12931v6)

Abstract: Recently, some works have tried to combine diffusion and Generative Adversarial Networks (GANs) to alleviate the computational cost of the iterative denoising inference in Diffusion Models (DMs). However, existing works in this line suffer from either training instability and mode collapse or subpar one-step generation learning efficiency. To address these issues, we introduce YOSO, a novel generative model designed for rapid, scalable, and high-fidelity one-step image synthesis with high training stability and mode coverage. Specifically, we smooth the adversarial divergence by the denoising generator itself, performing self-cooperative learning. We show that our method can serve as a one-step generation model training from scratch with competitive performance. Moreover, we extend our YOSO to one-step text-to-image generation based on pre-trained models by several effective training techniques (i.e., latent perceptual loss and latent discriminator for efficient training along with the latent DMs; the informative prior initialization (IPI), and the quick adaption stage for fixing the flawed noise scheduler). Experimental results show that YOSO achieves the state-of-the-art one-step generation performance even with Low-Rank Adaptation (LoRA) fine-tuning. In particular, we show that the YOSO-PixArt-$\alpha$ can generate images in one step trained on 512 resolution, with the capability of adapting to 1024 resolution without extra explicit training, requiring only ~10 A800 days for fine-tuning. Our code is provided at https://github.com/Luo-Yihong/YOSO.

Summary

  • The paper presents YOSO, a model that combines diffusion processes with GANs using self-cooperative learning to enable one-step text-to-image synthesis.
  • The method improves training stability and mode coverage; the PixArt-$\alpha$ variant is trained at 512 resolution and can adapt to 1024 resolution without extra explicit training.
  • Experimental results show state-of-the-art one-step generation performance even with LoRA fine-tuning, which points to potential uses in real-time content creation.

Insightful Overview of "You Only Sample Once: Taming One-Step Text-to-Image Synthesis by Self-Cooperative Diffusion GANs"

The paper "You Only Sample Once: Taming One-Step Text-to-Image Synthesis by Self-Cooperative Diffusion GANs" introduces YOSO, a generative model that combines diffusion models with GANs for efficient, high-fidelity text-to-image synthesis. Instead of the iterative denoising used by standard diffusion models, YOSO generates an image from a text description in a single inference step.
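To make the cost difference concrete, here is a minimal toy sketch that counts network evaluations for iterative sampling versus one-step sampling. Everything in it is an illustrative assumption rather than the paper's method: `CountingDenoiser`, `sample_multi_step`, `sample_one_step`, and the simple blending update are hypothetical stand-ins, and the "network" is a fixed function, not a trained model.

```python
import random


class CountingDenoiser:
    """Toy stand-in for a denoising network; counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x_t, t):
        self.calls += 1
        # Hypothetical "clean sample" prediction: shrink the input toward zero.
        return [0.1 * v for v in x_t]


def sample_multi_step(net, noise, num_steps):
    """Iterative sampling: one network call per denoising step."""
    x = list(noise)
    for step in range(num_steps, 0, -1):
        x0_hat = net(x, step)
        keep = (step - 1) / num_steps  # fraction of the current state retained
        x = [keep * xi + (1.0 - keep) * x0i for xi, x0i in zip(x, x0_hat)]
    return x


def sample_one_step(net, noise, t_max):
    """One-step sampling: a single network call maps noise to a sample."""
    return net(noise, t_max)


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]

multi_net = CountingDenoiser()
multi_sample = sample_multi_step(multi_net, noise, num_steps=50)

one_net = CountingDenoiser()
one_sample = sample_one_step(one_net, noise, t_max=50)

print("multi-step network calls:", multi_net.calls)
print("one-step network calls:", one_net.calls)
```

The point of the sketch is only the call count: inference cost scales with the number of network evaluations, which is why a one-step generator is attractive when it can match multi-step quality.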

Key Contributions

The authors of this study make several significant advancements in the domain of generative models:

  1. Introduction of YOSO: The primary contribution is YOSO, which is built on a self-cooperative learning strategy. According to the abstract, the adversarial divergence is smoothed by the denoising generator itself, so the model does not need the iterative denoising steps of traditional diffusion models at inference time.
  2. Self-Cooperative Diffusion GANs: The study combines Diffusion Models (DMs) with GANs and uses self-cooperative learning to improve training stability and mode coverage. This allows one-step generation models to be trained from scratch with competitive performance.
  3. Scalability and Flexibility: YOSO works both as a stand-alone generative model and as a fine-tuning technique for pre-trained diffusion models. For text-to-image generation, the authors add a latent perceptual loss, a latent discriminator, informative prior initialization (IPI), and a quick adaption stage that fixes the flawed noise scheduler. The PixArt-$\alpha$ variant is trained at 512 resolution and adapts to 1024 resolution without extra explicit training.
  4. Diffusion Transformer and LoRA Fine-Tuning: The research also presents what it describes as the first diffusion transformer capable of one-step image generation (YOSO-PixArt-$\alpha$), and reports state-of-the-art one-step performance even when only Low-Rank Adaptation (LoRA) fine-tuning is used.
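For readers unfamiliar with LoRA, the following is a minimal sketch of the general idea, not the YOSO training code. A frozen weight matrix `W` is augmented with a trainable low-rank update `B @ A`, scaled by `alpha / rank`. The class name `LoRALinear`, the helpers `matmul` and `transpose`, and the layer sizes are hypothetical choices for illustration; the example uses plain Python lists to stay dependency-free.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen pre-trained weight, shape (out_dim, in_dim)
        # A gets a small random init; B starts at zero so the update is initially zero.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        """Compute x @ W^T + scale * (x @ A^T) @ B^T for a batch of row vectors x."""
        base = matmul(x, transpose(self.weight))
        low_rank = matmul(matmul(x, transpose(self.A)), transpose(self.B))
        return [[b + self.scale * d for b, d in zip(b_row, d_row)]
                for b_row, d_row in zip(base, low_rank)]

    def trainable_params(self):
        """Number of parameters in A and B, the only ones updated during fine-tuning."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


rng = random.Random(1)
W = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(16)]  # 16 x 32 frozen weight
layer = LoRALinear(W, rank=2, alpha=2.0)
x = [[rng.gauss(0.0, 1.0) for _ in range(32)]]  # batch of one input vector

full_params = len(W) * len(W[0])
print("full weight parameters:", full_params)
print("LoRA trainable parameters:", layer.trainable_params())
```

Because `B` is initialized to zero, the adapted layer reproduces the frozen layer exactly before any training step, and only the much smaller `A` and `B` matrices need gradients.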

Numerical and Experimental Insights

  • Efficiency: According to the abstract, fine-tuning YOSO-PixArt-$\alpha$ requires only about 10 A800 days.
  • Generative Performance: Both when trained from scratch and when fine-tuned for text-to-image synthesis, the model is reported to be competitive on qualitative and quantitative benchmarks. The authors report state-of-the-art one-step generation performance even with LoRA fine-tuning.

Theoretical and Practical Implications

Combining diffusion processes with GANs makes rapid, high-quality image generation more practical to deploy. Possible applications include real-time video synthesis, virtual reality environment creation, and user-driven content generation across multimedia platforms. On the theoretical side, the hybrid model raises questions about convergence, training stability, and scalability, and may serve as a blueprint for future one-step generative models.

Speculations on Future Developments

Future developments following this research could entail:

  • Adaptation to Larger Models and Datasets: As computational resources grow, scaling YOSO to larger models could further narrow the quality gap between one-step and multi-step synthesis models.
  • Integration with Automated Machine Learning Techniques: Leveraging AutoML could calibrate YOSO's parameters and configurations automatically for diverse datasets, optimizing model performance without extensive manual intervention.
  • Enhanced Conditional Controls and Customization: Finer control over generated attributes could align outputs more closely with text prompts or contextual requirements, making the model usable for highly customized content generation tasks.

In conclusion, the paper shows that integrating diffusion processes with GAN training is a promising route to efficient and scalable generative models, with reported gains in one-step generation speed and quality over existing approaches.

GitHub

  1. GitHub - Luo-Yihong/YOSO (71 stars)