You Only Sample Once: Taming One-Step Text-to-Image Synthesis by Self-Cooperative Diffusion GANs
Abstract: Recently, some works have tried to combine diffusion and Generative Adversarial Networks (GANs) to alleviate the computational cost of the iterative denoising inference in Diffusion Models (DMs). However, existing works in this line suffer from either training instability and mode collapse or subpar one-step generation learning efficiency. To address these issues, we introduce YOSO, a novel generative model designed for rapid, scalable, and high-fidelity one-step image synthesis with high training stability and mode coverage. Specifically, we smooth the adversarial divergence by the denoising generator itself, performing self-cooperative learning. We show that our method can serve as a one-step generation model training from scratch with competitive performance. Moreover, we extend our YOSO to one-step text-to-image generation based on pre-trained models by several effective training techniques (i.e., latent perceptual loss and latent discriminator for efficient training along with the latent DMs; the informative prior initialization (IPI), and the quick adaption stage for fixing the flawed noise scheduler). Experimental results show that YOSO achieves the state-of-the-art one-step generation performance even with Low-Rank Adaptation (LoRA) fine-tuning. In particular, we show that the YOSO-PixArt-$\alpha$ can generate images in one step trained on 512 resolution, with the capability of adapting to 1024 resolution without extra explicit training, requiring only ~10 A800 days for fine-tuning. Our code is provided at https://github.com/Luo-Yihong/YOSO.
- ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
- Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. arXiv preprint arXiv:2201.06503, 2022.
- Resampled priors for variational autoencoders. In Kamalika Chaudhuri and Masashi Sugiyama, editors, The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, pages 66–75. PMLR, 2019.
- Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023.
- Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
- Pixart-$\alpha$: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In The Twelfth International Conference on Learning Representations, 2024.
- Ernie-vilg 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts, 2023.
- Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
- Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
- Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
- Learning probabilistic models from generator latent spaces with hat EBM. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
- Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
- Non-adversarial image synthesis with generative latent nearest neighbors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5811–5819, 2019.
- Deep feature consistent variational autoencoder. In 2017 IEEE winter conference on applications of computer vision (WACV), pages 1133–1141. IEEE, 2017.
- LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
- Scaling up gans for text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- Scaling up gans for text-to-image synthesis, 2023.
- Elucidating the design space of diffusion-based generative models. In Proc. NeurIPS, 2022.
- Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021.
- Improved variational inference with inverse autoregressive flow. Advances in neural information processing systems, 29:4743–4751, 2016.
- Learning hierarchical priors in vaes. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 2866–2875, 2019.
- On fast sampling of diffusion probabilistic models. arXiv preprint arXiv:2106.00132, 2021.
- Common diffusion noise schedules and sample steps are flawed. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5404–5411, January 2024.
- Qiang Liu. Rectified flow: A marginal preserving approach to optimal transport. arXiv preprint arXiv:2209.14577, 2022.
- Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2022.
- Instaflow: One step is enough for high-quality diffusion-based text-to-image generation. arXiv preprint arXiv:2309.06380, 2023.
- Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
- Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022.
- Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023.
- Lcm-lora: A universal stable-diffusion acceleration module, 2023.
- Energy-calibrated vae with test time free lunch, 2024.
- Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
- T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
- DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024.
- Journeydb: A benchmark for generative image understanding, 2023.
- Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022.
- Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
- Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
- Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
- Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022.
- Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis. arXiv preprint arXiv:2301.09515, 2023.
- Adversarial diffusion distillation, 2023.
- Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
- Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
- Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
- Improved techniques for training consistency models. In The Twelfth International Conference on Learning Representations, 2024.
- Consistency models. 2023.
- Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
- Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.
- Tackling the generative learning trilemma with denoising diffusion GANs. In International Conference on Learning Representations, 2022.
- Cooperative learning of energy-based model and latent variable model via mcmc teaching. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
- Learning energy-based model with variational auto-encoder as amortized sampler. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10441–10451, 2021.
- A tale of two flows: Cooperative learning of langevin flow and normalizing flow toward energy-based model. In International Conference on Learning Representations, 2022.
- Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models. arXiv preprint arXiv:2303.04803, 2023.
- Semi-implicit denoising diffusion models (siddms). arXiv preprint arXiv:2306.12511, 2023.
- Ufogen: You forward once large scale text-to-image generation via diffusion gans. ArXiv, abs/2311.09257, 2023.
- Raphael: Text-to-image generation via large mixture of diffusion paths. arXiv preprint arXiv:2305.18295, 2023.
- Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
- Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
- The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.