SCott: Accelerating Diffusion Models with Stochastic Consistency Distillation (2403.01505v2)

Published 3 Mar 2024 in cs.CV

Abstract: The iterative sampling procedure employed by diffusion models (DMs) often leads to significant inference latency. To address this, we propose Stochastic Consistency Distillation (SCott) to enable accelerated text-to-image generation, where high-quality generations can be achieved with just 1-2 sampling steps, and further improvements can be obtained by adding additional steps. In contrast to vanilla consistency distillation (CD), which distills the ordinary differential equation (ODE) solver-based sampling process of a pretrained teacher model into a student, SCott explores the possibility and validates the efficacy of integrating stochastic differential equation (SDE) solvers into CD to fully unleash the potential of the teacher. SCott is augmented with elaborate strategies to control the noise strength and sampling process of the SDE solver. An adversarial loss is further incorporated to strengthen the sample quality at few sampling steps. Empirically, on the MSCOCO-2017 5K dataset with a Stable Diffusion-V1.5 teacher, SCott achieves an FID (Fréchet Inception Distance) of 22.1, surpassing that (23.4) of the 1-step InstaFlow (Liu et al., 2023) and matching that of 4-step UFOGen (Xue et al., 2023b). Moreover, SCott can yield more diverse samples than other consistency models for high-resolution image generation (Luo et al., 2023a), with up to a 16% improvement on a dedicated metric. The code and checkpoints are coming soon.
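
To make the core idea concrete, below is a minimal sketch of one SCott-style training step, assuming PyTorch. The helper names (`teacher_sde_step`, `student`, `student_ema`, `discriminator`) and the noising scheme are hypothetical stand-ins for illustration, not the paper's actual implementation. The point it shows: the distillation target comes from a noise-controlled SDE solver step of the teacher rather than the deterministic ODE step used by vanilla CD, with an adversarial term added on top of the consistency loss.

```python
import torch
import torch.nn.functional as F

def scott_training_step(x0, t_next, t_cur, student, student_ema,
                        teacher_sde_step, discriminator,
                        noise_scale=0.5, adv_weight=0.1):
    """One hypothetical SCott-style update (sketch, not the official code)."""
    # Forward-diffuse a clean latent x0 to the noisier time t_next
    # (a simplistic variance-exploding-style noising; an assumption here).
    eps = torch.randn_like(x0)
    x_next = x0 + t_next * eps

    # Teacher target: one *stochastic* (SDE) solver step from t_next back to
    # t_cur. `noise_scale` caps the injected noise strength, mirroring SCott's
    # control over the SDE solver; vanilla CD would take a deterministic
    # ODE solver step here instead.
    with torch.no_grad():
        x_cur = teacher_sde_step(x_next, t_next, t_cur, noise_scale)

    # Consistency loss: the student's prediction from the noisier point should
    # match the EMA student's prediction from the teacher-denoised point.
    pred = student(x_next, t_next)
    with torch.no_grad():
        target = student_ema(x_cur, t_cur)
    loss_cd = F.mse_loss(pred, target)

    # Adversarial (generator-side) loss to sharpen few-step sample quality.
    loss_adv = -discriminator(pred).mean()
    return loss_cd + adv_weight * loss_adv
```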

References (53)
  1. A study on the evaluation of generative models. arXiv preprint arXiv:2206.10935, 2022.
  2. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22563–22575, 2023.
  3. Subject-driven text-to-image generation via apprenticeship learning. arXiv preprint arXiv:2304.00186, 2023.
  4. Elucidating the solution space of extended reverse-time SDE for diffusion models, 2023.
  5. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
  6. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
  7. SEEDS: Exponential SDE solvers for fast high-quality sampling from diffusion models, 2023.
  8. CLIPScore: A reference-free evaluation metric for image captioning. In EMNLP, 2021.
  9. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
  10. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  11. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  12. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  13. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021.
  14. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.
  15. Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. arXiv preprint arXiv:2310.02279, 2023.
  16. Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann.
  17. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer, 2014.
  18. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
  19. InstaFlow: One step is enough for high-quality diffusion-based text-to-image generation. arXiv preprint arXiv:2309.06380, 2023.
  20. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022a.
  21. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022b.
  22. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11461–11471, 2022.
  23. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023a.
  24. Diff-Instruct: A universal approach for transferring knowledge from pre-trained diffusion models. arXiv preprint arXiv:2305.18455, 2023b.
  25. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021.
  26. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14297–14306, 2023.
  27. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
  28. Reliable fidelity and diversity metrics for generative models. In 37th International Conference on Machine Learning, ICML 2020, pp. 7133–7142. International Machine Learning Society (IMLS), 2020.
  29. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  30. DreamFusion: Text-to-3D using 2D diffusion. In The Eleventh International Conference on Learning Representations, 2022.
  31. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.
  32. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  33. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.
  34. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  35. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TIdIXIpzhoI.
  36. StyleGAN-T: Unlocking the power of GANs for fast large-scale text-to-image synthesis. In International Conference on Machine Learning, 2023a.
  37. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023b.
  38. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  39. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.
  40. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020a.
  41. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020b.
  42. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189, 2023.
  43. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
  44. Improved techniques for training score-based generative models. Advances in Neural Information Processing Systems, 33:12438–12448, 2020.
  45. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020c.
  46. Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34:1415–1428, 2021.
  47. Consistency models. arXiv preprint arXiv:2303.01469, 2023.
  48. Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
  49. Tackling the generative learning trilemma with denoising diffusion GANs. In International Conference on Learning Representations, 2021.
  50. Restart sampling for improving generative processes. arXiv preprint arXiv:2306.14878, 2023a.
  51. UFOGen: You forward once large scale text-to-image generation via diffusion GANs. arXiv preprint arXiv:2311.09257, 2023b.
  52. Fast sampling of diffusion models with exponential integrator. In The Eleventh International Conference on Learning Representations, 2022.
  53. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595, 2018.
Authors (8)
  1. Hongjian Liu (3 papers)
  2. Qingsong Xie (16 papers)
  3. Zhijie Deng (58 papers)
  4. Chen Chen (753 papers)
  5. Shixiang Tang (48 papers)
  6. Fueyang Fu (1 paper)
  7. Haonan Lu (35 papers)
  8. Zheng-Jun Zha (144 papers)