Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference (2310.04378v1)

Published 6 Oct 2023 in cs.CV and cs.LG

Abstract: Latent Diffusion Models (LDMs) have achieved remarkable results in synthesizing high-resolution images. However, the iterative sampling process is computationally intensive and leads to slow generation. Inspired by Consistency Models (Song et al.), we propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs, including Stable Diffusion (Rombach et al.). Viewing the guided reverse diffusion process as solving an augmented probability flow ODE (PF-ODE), LCMs are designed to directly predict the solution of such ODE in latent space, mitigating the need for numerous iterations and allowing rapid, high-fidelity sampling. Efficiently distilled from pre-trained classifier-free guided diffusion models, a high-quality 768 x 768 2~4-step LCM takes only 32 A100 GPU hours for training. Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets. Evaluation on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve state-of-the-art text-to-image generation performance with few-step inference. Project Page: https://latent-consistency-models.github.io/

Overview of Latent Consistency Models

Latent Diffusion Models (LDMs), such as Stable Diffusion, have shown remarkable capabilities in generating high-resolution images based on textual descriptions. Nevertheless, their iterative reverse sampling process tends to be slow, which is not ideal for real-time applications. Latent Consistency Models (LCMs) present an innovative approach to fast, high-resolution image generation by reducing the number of required sampling steps significantly.
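
To make the few-step idea concrete, here is a minimal, hedged sketch of multistep consistency sampling in latent space, written in PyTorch-style Python: the model maps a noisy latent directly to a clean-latent estimate, and each additional step simply re-noises that estimate at a lower noise level and predicts again. The callables `lcm_model` and `alpha_sigma`, and the schedule values, are illustrative assumptions, not the authors' released interface.

```python
import torch

@torch.no_grad()
def multistep_lcm_sample(lcm_model, alpha_sigma, text_emb, timesteps, latent_shape, device="cuda"):
    """Hedged sketch of few-step sampling with a latent consistency model.

    Assumptions: `lcm_model(z_t, t, text_emb)` returns a predicted clean latent z_0;
    `alpha_sigma(t)` returns the noise-schedule coefficients (alpha_t, sigma_t);
    `timesteps` is a short decreasing schedule, e.g. [999, 759, 519, 279].
    """
    z = torch.randn(latent_shape, device=device)        # start from pure noise in latent space
    for i, t in enumerate(timesteps):
        z0_pred = lcm_model(z, t, text_emb)              # one call maps z_t directly to a clean latent
        if i + 1 < len(timesteps):
            t_next = timesteps[i + 1]
            alpha, sigma = alpha_sigma(t_next)           # re-noise the estimate at the next, lower level
            z = alpha * z0_pred + sigma * torch.randn_like(z0_pred)
        else:
            z = z0_pred
    return z                                             # decode with the LDM's VAE to obtain the image
```

With a single entry in `timesteps` this collapses to one-step generation; two to four entries correspond to the few-step regime reported in the paper.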

Distillation for Few-step Inference

LCMs are obtained through a one-stage guided distillation procedure that treats the classifier-free-guided reverse process as an augmented probability flow ODE (PF-ODE) and trains the model to predict its solution directly in latent space. This allows LCMs distilled from pre-trained LDMs to produce high-fidelity samples in just a few steps, or even a single step. Training is efficient: a high-quality 768-resolution LCM requires only 32 A100 GPU hours, and the proposed Skipping-Step technique further accelerates convergence during distillation.
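
As a rough illustration of the objective, the sketch below implements one latent consistency distillation step with the skipping-step idea. It assumes DDIM as the PF-ODE solver and a fixed guidance scale; `add_noise` and `ddim_step_cfg` are hypothetical helpers, and the paper additionally samples the guidance scale and conditions the model on it, which is omitted here.

```python
import torch
import torch.nn.functional as F

def lcd_training_step(student, ema_student, teacher_unet, vae, batch, k=20, w=7.5):
    """Hedged sketch of one latent consistency distillation (LCD) step.

    Assumptions (not the authors' released code): `vae.encode` yields clean latents,
    `add_noise(z0, t)` forward-diffuses them, `ddim_step_cfg(...)` runs one DDIM step of
    the classifier-free-guided (augmented) PF-ODE, and `student`/`ema_student`
    return predicted clean latents, i.e. the consistency function f_theta.
    """
    with torch.no_grad():
        z0 = vae.encode(batch["images"])                   # clean latents from the frozen LDM autoencoder
        c = batch["text_embeddings"]
        t_hi = torch.randint(k, 1000, (z0.shape[0],))      # t_{n+k}
        t_lo = t_hi - k                                    # skipping-step: jump k steps, not one
        z_hi = add_noise(z0, t_hi)                         # forward-diffuse to t_{n+k} (assumed helper)

        # One guided solver step t_{n+k} -> t_n along the augmented PF-ODE,
        # mixing conditional and unconditional teacher predictions with scale w.
        z_lo = ddim_step_cfg(teacher_unet, z_hi, t_hi, t_lo, c, guidance_scale=w)

        target = ema_student(z_lo, t_lo, c)                # consistency target from the EMA copy

    pred = student(z_hi, t_hi, c)                          # student prediction at the noisier point
    return F.huber_loss(pred, target)                      # distance d(pred, target); Huber is one common choice
```

Jumping k timesteps per solver call, rather than a single adjacent step, is the Skipping-Step technique referred to above; it shortens the effective distillation horizon and speeds up convergence.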

Fine-tuning on Custom Datasets

The paper also introduces Latent Consistency Fine-tuning (LCF), which enables a pre-trained LCM to be adapted efficiently to customized image datasets, maintaining the model's rapid inference capability. LCF demonstrates practical utility for downstream tasks, where LCMs must be tailored to specific styles or content without the need for a teacher diffusion model trained specifically on the new dataset.
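
The exact LCF recipe is not reproduced here, but the hedged sketch below shows a teacher-free, consistency-training-style fine-tuning step in the same spirit: both points on the trajectory are built by noising the same custom-dataset latent with the same noise at two levels, and the model's own EMA copy supplies the target, so no teacher diffusion model trained on the new data is required. `add_noise` and the model signatures remain illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def lcf_finetune_step(lcm, ema_lcm, vae, batch, k=20):
    """Hedged, teacher-free fine-tuning sketch on a customized dataset.

    Not the paper's verbatim LCF algorithm: the pair (z_{t_{n+k}}, z_{t_n}) comes from
    noising the same clean latent with the same noise eps at two levels, and the EMA
    copy of the LCM provides the consistency target instead of a teacher LDM.
    """
    with torch.no_grad():
        z0 = vae.encode(batch["images"])                 # latents of the custom images (assumed VAE API)
        c = batch["text_embeddings"]
        eps = torch.randn_like(z0)
        t_hi = torch.randint(k, 1000, (z0.shape[0],))
        t_lo = t_hi - k
        z_hi = add_noise(z0, t_hi, eps)                  # same noise, higher level (assumed helper)
        z_lo = add_noise(z0, t_lo, eps)                  # same noise, lower level
        target = ema_lcm(z_lo, t_lo, c)                  # EMA model supplies the target clean latent

    pred = lcm(z_hi, t_hi, c)
    return F.huber_loss(pred, target)
```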

Evaluation Results

Evaluation on the LAION-5B-Aesthetics dataset confirms that LCMs achieve state-of-the-art text-to-image generation with few inference steps. Notably, LCMs outperform other methods, including DDIM and Guided-Distill baselines, particularly in low-step inference scenarios, maintaining a compelling balance between image quality and generation speed.

Conclusion and Future Work

In summary, LCMs emerge as a promising solution for fast and high-quality image generation from text. They inherit the strengths of diffusion-based generative models while shedding the limitations of lengthy iterative processes. Prospects for future research include expanding LCM applications to additional image synthesis tasks like editing, inpainting, and super-resolution, broadening the model's utility in real-world scenarios.

References (35)
  1. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.
  2. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  3. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  4. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  5. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
  6. Gotta go fast when generating data with score-based models. arXiv preprint arXiv:2105.14080, 2021.
  7. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.
  8. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  9. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.
  10. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
  11. InstaFlow: One step is enough for high-quality diffusion-based text-to-image generation. arXiv preprint arXiv:2309.06380, 2023.
  12. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022a.
  13. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022b.
  14. Accelerating diffusion models via early stop of the diffusion process. arXiv preprint arXiv:2205.12524, 2022.
  15. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  14297–14306, 2023.
  16. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  17. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp.  8162–8171. PMLR, 2021.
  18. Norod78. Simpsons blip captions. https://huggingface.co/datasets/Norod78/simpsons-blip-captions, 2022.
  19. Justin N. M. Pinkney. Pokemon blip captions. https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions/, 2022.
  20. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  21. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10684–10695, 2022.
  22. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  23. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
  24. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
  25. Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, 28, 2015.
  26. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a.
  27. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
  28. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
  29. Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34:1415–1428, 2021.
  30. Consistency models. arXiv preprint arXiv:2303.01469, 2023.
  31. Learning to efficiently sample from diffusion probabilistic models. arXiv preprint arXiv:2106.03802, 2021.
  32. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
  33. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
  34. Fast sampling of diffusion models via operator learning. In International Conference on Machine Learning, pp.  42390–42402. PMLR, 2023.
  35. Truncated diffusion probabilistic models. stat, 1050:7, 2022.
Authors (5)
  1. Simian Luo
  2. Yiqin Tan
  3. Longbo Huang
  4. Jian Li
  5. Hang Zhao
Citations (321)