Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference (2310.04378v1)

Published 6 Oct 2023 in cs.CV and cs.LG

Abstract: Latent Diffusion Models (LDMs) have achieved remarkable results in synthesizing high-resolution images. However, the iterative sampling process is computationally intensive and leads to slow generation. Inspired by Consistency Models (Song et al.), we propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs, including Stable Diffusion (Rombach et al.). Viewing the guided reverse diffusion process as solving an augmented probability flow ODE (PF-ODE), LCMs are designed to directly predict the solution of such ODE in latent space, mitigating the need for numerous iterations and allowing rapid, high-fidelity sampling. Efficiently distilled from pre-trained classifier-free guided diffusion models, a high-quality 768 x 768 2~4-step LCM takes only 32 A100 GPU hours for training. Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets. Evaluation on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve state-of-the-art text-to-image generation performance with few-step inference. Project Page: https://latent-consistency-models.github.io/

Overview of Latent Consistency Models

Latent Diffusion Models (LDMs), such as Stable Diffusion, have shown remarkable capabilities in generating high-resolution images based on textual descriptions. Nevertheless, their iterative reverse sampling process tends to be slow, which is not ideal for real-time applications. Latent Consistency Models (LCMs) present an innovative approach to fast, high-resolution image generation by reducing the number of required sampling steps significantly.
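
To make the few-step idea concrete, here is a minimal, hedged sketch of multistep consistency sampling in latent space, written in PyTorch-style Python: the model maps a noisy latent directly to a clean-latent estimate, and each additional step simply re-noises that estimate at a lower noise level and predicts again. The callables `lcm_model` and `alpha_sigma`, and the schedule values, are illustrative assumptions, not the authors' released interface.

```python
import torch

@torch.no_grad()
def multistep_lcm_sample(lcm_model, alpha_sigma, text_emb, timesteps, latent_shape, device="cuda"):
    """Hedged sketch of few-step sampling with a latent consistency model.

    Assumptions: `lcm_model(z_t, t, text_emb)` returns a predicted clean latent z_0;
    `alpha_sigma(t)` returns the noise-schedule coefficients (alpha_t, sigma_t);
    `timesteps` is a short decreasing schedule, e.g. [999, 759, 519, 279].
    """
    z = torch.randn(latent_shape, device=device)        # start from pure noise in latent space
    for i, t in enumerate(timesteps):
        z0_pred = lcm_model(z, t, text_emb)              # one call maps z_t directly to a clean latent
        if i + 1 < len(timesteps):
            t_next = timesteps[i + 1]
            alpha, sigma = alpha_sigma(t_next)           # re-noise the estimate at the next, lower level
            z = alpha * z0_pred + sigma * torch.randn_like(z0_pred)
        else:
            z = z0_pred
    return z                                             # decode with the LDM's VAE to obtain the image
```

With a single entry in `timesteps` this collapses to one-step generation; two to four entries correspond to the few-step regime reported in the paper.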

Distillation for Few-step Inference

LCMs are obtained through a one-stage guided distillation procedure that treats the classifier-free-guided reverse process as an augmented probability flow ODE (PF-ODE) and trains the model to predict its solution directly in latent space. This allows LCMs distilled from pre-trained LDMs to produce high-fidelity samples in just a few steps, or even a single step. Training is efficient: a high-quality 768-resolution LCM requires only 32 A100 GPU hours, and the proposed Skipping-Step technique further accelerates convergence during distillation.
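
As a rough illustration of the objective, the sketch below implements one latent consistency distillation step with the skipping-step idea. It assumes DDIM as the PF-ODE solver and a fixed guidance scale; `add_noise` and `ddim_step_cfg` are hypothetical helpers, and the paper additionally samples the guidance scale and conditions the model on it, which is omitted here.

```python
import torch
import torch.nn.functional as F

def lcd_training_step(student, ema_student, teacher_unet, vae, batch, k=20, w=7.5):
    """Hedged sketch of one latent consistency distillation (LCD) step.

    Assumptions (not the authors' released code): `vae.encode` yields clean latents,
    `add_noise(z0, t)` forward-diffuses them, `ddim_step_cfg(...)` runs one DDIM step of
    the classifier-free-guided (augmented) PF-ODE, and `student`/`ema_student`
    return predicted clean latents, i.e. the consistency function f_theta.
    """
    with torch.no_grad():
        z0 = vae.encode(batch["images"])                   # clean latents from the frozen LDM autoencoder
        c = batch["text_embeddings"]
        t_hi = torch.randint(k, 1000, (z0.shape[0],))      # t_{n+k}
        t_lo = t_hi - k                                    # skipping-step: jump k steps, not one
        z_hi = add_noise(z0, t_hi)                         # forward-diffuse to t_{n+k} (assumed helper)

        # One guided solver step t_{n+k} -> t_n along the augmented PF-ODE,
        # mixing conditional and unconditional teacher predictions with scale w.
        z_lo = ddim_step_cfg(teacher_unet, z_hi, t_hi, t_lo, c, guidance_scale=w)

        target = ema_student(z_lo, t_lo, c)                # consistency target from the EMA copy

    pred = student(z_hi, t_hi, c)                          # student prediction at the noisier point
    return F.huber_loss(pred, target)                      # distance d(pred, target); Huber is one common choice
```

Jumping k timesteps per solver call, rather than a single adjacent step, is the Skipping-Step technique referred to above; it shortens the effective distillation horizon and speeds up convergence.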

Fine-tuning on Custom Datasets

The paper also introduces Latent Consistency Fine-tuning (LCF), which enables a pre-trained LCM to be adapted efficiently to customized image datasets, maintaining the model's rapid inference capability. LCF demonstrates practical utility for downstream tasks, where LCMs must be tailored to specific styles or content without the need for a teacher diffusion model trained specifically on the new dataset.
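
The exact LCF recipe is not reproduced here, but the hedged sketch below shows a teacher-free, consistency-training-style fine-tuning step in the same spirit: both points on the trajectory are built by noising the same custom-dataset latent with the same noise at two levels, and the model's own EMA copy supplies the target, so no teacher diffusion model trained on the new data is required. `add_noise` and the model signatures remain illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def lcf_finetune_step(lcm, ema_lcm, vae, batch, k=20):
    """Hedged, teacher-free fine-tuning sketch on a customized dataset.

    Not the paper's verbatim LCF algorithm: the pair (z_{t_{n+k}}, z_{t_n}) comes from
    noising the same clean latent with the same noise eps at two levels, and the EMA
    copy of the LCM provides the consistency target instead of a teacher LDM.
    """
    with torch.no_grad():
        z0 = vae.encode(batch["images"])                 # latents of the custom images (assumed VAE API)
        c = batch["text_embeddings"]
        eps = torch.randn_like(z0)
        t_hi = torch.randint(k, 1000, (z0.shape[0],))
        t_lo = t_hi - k
        z_hi = add_noise(z0, t_hi, eps)                  # same noise, higher level (assumed helper)
        z_lo = add_noise(z0, t_lo, eps)                  # same noise, lower level
        target = ema_lcm(z_lo, t_lo, c)                  # EMA model supplies the target clean latent

    pred = lcm(z_hi, t_hi, c)
    return F.huber_loss(pred, target)
```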

Evaluation Results

Evaluation on the LAION-5B-Aesthetics dataset confirms that LCMs achieve state-of-the-art text-to-image generation with few inference steps. Notably, LCMs outperform other methods, including DDIM and Guided-Distill baselines, particularly in low-step inference scenarios, maintaining a compelling balance between image quality and generation speed.

Conclusion and Future Work

In summary, LCMs emerge as a promising solution for fast and high-quality image generation from text. They inherit the strengths of diffusion-based generative models while shedding the limitations of lengthy iterative processes. Prospects for future research include expanding LCM applications to additional image synthesis tasks like editing, inpainting, and super-resolution, broadening the model's utility in real-world scenarios.

References (35)
  1. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.
  2. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  3. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  4. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  5. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
  6. Gotta go fast when generating data with score-based models. arXiv preprint arXiv:2105.14080, 2021.
  7. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.
  8. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  9. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.
  10. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
  11. InstaFlow: One step is enough for high-quality diffusion-based text-to-image generation. arXiv preprint arXiv:2309.06380, 2023.
  12. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022a.
  13. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022b.
  14. Accelerating diffusion models via early stop of the diffusion process. arXiv preprint arXiv:2205.12524, 2022.
  15. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  14297–14306, 2023.
  16. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  17. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp.  8162–8171. PMLR, 2021.
  18. Norod78. Simpsons blip captions. https://huggingface.co/datasets/Norod78/simpsons-blip-captions, 2022.
  19. Justin N. M. Pinkney. Pokemon blip captions. https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions/, 2022.
  20. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  21. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10684–10695, 2022.
  22. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  23. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
  24. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
  25. Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, 28, 2015.
  26. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a.
  27. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
  28. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
  29. Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34:1415–1428, 2021.
  30. Consistency models. arXiv preprint arXiv:2303.01469, 2023.
  31. Learning to efficiently sample from diffusion probabilistic models. arXiv preprint arXiv:2106.03802, 2021.
  32. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
  33. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
  34. Fast sampling of diffusion models via operator learning. In International Conference on Machine Learning, pp.  42390–42402. PMLR, 2023.
  35. Truncated diffusion probabilistic models. stat, 1050:7, 2022.
Authors (5)
  1. Simian Luo
  2. Yiqin Tan
  3. Longbo Huang
  4. Jian Li
  5. Hang Zhao
Citations (321)