- The paper introduces Easy Consistency Tuning (ECT), a method that leverages diffusion pretraining to significantly speed up consistency model training while enhancing sample quality.
- It reformulates consistency model training around a differential consistency condition, under which diffusion training appears as the loosest discretization of the condition and consistency training as the tightest.
- ECT empirically achieves state-of-the-art efficiency, demonstrating a 2-step FID of 2.73 on CIFAR-10 while substantially reducing computational costs.
An In-depth Analysis of "Consistency Models Made Easy"
The paper Consistency Models Made Easy by Zhengyang Geng et al. builds on the foundational concepts of diffusion models (DMs) and introduces a more computationally efficient approach to training consistency models (CMs). The authors propose a new training strategy, termed Easy Consistency Tuning (ECT), that significantly accelerates CM training while achieving state-of-the-art generative performance.
Consistency models, like diffusion models, generate high-quality data samples, but they do so by mapping noisy inputs directly back to clean data in one or a few steps. Traditional diffusion models gradually transform the data distribution into a prior distribution (e.g., Gaussian noise) via a stochastic differential equation (SDE); sampling reverses this process and typically requires many model evaluations, leading to high computational cost. To mitigate this, researchers have explored various fast samplers and distillation methods, albeit with trade-offs in sample quality.
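For reference, the standard continuous-time formulation is sketched below, using the usual score-based SDE notation for convenience rather than quoting the paper verbatim:

```latex
% Forward noising SDE: clean data x_0 ~ p_data is gradually perturbed toward a Gaussian prior.
\mathrm{d}\mathbf{x}_t = \boldsymbol{f}(\mathbf{x}_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}_t

% Probability-flow ODE with the same marginals p_t; deterministic sampling integrates it
% backward from noise to data, typically requiring many network evaluations.
\frac{\mathrm{d}\mathbf{x}_t}{\mathrm{d}t}
  = \boldsymbol{f}(\mathbf{x}_t, t) - \tfrac{1}{2}\, g(t)^2\, \nabla_{\mathbf{x}} \log p_t(\mathbf{x}_t)

% A consistency model f_theta instead maps any point on an ODE trajectory straight to its
% data endpoint, enabling one- or few-step sampling:
\boldsymbol{f}_\theta(\mathbf{x}_t, t) \approx \mathbf{x}_0 \quad \text{for all } t.
```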
Key Contributions
The paper makes several key contributions:
- Reformulating CM Training: The authors show how a differential consistency condition, requiring the model output to stay constant along a probability-flow ODE trajectory, can be used to train consistency models. Under this view, diffusion training corresponds to the loosest discretization of the condition, so diffusion models can be seen as a special case of CMs with less stringent discretization requirements (see the equations sketched after this list).
- Easy Consistency Tuning (ECT): ECT is proposed as a simplified, more efficient training scheme. It starts from a pretrained diffusion model and follows a continuous-time schedule that progressively tightens the consistency condition, transitioning from diffusion pretraining toward consistency training and thereby avoiding most of the cost of training a CM from scratch.
- Dropout and Adaptive Weighting: The paper emphasizes that applying dropout consistently across noise levels helps balance gradient flow, noticeably improving CM training dynamics. Adaptive weighting functions are likewise shown to reduce gradient variance and accelerate convergence.
- Scaling Laws and Practical Efficiency: Through extensive experiments, ECT demonstrates classic power-law scaling in training compute, indicating robustness and adaptability to larger datasets and model scales. The authors provide a computationally efficient path to state-of-the-art performance, exemplified by achieving a 2-step FID of 2.73 on CIFAR-10 within one hour on a single A100 GPU.
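To make the reformulation concrete, the following is a hedged sketch of the objective in generic notation (theta^- denotes a stop-gradient copy of the model, d a distance metric, and w(t) a weighting function; the exact choices are the paper's and are not reproduced here):

```latex
% Differential consistency condition: the model output is invariant along a PF-ODE trajectory.
\frac{\mathrm{d}}{\mathrm{d}t}\, \boldsymbol{f}_\theta(\mathbf{x}_t, t) = 0

% Finite-difference surrogate used for training, with a stop-gradient target \theta^-:
\mathcal{L}(\theta)
  = \mathbb{E}\!\left[ w(t)\, d\!\big(\boldsymbol{f}_\theta(\mathbf{x}_t, t),\,
      \boldsymbol{f}_{\theta^-}(\mathbf{x}_r, r)\big) \right],
  \qquad 0 \le r < t .

% Setting r = 0 compares against the clean data endpoint and recovers a diffusion-style
% denoising objective; shrinking the gap \Delta t = t - r toward 0 during training tightens
% the condition until it approaches the continuous-time consistency constraint above.
```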
Detailed Insights
Efficiency and Scalability
ECT's efficiency is evident in the sharp reduction of training and sampling costs without compromising sample quality. The paper reports a considerable decrease in training FLOPs, with ECT achieving notable results on the CIFAR-10 and ImageNet 64×64 benchmarks. The method fine-tunes a pretrained diffusion model, yielding improved sample quality while using significantly fewer computational resources than iCT and other previous methods.
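As a rough illustration of what such fine-tuning could look like in practice, here is a minimal PyTorch-style sketch. It assumes a consistency-parameterized network f_theta initialized from a pretrained diffusion model; the annealing schedule, sigma range, and weighting below are illustrative assumptions, not the authors' released code.

```python
import torch

def ect_finetune_step(f_theta, x0, step, total_steps, opt,
                      sigma_min=0.002, sigma_max=80.0):
    """One illustrative ECT-style tuning step (a sketch, not the authors' exact recipe).

    f_theta(x_t, t) is assumed to return an estimate of the clean sample x0 and to be
    initialized from a pretrained diffusion model.
    """
    b = x0.shape[0]
    # Sample a noise level t and a smaller level r; the gap (t - r) is annealed toward 0
    # over training, interpolating from a diffusion-like objective (r ~ 0) to the
    # consistency condition (r -> t). Linear annealing here is an assumption.
    t = sigma_min + (sigma_max - sigma_min) * torch.rand(b, device=x0.device)
    gap = (t - sigma_min) * (1.0 - step / total_steps)
    r = torch.clamp(t - gap, min=sigma_min)

    noise = torch.randn_like(x0)
    x_t = x0 + t.view(-1, 1, 1, 1) * noise      # same noise realization at both levels
    x_r = x0 + r.view(-1, 1, 1, 1) * noise

    pred = f_theta(x_t, t)
    with torch.no_grad():                        # stop-gradient target
        target = f_theta(x_r, r)

    # Per-sample adaptive weighting; 1/(t - r) is one simple choice that keeps gradient
    # magnitudes comparable as the gap shrinks (an assumption, not the paper's exact w(t)).
    w = 1.0 / (t - r + 1e-4)
    loss = (w * ((pred - target) ** 2).flatten(1).mean(dim=1)).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The design choice mirrored here is that both noise levels share the same noise realization and the target pass is taken without gradients, so that as the gap shrinks the objective smoothly approaches the consistency condition rather than switching regimes abruptly.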
Theoretical Implications
The differential consistency condition provides a robust theoretical framework that redefines how consistency models are trained. It not only simplifies the understanding of CM training but also offers a practical path for leveraging preexisting diffusion models. This reconceptualization frames training in the language of differential equations and bridges the gap between diffusion modeling and consistency training.
Practical Relevance
By streamlining the training of generative models, the paper lays the groundwork for practical applications where computational resources are constrained. For instance, applications in creative industries, where generating high-quality visual content swiftly is paramount, can benefit immensely from the efficiencies introduced by ECT. Furthermore, the robustness of the scaling laws reported by the authors suggests potential for wider adoption in domains requiring large-scale data generation.
Future Directions
Potential areas of exploration suggested by the findings include:
- Parameter-Efficient Fine-Tuning (PEFT): Given ECT's efficiency, applying PEFT techniques could further reduce computational demands while maintaining generative quality.
- Tuning on Different Data: Investigating how tuning consistency models on data distinct from the pretraining data affects generalization merits further research.
- Cross-domain Applications: Examining the adaptability of ECT in domains beyond image generation, such as video or 3D object synthesis, could expand its applicability.
Conclusion
Consistency Models Made Easy presents a pivotal advancement in the efficient training of consistency models. The introduction of ECT provides a streamlined approach that reduces computational overhead and harnesses the strengths of both diffusion models and consistency training. Theoretical rigor combined with empirical validation positions this work as a significant stride toward more practical and scalable generative modeling.