Stable Consistency Tuning: Understanding and Improving Consistency Models (2410.18958v3)

Published 24 Oct 2024 in cs.LG and cs.CV

Abstract: Diffusion models achieve superior generation quality but suffer from slow generation speed due to the iterative nature of denoising. In contrast, consistency models, a new generative family, achieve competitive performance with significantly faster sampling. These models are trained either through consistency distillation, which leverages pretrained diffusion models, or consistency training/tuning directly from raw data. In this work, we propose a novel framework for understanding consistency models by modeling the denoising process of the diffusion model as a Markov Decision Process (MDP) and framing consistency model training as the value estimation through Temporal Difference (TD) Learning. More importantly, this framework allows us to analyze the limitations of current consistency training/tuning strategies. Built upon Easy Consistency Tuning (ECT), we propose Stable Consistency Tuning (SCT), which incorporates variance-reduced learning using the score identity. SCT leads to significant performance improvements on benchmarks such as CIFAR-10 and ImageNet-64. On ImageNet-64, SCT achieves 1-step FID 2.42 and 2-step FID 1.55, a new SoTA for consistency models.


Summary

  • The paper frames the denoising process of diffusion models as a Markov Decision Process (MDP) and consistency-model training as value estimation via Temporal Difference (TD) learning, and on this basis proposes Stable Consistency Tuning (SCT), which reduces the variance of training targets using the score identity.
  • It introduces variance-reduced learning, smoother progressive scheduling, and multistep inference, achieving state-of-the-art results on benchmarks like CIFAR-10 and ImageNet-64.
  • The study outlines practical implications for scaling generative models and suggests extensions to domains such as video generation and text-to-image synthesis.

An Overview of Stable Consistency Tuning: Advancements and Implications

The paper "Stable Consistency Tuning: Understanding and Improving Consistency Models" presents a novel framework to enhance the generation efficiency and stability of consistency models, a class of fast generative models outperforming traditional diffusion models in terms of sampling speed. The authors leverage a Markov Decision Process (MDP) perspective to elucidate the training mechanics of consistency models via Temporal Difference (TD) learning. This innovative approach not only provides deeper insights into the limitations and potential of existing training strategies but also paves the way for significant improvements in model performance.

Framework and Methodology

The crux of this research lies in modeling the denoising process of diffusion models as an MDP and framing consistency-model training as a value-estimation task. This conceptual shift lets the authors treat training as a form of TD learning, where the reward corresponds to the improvement in prediction across adjacent timesteps. It also clarifies the key differences between consistency distillation, which relies on a pretrained diffusion model, and consistency training/tuning directly from raw data, in terms of achievable performance gains and training stability.
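To make the TD reading concrete, the sketch below implements one consistency-training step in PyTorch under an EDM-style parameterization x_t = x_0 + σε: the prediction at a noisier state is regressed onto a stop-gradient target at the adjacent, less-noisy state, just as a bootstrapped value update regresses V(s) onto r + V(s'). The `ToyDenoiser`, function names, and schedule values are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in consistency model f_theta(x, sigma) -> estimate of x0.
    A real model would be a U-Net or transformer; this is purely illustrative."""
    def __init__(self, dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(),
                                 nn.Linear(64, dim))
    def forward(self, x, sigma):
        s = torch.full((x.shape[0], 1), float(sigma))
        return self.net(torch.cat([x, s], dim=-1))

def consistency_td_loss(model, x0, sigma, delta):
    """One consistency-training step read as a TD(0) update."""
    eps = torch.randn_like(x0)
    x_t = x0 + sigma * eps                   # noisy state at level sigma
    x_s = x0 + (sigma - delta) * eps         # adjacent state, one MDP step closer to data
    with torch.no_grad():                    # stop-gradient = bootstrapped TD target
        target = model(x_s, sigma - delta)
    pred = model(x_t, sigma)
    return (pred - target).pow(2).mean()     # squared TD error between adjacent states

model = ToyDenoiser()
x0 = torch.randn(32, 2)                      # toy "data" batch
loss = consistency_td_loss(model, x0, sigma=1.0, delta=0.1)
loss.backward()
```

In this view, shrinking `delta` toward zero is the consistency-training analogue of reducing the TD step size, which is exactly where the progressive schedule discussed below enters.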

Building on this understanding, the authors propose Stable Consistency Tuning (SCT), which introduces several enhancements:

  1. Variance-Reduced Learning: Using the score identity, SCT reduces the variance of the learning targets, yielding more stable training and better performance. The key is a more accurate approximation of the score function, which also carries over to conditional generation settings (a sketch of this estimator follows the list).
  2. Improved Progressive Training Schedule: SCT employs a smoother schedule for decreasing the time interval between states in the MDP, which helps in reducing discretization errors without jeopardizing training stability.
  3. Multistep Inference Strategy: The framework is extended to multistep settings, supporting deterministic multistep sampling (illustrated after the empirical results below). An edge-skipping strategy addresses optimization difficulties near the timestep boundaries, improving multistep performance.
  4. Classifier-Free Guidance: The paper also validates the effectiveness of guiding generation with a weaker version of the model itself, drawing on guidance techniques from other competitive diffusion models.
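As referenced in item 1, here is a hedged sketch of the variance-reduction idea. The single-sample score estimate −(x_t − x_0)/σ² is noisy because it uses only the one x_0 that generated x_t; the score identity ∇ log p_t(x_t) = E[∇ log p(x_t | x_0) | x_t] instead lets the marginal score be estimated as −(x_t − E[x_0 | x_t])/σ², with the posterior mean approximated over a reference batch of clean samples. The function name and the softmax-weighted estimator below are assumptions for illustration; SCT's exact estimator may differ.

```python
import torch

def variance_reduced_score(x_t, ref_batch, sigma):
    """Estimate grad log p_t(x_t) = -(x_t - E[x0 | x_t]) / sigma**2 by
    approximating E[x0 | x_t] with self-normalized posterior weights over a
    reference batch (illustrative; not necessarily SCT's exact estimator)."""
    # Gaussian log-likelihood log p(x_t | x0), up to a constant, per pair
    diff = x_t.unsqueeze(1) - ref_batch.unsqueeze(0)         # (B, R, ...)
    logw = -diff.flatten(2).pow(2).sum(-1) / (2 * sigma**2)  # (B, R)
    w = torch.softmax(logw, dim=1)                           # posterior weights
    x0_hat = torch.einsum('br,r...->b...', w, ref_batch)     # weighted posterior mean
    return -(x_t - x0_hat) / sigma**2                        # lower-variance score

# Usage: in practice the reference batch would be clean training images.
x0 = torch.randn(8, 2)
x_t = x0 + 1.0 * torch.randn_like(x0)
score = variance_reduced_score(x_t, ref_batch=torch.randn(256, 2), sigma=1.0)
```

Averaging over the reference batch shrinks the variance of the training target, which is what stabilizes tuning, particularly in the conditional settings mentioned above.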

Empirical Analysis

SCT demonstrates superior performance over previous consistency-model approaches such as Easy Consistency Tuning (ECT) and improved Consistency Training (iCT). Notably, on CIFAR-10 and ImageNet-64, SCT surpasses existing state-of-the-art methods, achieving a 1-step FID of 2.42 and a 2-step FID of 1.55 on ImageNet-64, a new record for consistency models.
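The 2-step figure comes from the deterministic multistep sampling strategy described above. Below is a minimal sketch of one deterministic variant, in which each step denoises and then re-noises along the direction implied by the current sample rather than with fresh Gaussian noise; whether this matches SCT's exact procedure (including its edge-skipping schedule) is an assumption.

```python
import torch

def multistep_consistency_sample(model, shape, sigmas):
    """Deterministic multistep consistency sampling (illustrative variant).
    `model(x, sigma)` returns a one-shot denoised estimate; `sigmas` is a
    decreasing noise schedule, e.g. [sigma_max, sigma_mid]."""
    x = sigmas[0] * torch.randn(shape)            # start from pure noise
    for i, sigma in enumerate(sigmas):
        x0_hat = model(x, sigma)                  # one-shot denoise at this level
        if i + 1 < len(sigmas):
            eps_hat = (x - x0_hat) / sigma        # noise implied by current sample
            x = x0_hat + sigmas[i + 1] * eps_hat  # deterministic re-noising
    return x0_hat

# Usage with any f(x, sigma) -> x0, e.g. the ToyDenoiser sketched earlier:
# samples = multistep_consistency_sample(ToyDenoiser(), (4, 2), sigmas=[80.0, 0.8])
```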

The numerical results confirm that SCT converges faster than its predecessors while producing higher-fidelity samples. In particular, the variance-reduced targets improve both sample quality and training robustness, especially in class-conditional settings.

Implications and Future Directions

The insights derived from modeling the training of consistency models as TD learning open new avenues for both theoretical exploration and practical improvement. The paper suggests several promising directions:

  • Scale and Complexity: While the current experiments focus on traditional benchmarks, extending SCT to larger scale models and applications, such as text-to-image generation, promises significant advancements in real-world deployments.
  • Framework Generalization: The MDP-based framework can be potentially generalized to other domains, including video generation and LLMs, where fast sampling with high fidelity is vital.
  • Hybrid Approaches: Combining SCT with adversarial training techniques holds the potential for generating even more realistic samples while maintaining the efficiency of one-step generation.

In conclusion, this paper significantly advances the understanding and capability of consistency models. By systematically reducing training variance and improving the handling of discretization error, SCT sets a new benchmark in generative modeling. As the landscape of generative models continues to evolve, Stable Consistency Tuning offers a useful analytical and practical toolset for researchers and practitioners alike.