Stable Consistency Tuning: Understanding and Improving Consistency Models (2410.18958v3)

Published 24 Oct 2024 in cs.LG and cs.CV

Abstract: Diffusion models achieve superior generation quality but suffer from slow generation speed due to the iterative nature of denoising. In contrast, consistency models, a new generative family, achieve competitive performance with significantly faster sampling. These models are trained either through consistency distillation, which leverages pretrained diffusion models, or consistency training/tuning directly from raw data. In this work, we propose a novel framework for understanding consistency models by modeling the denoising process of the diffusion model as a Markov Decision Process (MDP) and framing consistency model training as the value estimation through Temporal Difference (TD) Learning. More importantly, this framework allows us to analyze the limitations of current consistency training/tuning strategies. Built upon Easy Consistency Tuning (ECT), we propose Stable Consistency Tuning (SCT), which incorporates variance-reduced learning using the score identity. SCT leads to significant performance improvements on benchmarks such as CIFAR-10 and ImageNet-64. On ImageNet-64, SCT achieves 1-step FID 2.42 and 2-step FID 1.55, a new SoTA for consistency models.


Summary

  • The paper frames the denoising process of diffusion models as a Markov Decision Process (MDP) and consistency-model training as value estimation via Temporal Difference (TD) learning, and on this basis proposes Stable Consistency Tuning (SCT), which reduces the variance of training targets using the score identity.
  • It introduces variance-reduced learning, smoother progressive scheduling, and multistep inference, achieving state-of-the-art results on benchmarks like CIFAR-10 and ImageNet-64.
  • The study outlines practical implications for scaling generative models and suggests extensions to domains such as video generation and text-to-image synthesis.

An Overview of Stable Consistency Tuning: Advancements and Implications

The paper "Stable Consistency Tuning: Understanding and Improving Consistency Models" presents a novel framework to enhance the generation efficiency and stability of consistency models, a class of fast generative models outperforming traditional diffusion models in terms of sampling speed. The authors leverage a Markov Decision Process (MDP) perspective to elucidate the training mechanics of consistency models via Temporal Difference (TD) learning. This innovative approach not only provides deeper insights into the limitations and potential of existing training strategies but also paves the way for significant improvements in model performance.

Framework and Methodology

The crux of this research lies in modeling the denoising process of diffusion models as an MDP and framing consistency-model training as a value-estimation task. This conceptual shift lets the authors treat training as a form of TD learning, where the reward corresponds to the improvement in prediction across adjacent timesteps. It also clarifies the key differences between consistency distillation, which relies on a pretrained diffusion model, and consistency training/tuning directly from raw data, in terms of achievable performance gains and training stability.
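To make the TD reading concrete, the sketch below implements one consistency-training step in PyTorch under an EDM-style parameterization x_t = x_0 + σε: the prediction at a noisier state is regressed onto a stop-gradient target at the adjacent, less-noisy state, just as a bootstrapped value update regresses V(s) onto r + V(s'). The `ToyDenoiser`, function names, and schedule values are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in consistency model f_theta(x, sigma) -> estimate of x0.
    A real model would be a U-Net or transformer; this is purely illustrative."""
    def __init__(self, dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(),
                                 nn.Linear(64, dim))
    def forward(self, x, sigma):
        s = torch.full((x.shape[0], 1), float(sigma))
        return self.net(torch.cat([x, s], dim=-1))

def consistency_td_loss(model, x0, sigma, delta):
    """One consistency-training step read as a TD(0) update."""
    eps = torch.randn_like(x0)
    x_t = x0 + sigma * eps                   # noisy state at level sigma
    x_s = x0 + (sigma - delta) * eps         # adjacent state, one MDP step closer to data
    with torch.no_grad():                    # stop-gradient = bootstrapped TD target
        target = model(x_s, sigma - delta)
    pred = model(x_t, sigma)
    return (pred - target).pow(2).mean()     # squared TD error between adjacent states

model = ToyDenoiser()
x0 = torch.randn(32, 2)                      # toy "data" batch
loss = consistency_td_loss(model, x0, sigma=1.0, delta=0.1)
loss.backward()
```

In this view, shrinking `delta` toward zero is the consistency-training analogue of reducing the TD step size, which is exactly where the progressive schedule discussed below enters.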

Building on this understanding, the authors propose Stable Consistency Tuning (SCT), which introduces several enhancements:

  1. Variance-Reduced Learning: Using the score identity, SCT reduces the variance of the learning targets, yielding more stable training and better performance. The key is a more accurate approximation of the score function, which also carries over to conditional generation settings (a sketch of this estimator follows the list).
  2. Improved Progressive Training Schedule: SCT employs a smoother schedule for decreasing the time interval between states in the MDP, which helps in reducing discretization errors without jeopardizing training stability.
  3. Multistep Inference Strategy: The framework is extended to multistep settings, supporting deterministic multistep sampling (illustrated after the empirical results below). An edge-skipping strategy addresses optimization difficulties near the timestep boundaries, improving multistep performance.
  4. Classifier-Free Guidance: The paper also validates the effectiveness of guiding generation with a weaker version of the model itself, drawing on guidance techniques from other competitive diffusion models.
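As referenced in item 1, here is a hedged sketch of the variance-reduction idea. The single-sample score estimate −(x_t − x_0)/σ² is noisy because it uses only the one x_0 that generated x_t; the score identity ∇ log p_t(x_t) = E[∇ log p(x_t | x_0) | x_t] instead lets the marginal score be estimated as −(x_t − E[x_0 | x_t])/σ², with the posterior mean approximated over a reference batch of clean samples. The function name and the softmax-weighted estimator below are assumptions for illustration; SCT's exact estimator may differ.

```python
import torch

def variance_reduced_score(x_t, ref_batch, sigma):
    """Estimate grad log p_t(x_t) = -(x_t - E[x0 | x_t]) / sigma**2 by
    approximating E[x0 | x_t] with self-normalized posterior weights over a
    reference batch (illustrative; not necessarily SCT's exact estimator)."""
    # Gaussian log-likelihood log p(x_t | x0), up to a constant, per pair
    diff = x_t.unsqueeze(1) - ref_batch.unsqueeze(0)         # (B, R, ...)
    logw = -diff.flatten(2).pow(2).sum(-1) / (2 * sigma**2)  # (B, R)
    w = torch.softmax(logw, dim=1)                           # posterior weights
    x0_hat = torch.einsum('br,r...->b...', w, ref_batch)     # weighted posterior mean
    return -(x_t - x0_hat) / sigma**2                        # lower-variance score

# Usage: in practice the reference batch would be clean training images.
x0 = torch.randn(8, 2)
x_t = x0 + 1.0 * torch.randn_like(x0)
score = variance_reduced_score(x_t, ref_batch=torch.randn(256, 2), sigma=1.0)
```

Averaging over the reference batch shrinks the variance of the training target, which is what stabilizes tuning, particularly in the conditional settings mentioned above.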

Empirical Analysis

SCT demonstrates superior performance over previous consistency-model approaches such as Easy Consistency Tuning (ECT) and improved Consistency Training (iCT). Notably, on CIFAR-10 and ImageNet-64, SCT surpasses existing state-of-the-art methods, achieving a 1-step FID of 2.42 and a 2-step FID of 1.55 on ImageNet-64, a new record for consistency models.
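The 2-step figure comes from the deterministic multistep sampling strategy described above. Below is a minimal sketch of one deterministic variant, in which each step denoises and then re-noises along the direction implied by the current sample rather than with fresh Gaussian noise; whether this matches SCT's exact procedure (including its edge-skipping schedule) is an assumption.

```python
import torch

def multistep_consistency_sample(model, shape, sigmas):
    """Deterministic multistep consistency sampling (illustrative variant).
    `model(x, sigma)` returns a one-shot denoised estimate; `sigmas` is a
    decreasing noise schedule, e.g. [sigma_max, sigma_mid]."""
    x = sigmas[0] * torch.randn(shape)            # start from pure noise
    for i, sigma in enumerate(sigmas):
        x0_hat = model(x, sigma)                  # one-shot denoise at this level
        if i + 1 < len(sigmas):
            eps_hat = (x - x0_hat) / sigma        # noise implied by current sample
            x = x0_hat + sigmas[i + 1] * eps_hat  # deterministic re-noising
    return x0_hat

# Usage with any f(x, sigma) -> x0, e.g. the ToyDenoiser sketched earlier:
# samples = multistep_consistency_sample(ToyDenoiser(), (4, 2), sigmas=[80.0, 0.8])
```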

The numerical results confirm that SCT converges faster than its predecessors while producing higher-fidelity samples. In particular, the variance-reduced targets improve both sample quality and training robustness, especially in class-conditional settings.

Implications and Future Directions

The insights derived from modeling the training of consistency models as TD learning open new avenues for both theoretical exploration and practical improvement. The paper suggests several promising directions:

  • Scale and Complexity: While the current experiments focus on traditional benchmarks, extending SCT to larger scale models and applications, such as text-to-image generation, promises significant advancements in real-world deployments.
  • Framework Generalization: The MDP-based framework can be potentially generalized to other domains, including video generation and LLMs, where fast sampling with high fidelity is vital.
  • Hybrid Approaches: Combining SCT with adversarial training techniques holds the potential for generating even more realistic samples while maintaining the efficiency of one-step generation.

In conclusion, this paper significantly advances the understanding and capability of consistency models. By systematically reducing training variance and improving the handling of discretization error, SCT sets a new benchmark in generative modeling. As the landscape of generative models continues to evolve, Stable Consistency Tuning offers a useful analytical and practical toolset for researchers and practitioners alike.