Stable Consistency Tuning: Understanding and Improving Consistency Models (2410.18958v3)
Abstract: Diffusion models achieve superior generation quality but suffer from slow generation speed due to the iterative nature of denoising. In contrast, consistency models, a new generative family, achieve competitive performance with significantly faster sampling. These models are trained either through consistency distillation, which leverages pretrained diffusion models, or consistency training/tuning directly from raw data. In this work, we propose a novel framework for understanding consistency models by modeling the denoising process of the diffusion model as a Markov Decision Process (MDP) and framing consistency model training as value estimation through Temporal Difference (TD) learning. More importantly, this framework allows us to analyze the limitations of current consistency training/tuning strategies. Building on Easy Consistency Tuning (ECT), we propose Stable Consistency Tuning (SCT), which incorporates variance-reduced learning using the score identity. SCT leads to significant performance improvements on benchmarks such as CIFAR-10 and ImageNet-64. On ImageNet-64, SCT achieves a 1-step FID of 2.42 and a 2-step FID of 1.55, a new state of the art (SoTA) for consistency models.
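To make the TD framing concrete, here is a minimal sketch in assumed notation (the symbols $f_\theta$, $f_{\theta^-}$, $d$, $\lambda$, and the noise schedule are illustrative, not taken verbatim from the paper). The consistency training objective regresses the model's prediction at a noisier state onto a bootstrapped prediction at an adjacent, less noisy state:

$$
\mathcal{L}(\theta) \;=\; \mathbb{E}\Big[\lambda(t_n)\, d\big(f_\theta(\mathbf{x}_{t_{n+1}}, t_{n+1}),\; f_{\theta^-}(\mathbf{x}_{t_n}, t_n)\big)\Big],
$$

where $f_\theta$ is the consistency model, $f_{\theta^-}$ is a stop-gradient (or EMA) copy acting as the target network, $d$ is a distance, and $t_n < t_{n+1}$ are adjacent noise levels; this is exactly the structure of a TD(0) value update. The variance-reduced learning mentioned above rests on the score identity for a Gaussian forward kernel $\mathbf{x}_t = \mathbf{x}_0 + \sigma_t \boldsymbol{\epsilon}$,

$$
\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t) \;=\; \mathbb{E}\!\left[\frac{\mathbf{x}_0 - \mathbf{x}_t}{\sigma_t^2} \,\Big|\, \mathbf{x}_t\right],
$$

so the single-sample target $(\mathbf{x}_0 - \mathbf{x}_t)/\sigma_t^2$ can be replaced by a lower-variance estimate of this conditional expectation.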
- Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233, 2024.
- TRACT: Denoising diffusion models with transitive closure time-distillation. arXiv preprint arXiv:2303.04248, 2023.
- Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023.
- Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22563–22575, 2023.
- Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1xsqj09Fm.
- Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators.
- Riemannian flow matching on general geometries. arXiv preprint arXiv:2302.03660, 2023.
- Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems, pp. 9916–9926, 2019.
- ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.
- Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
- Reinforcement learning for fine-tuning text-to-image diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
- CAT3D: Create anything in 3d with multi-view diffusion models. arXiv preprint arXiv:2405.10314, 2024.
- Consistency models made easy, 2024. URL https://arxiv.org/abs/2406.14548.
- Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
- BOOT: Data-free distillation of denoising diffusion models with bootstrapping. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.
- Multistep consistency models. arXiv preprint arXiv:2403.06807, 2024.
- GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
- Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
- Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.
- simple diffusion: End-to-end diffusion for high resolution images. arXiv preprint arXiv:2301.11093, 2023.
- Scalable adaptive computation for iterative generation. arXiv preprint arXiv:2212.11972, 2022.
- Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems, 33:12104–12114, 2020a.
- Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020b.
- Elucidating the design space of diffusion-based generative models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=k7FuTOWMOc7.
- Guiding a diffusion model with a bad version of itself. arXiv preprint arXiv:2406.02507, 2024a.
- Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24174–24184, 2024b.
- Refining generative process with discriminator guidance in score-based diffusion models. arXiv preprint arXiv:2211.17091, 2022.
- Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. arXiv preprint arXiv:2310.02279, 2023.
- Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021a.
- On density estimation with diffusion models. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), NeurIPS, 2021b. URL https://openreview.net/forum?id=2LdBqxc1Yv.
- Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.
- ACT-Diffusion: Efficient adversarial consistency training for one-step diffusion models, 2024.
- CLLMs: Consistency large language models, 2024.
- Learning multiple layers of features from tiny images. Technical report, 2009.
- SDXL-Lightning: Progressive adversarial diffusion distillation. arXiv preprint arXiv:2402.13929, 2024.
- Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8576–8588, 2024.
- Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
- Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
- DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
- Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388, 2021.
- Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models. arXiv preprint arXiv:2305.18455, 2023.
- OSV: One step is enough for high-quality image-to-video generation. arXiv preprint arXiv:2409.11367, 2024.
- On distillation of guided diffusion models. In CVPR, pp. 14297–14306, 2023.
- Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162–8171. PMLR, 2021.
- Scalable diffusion models with transformers. In ICCV, pp. 4195–4205, October 2023.
- SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
- Consistency policy: Accelerated visuomotor policies via consistency distillation, 2024.
- Learning transferable visual models from natural language supervision. In ICML, pp. 8748–8763. PMLR, 2021.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.
- Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
- StyleGAN-T: Unlocking the power of GANs for fast large-scale text-to-image synthesis. In ICML, pp. 30105–30118. PMLR, 2023a.
- Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023b.
- Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. In ACM SIGGRAPH 2024 Conference Papers, pp. 1–11, 2024.
- MVDream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023.
- Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
- Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
- Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189, 2023.
- Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11918–11930, 2019.
- Improved techniques for training score-based generative models. Advances in Neural Information Processing Systems, 33:12438–12448, 2020.
- Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS.
- Consistency models. arXiv preprint arXiv:2303.01469, 2023.
- Reinforcement learning: An introduction. MIT Press, 2018.
- NVAE: A deep hierarchical variational autoencoder. In Advances in Neural Information Processing Systems, 2020.
- Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34:11287–11302, 2021.
- A connection between score matching and denoising autoencoders. Neural Computation, 23:1661–1674, 2011. URL https://api.semanticscholar.org/CorpusID:5560643.
- Phased consistency model. arXiv preprint arXiv:2405.18407, 2024a.
- AnimateLCM: Accelerating the animation of personalized diffusion models and adapters with decoupled consistency learning. arXiv preprint arXiv:2402.00769, 2024b.
- Rectified diffusion: Straightness is not your need in rectified flow. arXiv preprint arXiv:2410.07303, 2024c.
- Poisson flow generative models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=voV_TRqcWh.
- Stable target field for reduced variance score estimation in diffusion models. arXiv preprint arXiv:2302.00670, 2023.
- Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902, 2022.
- UniPC: A unified predictor-corrector framework for fast sampling of diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
- Fast sampling of diffusion models via operator learning. arXiv preprint arXiv:2211.13449, 2022.
- Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation. In Forty-first International Conference on Machine Learning, 2024.