- The paper introduces TrigFlow, a framework that unifies EDM and Flow Matching in a simplified, purely trigonometric formulation of continuous-time generative modeling.
- It stabilizes training dynamics with an identity time transformation, positional time embeddings, and adaptive normalization, enabling models to scale to 1.5 billion parameters.
- The resulting models achieve competitive FID scores on benchmarks such as CIFAR-10 and ImageNet using only two sampling steps, coming within roughly 10% of state-of-the-art diffusion models.
Overview of "Simplifying, Stabilizing & Scaling Continuous-Time Consistency Models"
The paper by Cheng Lu and Yang Song presents advances in continuous-time consistency models (CMs) for generative modeling. Consistency models, a class of generative models derived from diffusion models, are designed to address the main computational inefficiency of traditional diffusion sampling, which typically requires many network evaluations to generate high-quality samples.
Core Contributions
- Unified Framework with TrigFlow: The authors introduce TrigFlow, a formulation that unifies EDM and Flow Matching while simplifying both. The noising process, the model parameterization, and the training objective all reduce to simple trigonometric expressions of a single time variable (a sketch of the interpolation and the consistency-model parameterization follows this list).
- Stabilization of Training Dynamics: The paper traces the instability of continuous-time CM training to the time-derivative term in the objective, much of which originates in the network's time conditioning. An identity time transformation, positional time embeddings, and adaptive normalization in the conditioning layers tame this term (see the time-embedding sketch after this list), allowing the models to be scaled to 1.5 billion parameters on datasets such as ImageNet 512×512.
- Training and Sampling Enhancements: The authors propose tangent normalization and adaptive weighting to reduce the variance of the gradients, improving both training stability and sample quality, together with a progressive annealing (warmup) of the tangent term that lets the models reach strong results with fewer computational resources (a tangent-normalization sketch follows this list).
- Impressive Numerical Results: With only two sampling steps, the proposed models achieve competitive Fréchet Inception Distance (FID) scores, such as 2.06 on CIFAR-10 and 1.48 on ImageNet 64×64, bringing consistency models within roughly 10% of state-of-the-art diffusion models (the two-step sampling procedure is sketched below).
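The core TrigFlow quantities are compact enough to write out directly: a trigonometric interpolation between data and noise, and a consistency model parameterized on top of a network F_theta. A minimal PyTorch sketch follows; the function names, broadcasting of t, and argument handling are my own, and sigma_d denotes the data standard deviation as in the paper.

```python
import torch

def trigflow_noise(x0, z, t):
    # TrigFlow interpolation: x_t = cos(t) * x0 + sin(t) * z, with t in [0, pi/2],
    # x0 drawn from the data and z ~ N(0, sigma_d^2 I).
    return torch.cos(t) * x0 + torch.sin(t) * z

def consistency_fn(F_theta, x_t, t, sigma_d):
    # TrigFlow consistency-model parameterization:
    #   f_theta(x_t, t) = cos(t) * x_t - sin(t) * sigma_d * F_theta(x_t / sigma_d, t)
    # At t = 0 this reduces to f_theta(x_0, 0) = x_0, the boundary condition.
    return torch.cos(t) * x_t - torch.sin(t) * sigma_d * F_theta(x_t / sigma_d, t)
```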
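The stabilization changes largely concern how time enters the network. Below is a sketch of a positional (sinusoidal) time embedding fed with the raw TrigFlow time via the identity transform c_noise(t) = t; the embedding dimension and max_period are illustrative assumptions, not values from the paper.

```python
import math
import torch

def positional_time_embedding(t, dim=256, max_period=10_000.0):
    # Sinusoidal ("positional") embedding applied directly to the TrigFlow time,
    # i.e. with the identity transform c_noise(t) = t. Its frequencies are bounded,
    # so its derivative with respect to t stays bounded as well, which is what makes
    # the time-derivative term in the continuous-time objective easier to control.
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    angles = t[:, None] * freqs[None, :]
    return torch.cat([torch.cos(angles), torch.sin(angles)], dim=-1)
```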
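Tangent normalization is simple to state: the tangent appearing in the continuous-time objective is rescaled by its own norm before entering the loss, so that occasional very large tangents cannot dominate the gradient. A sketch under the assumption of per-sample normalization; the constant c is a small illustrative value, and the paper's warmup (annealing the time-derivative part of the tangent from 0 to 1) is omitted here.

```python
import torch

def normalize_tangent(g, c=0.1):
    # Tangent normalization: divide each sample's tangent by its norm plus a small
    # constant, bounding the contribution of any single sample to the gradient.
    norm = g.flatten(start_dim=1).norm(dim=1).view(-1, *([1] * (g.dim() - 1)))
    return g / (norm + c)
```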
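The two-step results come from multistep consistency sampling specialized to TrigFlow: denoise from t_max = pi/2, re-noise the estimate to an intermediate time with fresh noise, and denoise once more. A sketch assuming a consistency model f_theta(x, t) as parameterized above; the intermediate time t_mid is an illustrative choice, not the paper's tuned value.

```python
import math
import torch

@torch.no_grad()
def two_step_sample(f_theta, shape, sigma_d, t_mid=1.1):
    # Step 1: start from pure noise at t_max = pi/2 and jump straight to a data estimate.
    t_max = math.pi / 2
    x = sigma_d * torch.randn(shape)
    x0_hat = f_theta(x, torch.full((shape[0],), t_max))
    # Step 2: re-noise the estimate to the intermediate time t_mid with fresh noise,
    # then apply the consistency model once more.
    z = sigma_d * torch.randn(shape)
    x_mid = math.cos(t_mid) * x0_hat + math.sin(t_mid) * z
    return f_theta(x_mid, torch.full((shape[0],), t_mid))
```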
Implications and Future Directions
The innovations in parameterization, stability, and sampling efficiency underscore the potential of continuous-time CMs as viable alternatives to many-step diffusion models. The simplifications allow model architectures to scale without sacrificing performance, suggesting a path toward even larger and more capable generative models.
Future research could focus on further optimizing computational efficiency, potentially integrating these models with emerging hardware accelerators. Additionally, exploring other domains such as video or 3D generation could reveal broader applicability.
Conclusion
This paper contributes substantially to the field of generative modeling by addressing longstanding challenges in training stability and computational efficiency. The strategies proposed around the simplified TrigFlow formulation, combined with stabilized training dynamics, represent a meaningful step toward more practical and scalable generative models. The results demonstrate a narrowing gap with diffusion models, positioning continuous-time CMs as promising contenders for fast, high-quality generation.