One-Step Diffusion Distillation via Deep Equilibrium Models (2401.08639v1)
Abstract: Diffusion models excel at producing high-quality samples but naively require hundreds of iterations, prompting multiple attempts to distill the generation process into a faster network. However, many existing approaches suffer from a variety of challenges: the distillation training process can be complex, often requiring multiple training stages, and the resulting models perform poorly when utilized in single-step generative applications. In this paper, we introduce a simple yet effective means of distilling diffusion models directly from the initial noise to the resulting image. Central to our approach is the use of a new Deep Equilibrium (DEQ) model as the distilled architecture: the Generative Equilibrium Transformer (GET). Our method enables fully offline training with just noise/image pairs from the diffusion model, while achieving superior performance compared to existing one-step methods on comparable training budgets. We demonstrate that the DEQ architecture is crucial to this capability, as GET matches a $5\times$ larger ViT in terms of FID scores while striking a critical balance of computational cost and image quality. Code, checkpoints, and datasets are available.
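
To make the abstract's description concrete, below is a minimal, assumption-laden sketch of the two ideas it highlights: a weight-tied equilibrium block iterated to a fixed point (standing in for the Generative Equilibrium Transformer), and fully offline distillation on pre-computed noise/image pairs drawn from a teacher diffusion sampler. The module names (`EquilibriumBlock`, `OneStepDEQGenerator`, `distill_offline`), the plain fixed-point loop used in place of an accelerated solver, and the MSE reconstruction loss are illustrative choices, not the paper's released implementation.

```python
# A minimal sketch (not the authors' code) of (1) a deep-equilibrium forward
# pass that repeatedly applies an input-injected block until its output stops
# changing, and (2) fully offline distillation on pre-computed (noise, image)
# pairs from a teacher diffusion model. Shapes and modules are illustrative.
import torch
import torch.nn as nn


class EquilibriumBlock(nn.Module):
    """One weight-tied update z <- f(z, x), iterated to a fixed point."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, z: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # The input injection x (encoded noise) conditions every iteration.
        return self.norm(z + self.mlp(torch.cat([z, x], dim=-1)))


class OneStepDEQGenerator(nn.Module):
    """Maps an initial noise tensor directly to an image in a single call."""

    def __init__(self, img_dim: int, hidden_dim: int = 256,
                 max_iters: int = 24, tol: float = 1e-3):
        super().__init__()
        self.encode = nn.Linear(img_dim, hidden_dim)
        self.block = EquilibriumBlock(hidden_dim)
        self.decode = nn.Linear(hidden_dim, img_dim)
        self.max_iters, self.tol = max_iters, tol

    def forward(self, noise: torch.Tensor) -> torch.Tensor:
        x = self.encode(noise)
        z = torch.zeros_like(x)
        # Plain fixed-point iteration; practical DEQs typically use accelerated
        # solvers (e.g. Anderson acceleration) and implicit gradients instead.
        for _ in range(self.max_iters):
            z_next = self.block(z, x)
            if (z_next - z).norm() / (z.norm() + 1e-8) < self.tol:
                z = z_next
                break
            z = z_next
        return self.decode(z)


def distill_offline(pairs, img_dim: int, epochs: int = 1, lr: float = 1e-4):
    """Offline distillation: regress teacher images from their paired noises."""
    model = OneStepDEQGenerator(img_dim)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for noise, image in pairs:  # pairs generated once by the teacher
            loss = (model(noise) - image).pow(2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model


if __name__ == "__main__":
    # Toy stand-in for a dataset of (noise, image) pairs from a teacher sampler.
    dim = 32 * 32 * 3
    fake_pairs = [(torch.randn(8, dim), torch.randn(8, dim)) for _ in range(4)]
    student = distill_offline(fake_pairs, img_dim=dim)
    sample = student(torch.randn(1, dim))  # one-step generation
    print(sample.shape)
```

Because the noise/image pairs are produced once by the teacher and then reused, the expensive multi-step diffusion sampling is decoupled from student training; this is what "fully offline" distillation refers to in the abstract.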