
EM Distillation for One-step Diffusion Models (2405.16852v2)

Published 27 May 2024 in cs.LG, cs.AI, and stat.ML

Abstract: While diffusion models can learn complex distributions, sampling requires a computationally expensive iterative process. Existing distillation methods enable efficient sampling, but have notable limitations, such as performance degradation with very few sampling steps, reliance on training data access, or mode-seeking optimization that may fail to capture the full distribution. We propose EM Distillation (EMD), a maximum likelihood-based approach that distills a diffusion model to a one-step generator model with minimal loss of perceptual quality. Our approach is derived through the lens of Expectation-Maximization (EM), where the generator parameters are updated using samples from the joint distribution of the diffusion teacher prior and inferred generator latents. We develop a reparametrized sampling scheme and a noise cancellation technique that together stabilize the distillation process. We further reveal an interesting connection between our method and existing methods that minimize mode-seeking KL. EMD outperforms existing one-step generative methods in terms of FID scores on ImageNet-64 and ImageNet-128, and compares favorably with prior work on distilling text-to-image diffusion models.

Authors (9)
  1. Sirui Xie
  2. Zhisheng Xiao
  3. Diederik P Kingma
  4. Tingbo Hou
  5. Ying Nian Wu
  6. Kevin Patrick Murphy
  7. Tim Salimans
  8. Ben Poole
  9. Ruiqi Gao

Summary

Expectation-Maximization Distillation for Diffusion Models: Efficient One-Step Generation

The paper presents EM Distillation (EMD), a method for reducing the computational cost of sampling from diffusion models. Although diffusion models generate high-quality images, their iterative sampling process is resource-intensive. Existing distillation techniques help, but they often degrade sharply when the number of sampling steps is reduced to one or a few, depend on access to the original training data, or fail to capture the full data distribution because of mode-seeking optimization biases.

Proposed Method: EM Distillation (EMD)

EMD uses an Expectation-Maximization (EM) framework to distill a diffusion model into a one-step generator with minimal loss of perceptual quality. The generator's parameters are updated using samples drawn from the joint distribution of the diffusion teacher's prior and the inferred generator latents. To stabilize distillation, the authors introduce a reparametrized sampling scheme and a noise cancellation technique. They also establish a connection between EMD and existing methods that minimize a mode-seeking KL divergence.

Mechanism

EMD starts from a pretrained diffusion model, the teacher, whose forward process gradually transforms the complex data distribution into a Gaussian; data are generated by reversing this process, solving an SDE or an equivalent ODE. EMD's core contribution is to frame the distillation of this reverse transformation as Expectation-Maximization:

  • E-step (Expectation Step): infers generator latents for samples from the teacher's distribution and uses Monte Carlo samples to estimate the expected learning gradient.
  • M-step (Maximization Step): updates the generator parameters by gradient ascent on the estimated expectation.
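
The E/M alternation can be caricatured in a few lines of PyTorch. This is a toy sketch, not the paper's algorithm: the score of a standard Gaussian stands in for the diffusion teacher, the generator is a single linear layer, and plain short-run Langevin dynamics stands in for EMD's reparametrized sampler.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyGenerator(nn.Module):
    """Stand-in one-step generator: a single linear map from latent to data."""
    def __init__(self, latent_dim=8, data_dim=8):
        super().__init__()
        self.latent_dim = latent_dim
        self.net = nn.Linear(latent_dim, data_dim)

    def forward(self, z):
        return self.net(z)

def teacher_score(x):
    # Score of a standard Gaussian stands in for the diffusion teacher's score.
    return -x

def em_distill_step(generator, opt, batch=32, n_mcmc=4, step_size=0.01):
    # E-step: sample from the generator, then refine the samples toward the
    # teacher's distribution with a few short-run Langevin steps (a crude
    # stand-in for the paper's reparametrized sampler).
    z = torch.randn(batch, generator.latent_dim)
    with torch.no_grad():
        x = generator(z)
        for _ in range(n_mcmc):
            x = x + step_size * teacher_score(x) \
                  + (2 * step_size) ** 0.5 * torch.randn_like(x)
    # M-step: move the generator's output for the same latents toward the
    # refined samples.
    opt.zero_grad()
    loss = ((generator(z) - x) ** 2).mean()
    loss.backward()
    opt.step()
    return loss.item()

gen = ToyGenerator()
opt = torch.optim.Adam(gen.parameters(), lr=1e-2)
losses = [em_distill_step(gen, opt) for _ in range(5)]
```

The key structural point the sketch preserves is that the samples the generator is trained on come from MCMC refinement under the teacher, not from a dataset.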

A reparametrized sampling scheme, coupled with noise cancellation, is developed to keep distillation stable across the wide range of noise levels a diffusion model operates over. The reparametrization also simplifies hyperparameter tuning and improves the behavior of short-run MCMC (Markov chain Monte Carlo).
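
The "noise levels" here are the intermediate marginals of the teacher's forward process. A minimal NumPy illustration of sampling a noised data point at a given level, using the standard variance-preserving parametrization x_t = alpha_t * x_0 + sigma_t * eps (the cosine schedule and shapes below are illustrative choices, not the paper's):

```python
import numpy as np

def forward_diffuse(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) for a variance-preserving forward process."""
    alpha_t = np.cos(0.5 * np.pi * t)  # signal coefficient: 1 at t=0, 0 at t=1
    sigma_t = np.sin(0.5 * np.pi * t)  # noise coefficient: alpha^2 + sigma^2 = 1
    eps = rng.standard_normal(x0.shape)
    return alpha_t * x0 + sigma_t * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 3, 8, 8))    # toy batch standing in for images
xt = forward_diffuse(x0, t=0.5, rng=rng)  # half-noised samples
```

Stability across all values of t is what the reparametrization targets, since the relative scale of signal and noise changes continuously along the process.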

Numerical Results and Empirical Validation

EMD outperforms existing one-step generation methods across multiple benchmarks, reporting FID scores of 2.20 on ImageNet-64 and 6.0 on ImageNet-128, along with competitive results on one-step text-to-image generation when distilling Stable Diffusion models.
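
For reference, FID compares Gaussians fitted to Inception-v3 features of real and generated images. A minimal implementation of the Frechet distance itself, assuming the feature means and covariances have already been computed:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """FID between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 @ sigma2)^{1/2})."""
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerics
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Identical distributions give distance 0; shifting one mean by v adds ||v||^2.
d0 = frechet_distance(np.zeros(2), np.eye(2), np.zeros(2), np.eye(2))
d1 = frechet_distance(np.zeros(2), np.eye(2), np.ones(2), np.eye(2))
```

In practice the statistics come from tens of thousands of 2048-dimensional Inception pool features, so lower scores like 2.20 reflect very close feature distributions.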

Comparative Analysis

Compared with traditional trajectory distillation techniques and distribution matching approaches, EMD excels particularly in the one-step sampling regime.

  • Trajectory Distillation: These methods reduce the number of sampling steps but struggle in the one-step regime, since they are designed to progressively approximate the solution of the differential equation associated with the forward diffusion process.
  • Distribution Matching: These methods allow arbitrary generators and often produce compelling results, but they tend to drop modes because they minimize divergences that concentrate on the most likely modes.
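
Mode-seeking behavior is easy to demonstrate on a toy problem: fitting a single Gaussian to a two-mode mixture by Monte Carlo minimization of the reverse KL, KL(q || p), locks onto one mode rather than spreading over both. A small PyTorch illustration (not from the paper):

```python
import math
import torch

torch.manual_seed(0)

def log_mixture(x, m=4.0):
    """Log density of the target p(x) = 0.5 N(-m, 1) + 0.5 N(m, 1)."""
    log_norm = -0.5 * math.log(2 * math.pi) + math.log(0.5)
    comps = torch.stack([-0.5 * (x + m) ** 2, -0.5 * (x - m) ** 2])
    return torch.logsumexp(comps, dim=0) + log_norm

# Variational family q = N(mu, sigma^2), initialized near the right-hand mode.
mu = torch.tensor(3.0, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for _ in range(500):
    eps = torch.randn(512)
    x = mu + log_sigma.exp() * eps                  # reparametrized samples
    log_q = -0.5 * eps ** 2 - log_sigma - 0.5 * math.log(2 * math.pi)
    loss = (log_q - log_mixture(x)).mean()          # Monte Carlo KL(q || p)
    opt.zero_grad()
    loss.backward()
    opt.step()

# q collapses onto the +m mode (mu near 4, sigma near 1) and assigns almost
# no mass to the equal-weight mode at -m: the reverse KL is mode-seeking.
```

A mode-covering divergence (e.g. forward KL) would instead force q to widen until it straddles both modes, which is the failure mode EMD's maximum-likelihood framing is designed to avoid.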

Theoretical and Practical Implications

Theoretically, EMD applies EM in a novel way to stabilize and improve diffusion model distillation, pushing the efficiency frontier of generative models. Practically, it makes real-time generation applications more feasible by reducing sampling to a single network evaluation.

Future Directions

The paper opens up numerous avenues for future research:

  1. From-Scratch Training: Investigation into achieving competitive performance with randomly initialized generator networks, potentially without model architecture constraints.
  2. Optimization Paradigms: Enhanced strategies for tuning MCMC sampling schemes to pare down training costs while preserving performance, thus refining the computational trade-offs.
  3. Architectural Flexibility: Exploration into more diverse generator architectures which may better capture data distributions across varied domains.

Conclusion

EMD emerges as a substantial advancement in diffusion models, offering a methodologically robust and computationally efficient answer to the cost of iterative sampling. The work addresses key bottlenecks in performance and stability and sets the stage for broader real-time application of generative models, combining empirical success with a clean theoretical framing.