- The paper introduces a multistep distillation framework that leverages moment matching to drastically reduce the inference steps required by diffusion models.
- It employs alternating optimization and parameter-space moment matching, attaining state-of-the-art FID scores and improved Inception scores on ImageNet.
- The approach enhances efficiency in generative tasks, enabling faster image and text-to-image synthesis for practical, real-world applications.
Multistep Distillation of Diffusion Models via Moment Matching
Introduction
Diffusion models are a class of generative models that have shown exceptional performance in generating images, video, audio, and other modalities. The key insight of diffusion models is to decompose the generation of high-dimensional outputs into an iterative denoising process. While this approach simplifies the training objective, it imposes a significant computational burden during inference, as sampling typically requires hundreds of neural network evaluations.
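The iterative denoising process can be sketched in a few lines. The denoiser and noise schedule below are toy stand-ins (a real diffusion model uses a trained neural network and a carefully chosen schedule); the point is only the structure of the loop: start from pure noise, predict clean data, and re-noise to a slightly lower noise level, repeatedly.

```python
import numpy as np

def toy_denoiser(x_t, t):
    """Stand-in for a trained denoising network predicting clean data x-hat.

    A real diffusion model would be a neural net; here we simply shrink the
    noisy input, which is enough to illustrate the sampling loop.
    """
    return (1.0 - t) * x_t

def sample(num_steps=100, dim=4, seed=0):
    """Iterative denoising: start from pure noise, refine over num_steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)                  # x_1 ~ N(0, I)
    times = np.linspace(1.0, 0.0, num_steps + 1)
    for t, s in zip(times[:-1], times[1:]):
        x_hat = toy_denoiser(x, t)                # predict clean data
        # Re-noise the prediction to the lower noise level s (toy schedule):
        x = (1.0 - s) * x_hat + s * rng.standard_normal(dim)
    return x

result = sample()
```

Each iteration requires one evaluation of the denoiser, which is why sampling with hundreds of steps is expensive and why reducing the step count matters.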
To address the high inference cost, the paper "Multistep Distillation of Diffusion Models via Moment Matching" presents a method to distill many-step diffusion models into few-step models by matching conditional expectations of the clean data given noisy data along the sampling trajectory. This method extends one-step distillation methods to multiple steps, offering new insights through the framework of moment matching. Importantly, the distilled models demonstrate state-of-the-art performance on the ImageNet dataset and show promising results for text-to-image generation.
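Schematically, the distillation objective described above can be stated as follows (a paraphrase in the notation of the Method section, not the paper's exact loss): for noisy data $x_t$ along the sampling trajectory, the student generator $g_\eta$ should match the teacher's conditional expectation of the clean data,

$$\min_\eta \; \mathbb{E}_{t,\,x_t}\,\big\| \mathbb{E}_q[x \mid x_t] - \mathbb{E}_{g_\eta}[x \mid x_t] \big\|^2,$$

where $\mathbb{E}_q[\cdot]$ is the expectation under the data (teacher) distribution and $\mathbb{E}_{g_\eta}[\cdot]$ the expectation under the student's generative distribution.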
Method
The paper introduces a new distillation technique for diffusion models, focusing on minimizing the number of sampling steps required while maintaining or even improving the quality of the generated samples. Two key strategies are proposed: alternating optimization and parameter-space moment matching.
Alternating Optimization
The first approach involves alternating optimization between the generator model g_η and an auxiliary denoising model g_φ. The procedure entails:
- Sampling noisy data x_1 ∼ N(0, I).
- Iteratively refining via x̂ = g_η(x_t, t) and sampling less noisy data x_s ∼ q(x_s | x_t, x̂) for subsequent timesteps s < t.
- Fitting the auxiliary model g_φ to predict x̂ from x_s, while staying close to the teacher model g_θ.
- Matching moments by minimizing the L2 distance between the conditional expectations E_g[x̂ | x_s] and E_q[x | x_s].
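One round of the alternating loop above can be sketched with toy linear models standing in for g_η, g_φ, and g_θ. Everything here is an illustrative assumption, not the paper's exact algorithm: the models are linear maps rather than neural denoisers, the schedule and learning rate are arbitrary, and the generator update treats the moment gap as a fixed (stop-gradient) descent direction.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, batch = 8, 16
lr = 1e-2

# Toy stand-ins: each "model" is a linear map x -> x @ W predicting x-hat.
theta = 0.1 * rng.standard_normal((dim, dim))   # frozen teacher g_theta
eta = theta.copy()                              # student generator g_eta
phi = theta.copy()                              # auxiliary denoiser g_phi

def q_sample(x_hat, s, rng):
    """Sample less-noisy data x_s ~ q(x_s | x_hat), toy linear schedule."""
    return (1.0 - s) * x_hat + s * rng.standard_normal(x_hat.shape)

for _ in range(200):
    s = 0.5
    x_t = rng.standard_normal((batch, dim))     # x_1 ~ N(0, I)
    x_hat = x_t @ eta                           # generator's clean estimate
    x_s = q_sample(x_hat, s, rng)

    # (1) Fit auxiliary g_phi to denoise the generator's samples, so that
    #     x_s @ phi approximates E_g[x-hat | x_s].
    phi -= lr * x_s.T @ (x_s @ phi - x_hat) / batch

    # (2) Update the generator so its conditional moment (via g_phi) moves
    #     toward the teacher's E_q[x | x_s] (via g_theta); the moment gap
    #     is treated as a fixed descent direction.
    gap = x_s @ phi - x_s @ theta               # E_g[.|x_s] - E_q[.|x_s]
    eta -= lr * x_t.T @ gap / batch

final_gap = np.linalg.norm(gap)
```

Note the alternation: step (1) keeps the auxiliary model an up-to-date denoiser for the current generator's samples, and step (2) uses it to push the generator's conditional moments toward the teacher's.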
Parameter-Space Moment Matching
The second approach introduces an instantaneous variant in which moment matching is performed in parameter space rather than data space. This variant:
- Defines the auxiliary model parameters via a single, infinitesimally small gradient-descent step from the teacher parameters.
- Preconditions this gradient with a scaling matrix Λ.
- Computes the generator loss by matching expected teacher gradients, minimizing the resulting objective L̃_instant(η).
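The core idea above can be illustrated numerically: a single preconditioned gradient step from the teacher parameters defines the auxiliary parameters, and in the infinitesimal limit the displacement θ − φ is proportional to the expected teacher gradient, which is why the objective reduces to matching expected teacher gradients. The toy quadratic loss and the ε scale below are assumptions for illustration, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4
theta = rng.standard_normal(dim)          # teacher parameters (toy vector)

def teacher_grad(theta, batch):
    """Gradient of a toy quadratic denoising loss at the teacher params."""
    return theta - batch.mean(axis=0)

batch = rng.standard_normal((32, dim))
eps = 1e-3
Lam = eps * np.eye(dim)                   # scaling matrix (preconditioner)

# A single, infinitesimally small preconditioned gradient step defines the
# auxiliary parameters phi, instead of training an auxiliary model to
# convergence as in the alternating variant:
phi = theta - Lam @ teacher_grad(theta, batch)

# In the eps -> 0 limit, (theta - phi) is proportional to the expected
# teacher gradient, so matching moments through phi amounts to matching
# expected teacher gradients -- the quantity behind the instantaneous loss.
displacement = theta - phi
```

The practical appeal is that no inner optimization over the auxiliary model is needed; the price is that the update direction is only a first-order approximation of the fully fitted auxiliary model.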
Results
The efficacy of the proposed method is demonstrated through extensive experiments on the ImageNet dataset at resolutions of 64×64 and 128×128, as well as on a large text-to-image model at 512×512 resolution. Key findings include:
- State-of-the-art FID scores on ImageNet with only a few sampling steps, surpassing undistilled models that use more than 1,000 steps.
- Improved Inception Scores without requiring classifier-free guidance.
- Superior performance of the alternating-optimization variant when very few sampling steps are used.
- Application to a large text-to-image model, producing high-quality images quickly.
Implications
The implications of this research are significant for both theoretical and practical aspects:
- Theoretical Insight: Introducing the moment matching framework for distillation provides a unifying perspective that encompasses and extends existing one-step methods. This framework elucidates the importance of conditional moment matching over merely fitting marginal distributions.
- Practical Application: The proposed methods dramatically reduce the computational cost of inference for diffusion models, making them more feasible for real-world applications. This advancement opens up diffusion models for broader usage in time-sensitive applications such as real-time video generation and interactive media.
- Future Directions: Future research could delve into the theoretical guarantees of multistep distillation, exploring convergence properties and the potential to further optimize sampling schedules. Additionally, human evaluations complementing automated metrics could provide richer insights into the perceptual quality of generated images.
Conclusion
"Multistep Distillation of Diffusion Models via Moment Matching" presents a robust and effective method for reducing the inference cost of diffusion models. By extending one-step distillation methods to the multistep setting with moment matching, the paper not only achieves state-of-the-art generative performance but also sets the stage for future developments in more efficient and scalable diffusion models.