- The paper introduces a multistep distillation framework that leverages moment matching to drastically reduce the inference steps required by diffusion models.
- It employs alternating optimization and parameter-space moment matching, attaining state-of-the-art FID scores and improved Inception scores on ImageNet.
- The approach enhances efficiency in generative tasks, enabling faster image and text-to-image synthesis for practical, real-world applications.
Multistep Distillation of Diffusion Models via Moment Matching
Introduction
Diffusion models are a class of generative models that have shown exceptional performance in generating images, video, audio, and other modalities. The key insight of diffusion models is to decompose the generation of high-dimensional outputs into an iterative denoising process. While this approach simplifies the training objective, it imposes a significant computational burden during inference, as sampling typically requires hundreds of neural network evaluations.
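The iterative denoising process can be sketched in a few lines. The denoiser and noise schedule below are toy stand-ins (a real diffusion model uses a trained neural network and a carefully chosen schedule); the point is only the structure of the loop: start from pure noise, predict clean data, and re-noise to a slightly lower noise level, repeatedly.

```python
import numpy as np

def toy_denoiser(x_t, t):
    """Stand-in for a trained denoising network predicting clean data x-hat.

    A real diffusion model would be a neural net; here we simply shrink the
    noisy input, which is enough to illustrate the sampling loop.
    """
    return (1.0 - t) * x_t

def sample(num_steps=100, dim=4, seed=0):
    """Iterative denoising: start from pure noise, refine over num_steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)                  # x_1 ~ N(0, I)
    times = np.linspace(1.0, 0.0, num_steps + 1)
    for t, s in zip(times[:-1], times[1:]):
        x_hat = toy_denoiser(x, t)                # predict clean data
        # Re-noise the prediction to the lower noise level s (toy schedule):
        x = (1.0 - s) * x_hat + s * rng.standard_normal(dim)
    return x

result = sample()
```

Each iteration requires one evaluation of the denoiser, which is why sampling with hundreds of steps is expensive and why reducing the step count matters.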
To address the high inference cost, the paper "Multistep Distillation of Diffusion Models via Moment Matching" presents a method to distill many-step diffusion models into few-step models by matching conditional expectations of the clean data given noisy data along the sampling trajectory. This method extends one-step distillation methods to multiple steps, offering new insights through the framework of moment matching. Importantly, the distilled models demonstrate state-of-the-art performance on the ImageNet dataset and show promising results for text-to-image generation.
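Schematically, the distillation objective described above can be stated as follows (a paraphrase in the notation of the Method section, not the paper's exact loss): for noisy data $x_t$ along the sampling trajectory, the student generator $g_\eta$ should match the teacher's conditional expectation of the clean data,

$$\min_\eta \; \mathbb{E}_{t,\,x_t}\,\big\| \mathbb{E}_q[x \mid x_t] - \mathbb{E}_{g_\eta}[x \mid x_t] \big\|^2,$$

where $\mathbb{E}_q[\cdot]$ is the expectation under the data (teacher) distribution and $\mathbb{E}_{g_\eta}[\cdot]$ the expectation under the student's generative distribution.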
Method
The paper introduces a new distillation technique for diffusion models, focusing on minimizing the number of sampling steps required while maintaining or even improving the quality of the generated samples. Two key strategies are proposed: alternating optimization and parameter-space moment matching.
Alternating Optimization
The first approach involves alternating optimization between the generator model g_η and an auxiliary denoising model g_φ. The procedure entails:
- Sampling noisy data x_1 ∼ N(0, I).
- Iteratively refining via x̂ = g_η(x_t, t) and sampling less noisy data x_s ∼ q(x_s | x_t, x̂) for subsequent timesteps s < t.
- Fitting the auxiliary model g_φ to predict x̂ from x_s, while staying close to the teacher model g_θ.
- Matching moments by minimizing the L2 distance between the conditional expectations E_g[x̂ | x_s] and E_q[x | x_s].
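One round of the alternating loop above can be sketched with toy linear models standing in for g_η, g_φ, and g_θ. Everything here is an illustrative assumption, not the paper's exact algorithm: the models are linear maps rather than neural denoisers, the schedule and learning rate are arbitrary, and the generator update treats the moment gap as a fixed (stop-gradient) descent direction.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, batch = 8, 16
lr = 1e-2

# Toy stand-ins: each "model" is a linear map x -> x @ W predicting x-hat.
theta = 0.1 * rng.standard_normal((dim, dim))   # frozen teacher g_theta
eta = theta.copy()                              # student generator g_eta
phi = theta.copy()                              # auxiliary denoiser g_phi

def q_sample(x_hat, s, rng):
    """Sample less-noisy data x_s ~ q(x_s | x_hat), toy linear schedule."""
    return (1.0 - s) * x_hat + s * rng.standard_normal(x_hat.shape)

for _ in range(200):
    s = 0.5
    x_t = rng.standard_normal((batch, dim))     # x_1 ~ N(0, I)
    x_hat = x_t @ eta                           # generator's clean estimate
    x_s = q_sample(x_hat, s, rng)

    # (1) Fit auxiliary g_phi to denoise the generator's samples, so that
    #     x_s @ phi approximates E_g[x-hat | x_s].
    phi -= lr * x_s.T @ (x_s @ phi - x_hat) / batch

    # (2) Update the generator so its conditional moment (via g_phi) moves
    #     toward the teacher's E_q[x | x_s] (via g_theta); the moment gap
    #     is treated as a fixed descent direction.
    gap = x_s @ phi - x_s @ theta               # E_g[.|x_s] - E_q[.|x_s]
    eta -= lr * x_t.T @ gap / batch

final_gap = np.linalg.norm(gap)
```

Note the alternation: step (1) keeps the auxiliary model an up-to-date denoiser for the current generator's samples, and step (2) uses it to push the generator's conditional moments toward the teacher's.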
Parameter-Space Moment Matching
The second approach introduces an instantaneous variant in which moment matching is performed in parameter space rather than data space. This variant:
- Defines the auxiliary model parameters via a single, infinitesimally small gradient-descent step from the teacher parameters.
- Preconditions this gradient with a scaling matrix Λ.
- Computes the generator loss by matching expected teacher gradients, minimizing the resulting objective L̃_instant(η).
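The core idea above can be illustrated numerically: a single preconditioned gradient step from the teacher parameters defines the auxiliary parameters, and in the infinitesimal limit the displacement θ − φ is proportional to the expected teacher gradient, which is why the objective reduces to matching expected teacher gradients. The toy quadratic loss and the ε scale below are assumptions for illustration, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4
theta = rng.standard_normal(dim)          # teacher parameters (toy vector)

def teacher_grad(theta, batch):
    """Gradient of a toy quadratic denoising loss at the teacher params."""
    return theta - batch.mean(axis=0)

batch = rng.standard_normal((32, dim))
eps = 1e-3
Lam = eps * np.eye(dim)                   # scaling matrix (preconditioner)

# A single, infinitesimally small preconditioned gradient step defines the
# auxiliary parameters phi, instead of training an auxiliary model to
# convergence as in the alternating variant:
phi = theta - Lam @ teacher_grad(theta, batch)

# In the eps -> 0 limit, (theta - phi) is proportional to the expected
# teacher gradient, so matching moments through phi amounts to matching
# expected teacher gradients -- the quantity behind the instantaneous loss.
displacement = theta - phi
```

The practical appeal is that no inner optimization over the auxiliary model is needed; the price is that the update direction is only a first-order approximation of the fully fitted auxiliary model.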
Results
The efficacy of the proposed method is demonstrated through extensive experiments on the ImageNet dataset at resolutions of 64×64 and 128×128, as well as on a large text-to-image model at 512×512 resolution. Key findings include:
- State-of-the-art FID scores on ImageNet with only a few sampling steps, surpassing undistilled models that use more than 1,000 steps.
- Improved Inception Scores without requiring classifier-free guidance.
- Superior performance of the alternating-optimization variant when very few sampling steps are used.
- Application to a large text-to-image model, producing high-quality images quickly.
Implications
The implications of this research are significant for both theoretical and practical aspects:
- Theoretical Insight: Introducing the moment matching framework for distillation provides a unifying perspective that encompasses and extends existing one-step methods. This framework elucidates the importance of conditional moment matching over merely fitting marginal distributions.
- Practical Application: The proposed methods dramatically reduce the computational cost of inference for diffusion models, making them more feasible for real-world applications. This advancement opens up diffusion models for broader usage in time-sensitive applications such as real-time video generation and interactive media.
- Future Directions: Future research could delve into the theoretical guarantees of multistep distillation, exploring convergence properties and the potential to further optimize sampling schedules. Additionally, human evaluations complementing automated metrics could provide richer insights into the perceptual quality of generated images.
Conclusion
"Multistep Distillation of Diffusion Models via Moment Matching" presents a robust and effective method for reducing the inference cost of diffusion models. By extending one-step distillation methods to the multistep setting with moment matching, the paper not only achieves state-of-the-art generative performance but also sets the stage for future developments in more efficient and scalable diffusion models.