DMD2: Accelerating Diffusion Models Without Sacrificing Quality
Let's dive into recent advances in diffusion models for image generation, particularly an approach known as DMD2. It speeds up generation by distilling diffusion models, which are prized for their visual quality but notorious for slow sampling. DMD2 makes it possible to produce high-quality images at a fraction of the usual computational cost.
Background on Diffusion Models and Distillation
Before we jump into the details, let's briefly touch on diffusion models and distillation. A diffusion model gradually adds noise to an image and then learns to reverse this process, denoising step-by-step to generate new images. While they produce great results, the step-by-step nature can be slow.
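The forward (noising) half of that process can be sketched in a few lines. This is a minimal illustration with a simple linear noise schedule, not the schedule from any particular diffusion paper; the function names are my own.

```python
import numpy as np

def add_noise(x0, t, T=1000):
    """Forward diffusion step (sketch): blend a clean sample x0 with
    Gaussian noise. alpha_bar follows a linear schedule here, which is
    an illustrative assumption. A trained model would learn to predict
    the noise so the process can be reversed step-by-step."""
    alpha_bar = 1.0 - t / T          # signal fraction remaining at time t
    noise = np.random.randn(*x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    return x_t, noise
```

At t=0 the sample is untouched; at t=T it is pure noise. Generation runs the learned reverse of this chain, one small denoising step at a time, which is exactly why sampling is slow.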
Distillation is a way to compress or streamline this process. It involves training a simpler, "student" model to mimic a more complex, "teacher" model. This distillation process often yields a more efficient model but can lose quality due to imperfect mimicry.
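In its simplest form, the mimicry objective is just a regression of the student's output onto the teacher's. This sketch shows that baseline idea only; DMD2's actual objective matches distributions rather than individual outputs, as discussed below.

```python
import numpy as np

def distillation_loss(student_out, teacher_out):
    """Plain output-mimicry distillation (sketch): mean squared error
    between student and teacher outputs. Minimizing this makes the
    student copy the teacher, including the teacher's mistakes."""
    return float(np.mean((student_out - teacher_out) ** 2))
```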
What's New With DMD2?
DMD2 addresses several issues present in traditional Distribution Matching Distillation (DMD) and other distillation methods. Here are the primary innovations:
1. Removing the Regression Loss
DMD originally required a regression loss to stabilize training, which meant constructing a large dataset of noise-image pairs by running the teacher's full sampler ahead of time. This is computationally heavy and limits scalability. DMD2 eliminates the requirement, making the training process more efficient and flexible.
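To make the cost concrete, here is a sketch of the dataset-building step that DMD2 removes. The helper names and the 50-step default are illustrative assumptions, but the structure is the expensive part: every pair requires a full multi-step teacher sampling run.

```python
import numpy as np

def build_regression_pairs(teacher_sample, n_pairs, dim, steps=50):
    """Original DMD recipe (sketch): run the slow teacher sampler on
    stored noise seeds to build a (noise, image) dataset offline.
    Each pair costs `steps` denoising passes through the teacher."""
    pairs = []
    for _ in range(n_pairs):
        z = np.random.randn(dim)
        x = teacher_sample(z, steps)   # one full, slow sampling run
        pairs.append((z, x))
    return pairs

def regression_loss(student, pairs):
    """Pin the one-step student to the teacher's output per seed."""
    return float(np.mean([np.sum((student(z) - x) ** 2) for z, x in pairs]))
```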
2. Two Time-Scale Update Rule
Simply removing the regression loss introduces instability. To counter this, DMD2 employs a Two Time-Scale Update Rule: the fake score estimator is updated more frequently than the generator, so it can accurately track the generator's evolving output distribution. This keeps training stable.
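The training loop structure is simple to sketch. The 5:1 update ratio below is illustrative, not necessarily the paper's setting, and the update callbacks are stand-ins for real optimizer steps.

```python
def train_dmd2_step(update_fake_score, update_generator, n_score_updates=5):
    """Two time-scale update rule (sketch): refresh the fake-score
    estimator several times per generator update so its score estimate
    stays accurate for the generator's current output distribution."""
    for _ in range(n_score_updates):
        update_fake_score()   # fast time scale
    update_generator()        # slow time scale
```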
3. Integrating a GAN Loss
To improve image quality further, DMD2 integrates a Generative Adversarial Network (GAN) loss into the distillation. A discriminator learns to tell real images from generated ones. Because the discriminator sees real data, it supplies a training signal that is independent of the teacher, letting the student compensate for errors in the teacher's score estimation and even surpass the teacher's quality.
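For concreteness, here is a standard non-saturating GAN loss on discriminator logits. This is a common formulation and an assumption on my part; the paper's exact GAN objective may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gan_losses(d_real_logits, d_fake_logits):
    """Non-saturating GAN loss (sketch). The discriminator pushes real
    logits up and fake logits down; the generator pushes fake logits up.
    The small epsilon guards the log against zero probabilities."""
    eps = 1e-8
    d_loss = -np.mean(np.log(sigmoid(d_real_logits) + eps)
                      + np.log(1.0 - sigmoid(d_fake_logits) + eps))
    g_loss = -np.mean(np.log(sigmoid(d_fake_logits) + eps))
    return float(d_loss), float(g_loss)
```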
4. Multi-Step Generators and Simulation
DMD2 supports multi-step generation, splitting image synthesis into a small number of denoising steps. This lets the method scale to larger, more complex models and higher-resolution images. The authors also address a common issue, the mismatch between training and inference inputs, by simulating the inference process during training. As a result, the model behaves consistently whether it is being trained or generating new images.
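The simulation idea can be sketched as follows: during training, run the student's own few-step sampling chain exactly as it would run at inference, rather than feeding it noised real images. The `generator_step` callback and the timestep list are illustrative stand-ins.

```python
import numpy as np

def simulate_inference(generator_step, z, timesteps):
    """Inference simulation (sketch): unroll the student's own few-step
    sampling chain from pure noise, matching what happens at test time.
    Training on these intermediate states closes the gap between the
    inputs seen during training and those seen during generation."""
    x = z
    for t in timesteps:              # e.g. 4 steps instead of ~1000
        x = generator_step(x, t)
    return x
```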
Numerical Results
The advancements in DMD2 have led to impressive results:
- ImageNet-64x64: DMD2 achieved a Fréchet Inception Distance (FID) score of 1.28, surpassing many existing models and even the original teacher in some configurations.
- COCO 2014 (Zero-Shot): For text-to-image synthesis, DMD2 achieved an FID of 8.35 and demonstrated scalable success with larger models like SDXL, even producing high-quality megapixel images.
Practical and Theoretical Implications
Practical Implications
- Efficiency: By removing the regression loss and its costly paired dataset, DMD2 reduces training cost, making high-quality image generation more accessible.
- Quality: Integrating GAN loss allows the student model not just to mimic but to surpass the teacher, achieving superior image quality and diversity.
- Scalability: Multi-step generators and backward simulation enable handling larger models and producing high-resolution images efficiently.
Theoretical Implications
- Distribution Matching: DMD2's approach solidifies the idea that it's possible to focus purely on distribution matching for high-quality results without needing regression losses tied to the teacher's pathways.
- Stability and Convergence: The two time-scale update rule showcases an effective strategy to ensure stable training in diffusion-distribution matching contexts.
Future Directions
Potential future directions for this research could include:
- Dynamic Guidance: Allowing for variable guidance scales during training, providing users more flexibility during inference.
- Human Feedback Integration: Combining distribution matching with human feedback could further refine the output quality and user alignment.
- Bias and Fairness: Ongoing work to detect and mitigate biases within generated images, ensuring fairness and inclusiveness.
Overall, DMD2 presents a significant step forward in making diffusion models more practical for everyday use, balancing efficiency and quality admirably. Keep an eye on these developments, as they promise to shape the future of image synthesis technology.