Evaluating the Superiority of Diffusion Models Over GANs for Image Synthesis
This paper investigates the application of diffusion models to image synthesis, arguing that they now surpass the Generative Adversarial Networks (GANs) that have held the state of the art in this field for several years. The authors support this claim with extensive ablations and benchmark results, highlighting the improvements diffusion models offer on standard sample-quality metrics and in overall image quality.
Key Findings and Methodology
Diffusion Models’ Improved Sample Quality
Sample Quality on ImageNet: The authors report a Fréchet Inception Distance (FID) of 2.97 on ImageNet 128×128, 4.59 on ImageNet 256×256, and 7.72 on ImageNet 512×512, surpassing the best GAN-based models. These class-conditional results combine classifier guidance with a model architecture refined through a series of ablations, which included increasing the number of attention heads and employing the BigGAN residual block for upsampling and downsampling activations. These refinements led to significant improvements in FID, indicative of higher-quality images and better distribution coverage.
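The ablation axes just described can be summarized as a handful of architecture hyperparameters. The snippet below is a hypothetical configuration illustrating those choices; the key names are chosen for readability and are not the exact flags of the authors' released code.

```python
# Hypothetical architecture configuration mirroring the ablation axes discussed
# above; key names are illustrative, not the released code's exact flags.
adm_like_config = dict(
    image_size=128,
    base_channels=256,                  # wider base channel count
    num_res_blocks=2,                   # residual blocks per resolution
    attention_resolutions=(32, 16, 8),  # attend at multiple feature-map sizes
    num_attention_heads=4,              # more heads than a single-head baseline
    bigGAN_res_updown=True,             # BigGAN residual blocks for up/downsampling
    adaptive_group_norm=True,           # AdaGN conditioning (discussed below)
)
```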
Conditional Image Synthesis: Classifier guidance, which uses gradients from a classifier to direct the sampling process of the diffusion model, further bolsters image quality. It provides a trade-off between image diversity and fidelity: a larger guidance scale steers the model toward higher-fidelity samples at some cost to coverage of the data distribution. Combining classifier guidance with upsampling diffusion models yields the best reported results, an FID of 3.94 on ImageNet 256×256 and 3.85 on ImageNet 512×512.
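Concretely, if the unguided reverse step is approximated as a Gaussian with mean $\mu$ and covariance $\Sigma$, classifier guidance samples from the same Gaussian with a shifted mean,

$$\hat{\mu} = \mu + s \, \Sigma \, \nabla_{x_t} \log p_\phi(y \mid x_t),$$

where $p_\phi(y \mid x_t)$ is a classifier trained on noised images and $s$ is the guidance scale: larger $s$ yields higher-fidelity but less diverse samples.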
Methodological Innovations
Classifier Guidance: The use of classifier guidance for conditional image synthesis is a notable innovation of this work. A classifier trained on noisy images guides the diffusion model by shifting the mean of each sampling step along the gradient of the classifier's log-probability for the target class, enabling the model to emphasize desired class attributes without sacrificing sample quality. This technique leverages a separately trained classifier to impose additional control over the synthesis process, thereby enhancing the fidelity of the generated images.
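A minimal sketch of one guided ancestral sampling step in PyTorch is given below. The interfaces are assumptions for illustration, not the authors' released code: `diffusion_mean_fn` is taken to return the model's Gaussian mean and variance for the reverse step, and `classifier` is taken to accept the noisy image and timestep and return class logits.

```python
import torch

def classifier_guided_step(diffusion_mean_fn, classifier, x_t, t, y, guidance_scale):
    """One reverse-diffusion sampling step with classifier guidance (sketch)."""
    # Gaussian parameters of the model's reverse step p(x_{t-1} | x_t); assumed interface.
    mean, variance = diffusion_mean_fn(x_t, t)

    # Gradient of log p(y | x_t) with respect to the noisy input x_t.
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        logits = classifier(x_in, t)
        log_probs = torch.log_softmax(logits, dim=-1)
        selected = log_probs[torch.arange(len(y)), y].sum()
        grad = torch.autograd.grad(selected, x_in)[0]

    # Shift the mean toward the target class; larger scales trade diversity for fidelity.
    guided_mean = mean + guidance_scale * variance * grad

    # Standard ancestral sampling step (no noise is added at t == 0).
    noise = torch.randn_like(x_t)
    nonzero_mask = (t != 0).float().view(-1, *([1] * (x_t.dim() - 1)))
    return guided_mean + nonzero_mask * variance.sqrt() * noise
```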
Architecture Improvements: Architectural innovations include adaptive group normalization (AdaGN), which incorporates the timestep and class embeddings into each residual block after group normalization. In the paper's ablations, this conditioning outperforms simply adding the embedding projections to the activations. Additionally, the paper explores varying the depth and width of the models, increasing the number of attention heads, and employing multi-resolution attention, all of which contribute to the gains of diffusion models over GANs.
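As a sketch of the AdaGN idea (a PyTorch-style module; names and signatures are illustrative rather than the paper's reference implementation):

```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """Adaptive group normalization: GroupNorm followed by a per-channel scale
    and shift predicted from the timestep/class embedding,
    AdaGN(h, y) = y_s * GroupNorm(h) + y_b."""

    def __init__(self, channels, emb_dim, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels)
        # Projects the embedding to a per-channel scale (y_s) and shift (y_b).
        self.proj = nn.Linear(emb_dim, 2 * channels)

    def forward(self, h, emb):
        # h: (B, C, H, W) activations; emb: (B, emb_dim) timestep/class embedding.
        scale, shift = self.proj(emb).chunk(2, dim=1)
        return self.norm(h) * scale[:, :, None, None] + shift[:, :, None, None]
```

In a residual block, such a layer would replace the plain normalization, so that timestep and class information modulates every block of the network.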
Numerical Results and Implications
The numerical results presented across datasets and resolutions support the claim that diffusion models now outperform GANs. On the challenging ImageNet benchmarks, the diffusion model surpasses BigGAN-deep in FID and matches it even with as few as 25 forward passes per sample, all while maintaining better coverage of the distribution.
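The few-step results mentioned above rely on evaluating the reverse process on a reduced subsequence of the training timesteps. Below is a minimal sketch of such a schedule, assuming the evenly spaced respacing strategy used in this line of work; the full method also recomputes the noise schedule for the chosen subsequence.

```python
import numpy as np

def respaced_timesteps(num_train_steps=1000, num_sample_steps=25):
    """Evenly spaced subset of the training timesteps, ordered from most to
    least noisy, so sampling uses only num_sample_steps forward passes."""
    steps = np.linspace(0, num_train_steps - 1, num_sample_steps)
    return steps.round().astype(int)[::-1]
```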
These findings imply that diffusion models not only offer a powerful alternative to GANs but also bring practical benefits. The improvements in FID and recall show that diffusion models achieve both higher fidelity and greater diversity, making them suitable for applications that demand high-quality image generation with broad distribution coverage.
Future Speculations
The implications of this research suggest that future work could refine the sampling speed of diffusion models, potentially closing the sampling-speed gap with GANs. Another promising direction is extending classifier guidance to unlabeled or semi-supervised settings, for example by producing synthetic labels through clustering. Diffusion models could also be combined with other generative frameworks, exploring synergies between paradigms to address complex generative tasks.
Conclusion
The paper shows that diffusion models, through architectural refinements and classifier guidance, outperform GANs at generating high-fidelity images with better distribution coverage. This work shifts the state of the art in image synthesis, lays the groundwork for future research on diffusion models for diverse, high-quality image generation, and prompts a reconsideration of the utility and scalability of traditional GAN-based approaches.