Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion (2410.19324v1)

Published 25 Oct 2024 in cs.CV, cs.LG, and stat.ML

Abstract: Latent diffusion models have become the popular choice for scaling up diffusion models for high resolution image synthesis. Compared to pixel-space models that are trained end-to-end, latent models are perceived to be more efficient and to produce higher image quality at high resolution. Here we challenge these notions, and show that pixel-space models can in fact be very competitive to latent approaches both in quality and efficiency, achieving 1.5 FID on ImageNet512 and new SOTA results on ImageNet128 and ImageNet256. We present a simple recipe for scaling end-to-end pixel-space diffusion models to high resolutions. 1: Use the sigmoid loss (Kingma & Gao, 2023) with our prescribed hyper-parameters. 2: Use our simplified memory-efficient architecture with fewer skip-connections. 3: Scale the model to favor processing the image at high resolution with fewer parameters, rather than using more parameters but at a lower resolution. When combining these three steps with recently proposed tricks like guidance intervals, we obtain a family of pixel-space diffusion models we call Simple Diffusion v2 (SiD2).

Overview of "Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with Pixel-Space Diffusion"

This paper challenges common perceptions about the efficiency and quality of latent diffusion models for high-resolution image synthesis by presenting an improved recipe for training end-to-end pixel-space diffusion models. The resulting models substantially improve on prior pixel-space approaches, reaching 1.5 FID on ImageNet512 and setting new state-of-the-art results on ImageNet128 and ImageNet256.

Key Contributions

The authors present three primary innovations:

  1. Sigmoid Loss Function with Tuned Hyperparameters: By revisiting and refining the sigmoid loss of Kingma & Gao (2023), the authors show that pixel-space models can outperform EDM-monotonic weightings, provided the shift of the sigmoid weighting is matched to the resolution of the images being processed (see the weighting sketch after this list).
  2. FLOP-Heavy Model Scaling: Rather than adding parameters or processing at a lower internal resolution, the model is scaled by reducing the patch size of the input; halving the patch size quadruples the number of tokens the transformer processes. This makes the model computation-heavy rather than parameter-heavy, which improves regularization and allows efficient fine-tuning from lower-resolution checkpoints without adding parameters.
  3. Simplified Residual U-ViT Architecture: Blockwise skip-connections are removed and replaced with a single residual connection around each downsampling stage, which simplifies the architecture and reduces memory consumption without sacrificing performance. This is particularly beneficial in larger models, where skip-connections are less crucial (see the architecture sketch below).
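
To make the first ingredient concrete, the snippet below sketches a sigmoid weighting over the log-SNR together with a resolution-dependent shift in the spirit of simple diffusion. It is an illustrative sketch only: the bias value, the base resolution of 64, and the epsilon-loss form are placeholder assumptions, not the paper's prescribed hyper-parameters.

```python
import jax.numpy as jnp
from jax.nn import sigmoid


def sigmoid_loss_weight(log_snr, bias=-3.0):
    """Sigmoid weighting over the log-SNR (Kingma & Gao, 2023):
    w(lambda) = sigmoid(bias - lambda). The bias here is a placeholder,
    not the paper's prescribed value."""
    return sigmoid(bias - log_snr)


def shifted_log_snr(log_snr, resolution, base_resolution=64):
    """Resolution-dependent shift of the log-SNR, in the spirit of the
    simple-diffusion schedule shift: larger images are weighted as if
    they carried proportionally more noise. The base resolution of 64
    is an assumption for illustration."""
    return log_snr + 2.0 * jnp.log(base_resolution / resolution)


def weighted_eps_loss(eps_pred, eps, log_snr, resolution):
    """Per-batch weighted epsilon-prediction loss (a sketch, not the
    exact SiD2 training objective)."""
    w = sigmoid_loss_weight(shifted_log_snr(log_snr, resolution))
    per_example_mse = jnp.mean((eps_pred - eps) ** 2,
                               axis=tuple(range(1, eps.ndim)))
    return jnp.mean(w * per_example_mse)
```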
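
The third ingredient can also be illustrated: instead of concatenating every encoder block's activation into the decoder (U-Net style), each resolution level keeps only one additive residual connection around its downsample/inner/upsample path. The Flax sketch below uses hypothetical module names and a toy transformer block; it is a minimal sketch of the idea, not the authors' implementation.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


class ToyTransformerBlock(nn.Module):
    """Stand-in for a full attention + MLP block."""
    dim: int

    @nn.compact
    def __call__(self, x):
        h = nn.LayerNorm()(x)
        h = nn.Dense(self.dim)(nn.gelu(nn.Dense(4 * self.dim)(h)))
        return x + h


class ResidualLevel(nn.Module):
    """One resolution level of a residual U-ViT: downsample, run an inner
    stage, upsample, and add a single residual connection around the whole
    level, replacing blockwise U-Net skip connections."""
    inner_dim: int
    num_inner_blocks: int = 2

    @nn.compact
    def __call__(self, x):                     # x: (B, H, W, C)
        skip = x                               # the one residual kept per level
        h = nn.Conv(self.inner_dim, (2, 2), strides=(2, 2))(x)        # 2x downsample
        for _ in range(self.num_inner_blocks):
            h = ToyTransformerBlock(self.inner_dim)(h)
        h = nn.ConvTranspose(x.shape[-1], (2, 2), strides=(2, 2))(h)  # 2x upsample
        return skip + h                        # single additive residual


# Minimal shape check
x = jnp.zeros((1, 32, 32, 64))
params = ResidualLevel(inner_dim=128).init(jax.random.PRNGKey(0), x)
y = ResidualLevel(inner_dim=128).apply(params, x)
print(y.shape)  # (1, 32, 32, 64)
```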

Results and Comparisons

On ImageNet128 and ImageNet256, SiD2 achieves state-of-the-art FID scores, while on ImageNet512 it remains competitive with the best latent diffusion models such as EDM2. At the same time, SiD2 significantly reduces training compute compared to its predecessors while maintaining high image-generation quality.

Implications and Future Directions

The implications of this work are twofold. Practically, it demonstrates that end-to-end pixel-space diffusion models can rival latent models in terms of quality and efficiency. This could alleviate the need for separate autoencoder training in many applications, facilitating a more streamlined approach to diffusion model training.

Theoretically, the work highlights that significant gains can be achieved by re-examining and simplifying existing model architectures and loss functions. One potential future direction is to further explore the interaction between architecture choices and loss functions, in search of ways to reduce computational overhead while maintaining or improving sample quality at high resolution.

Overall, this research points towards a promising avenue in the pursuit of more efficient and high-quality diffusion models without relying heavily on latent variable architectures.

References (42)
  1. ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. CoRR, abs/2211.01324, 2022.
  2. All are worth words: A vit backbone for diffusion models. In CVPR, 2023.
  3. Ting Chen. On the importance of noise scheduling for diffusion models. arXiv, 2023.
  4. Diffusion models beat gans on image synthesis. CoRR, abs/2105.05233, 2021.
  5. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
  6. f-dm: A multi-stage diffusion model via progressive signal transformation. CoRR, abs/2210.04955, 2022.
  7. Matryoshka diffusion models. CoRR, abs/2310.15111, 2023.
  8. Diffit: Diffusion vision transformers for image generation. CoRR, abs/2312.02139, 2023. doi: 10.48550/ARXIV.2312.02139. URL https://doi.org/10.48550/arXiv.2312.02139.
  9. Denoising diffusion probabilistic models. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS, 2020.
  10. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23:47:1–47:33, 2022.
  11. simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning, ICML, volume 202 of Proceedings of Machine Learning Research, pp. 13213–13232. PMLR, 2023.
  12. Scalelong: Towards more stable training of diffusion model via scaling network long skip connection. In NeurIPS, 2023.
  13. Scalable adaptive computation for iterative generation. CoRR, abs/2212.11972, 2022.
  14. Scedit: Efficient and controllable image diffusion generation via skip connection editing. Technical Report 2312.11392, arXiv, 2023.
  15. Distribution augmentation for generative modeling. In Proceedings of the 37th International Conference on Machine Learning, ICML, 2020.
  16. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS, 2022.
  17. Analyzing and improving the training dynamics of diffusion models. CoRR, abs/2312.02696, 2023.
  18. Guiding a diffusion model with a bad version of itself. CoRR, abs/2406.02507, 2024.
  19. Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=ymjI8feDTD.
  20. Pagoda: Progressive growing of a one-step generator from a low-resolution diffusion teacher. CoRR, abs/2405.14822, 2024b.
  21. Understanding the diffusion objective as a weighted integral of elbos. CoRR, abs/2303.00848, 2023.
  22. Variational diffusion models. CoRR, abs/2107.00630, 2021.
  23. Applying guidance in a limited interval improves sample and distribution quality in diffusion models. CoRR, abs/2404.07724, 2024.
  24. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda. OpenReview.net, 2023.
  25. Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
  26. The surprising effectiveness of skip-tuning in diffusion sampling. In Proceedings of the 41st International Conference on Machine Learning, pp.  34053–34074, 2024. URL https://proceedings.mlr.press/v235/ma24r.html.
  27. Scalable diffusion models with transformers. CoRR, abs/2212.09748, 2022.
  28. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pp. 10674–10685. IEEE, 2022.
  29. U-net: Convolutional networks for biomedical image segmentation. Technical report, ArXiV, 2015.
  30. Photorealistic text-to-image diffusion models with deep language understanding. CoRR, abs/2205.11487, 2022.
  31. Progressive distillation for fast sampling of diffusion models. In The Tenth International Conference on Learning Representations, ICLR. OpenReview.net, 2022.
  32. Multistep distillation of diffusion models via moment matching. arXiv preprint arXiv:2406.04103, 2024.
  33. Stylegan-xl: Scaling stylegan to large diverse datasets. In Munkhtsetseg Nandigjav, Niloy J. Mitra, and Aaron Hertzmann (eds.), SIGGRAPH ’22: Special Interest Group on Computer Graphics and Interactive Techniques Conference, pp.  49:1–49:10. ACM, 2022.
  34. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, 2016.
  35. Score-based generative modeling through stochastic differential equations. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
  36. Tackling the generative learning trilemma with denoising diffusion gans. In International Conference on Learning Representations, 2021.
  37. Ufogen: You forward once large scale text-to-image generation via diffusion gans. arXiv preprint arXiv:2311.09257, 2023.
  38. Disco-diff: Enhancing continuous diffusion models with discrete latents. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024.
  39. One-step diffusion with distribution matching distillation. arXiv preprint arXiv:2311.18828, 2023.
  40. Improved distribution matching distillation for fast image synthesis. CoRR, abs/2405.14867, 2024.
  41. Scaling autoregressive models for content-rich text-to-image generation. CoRR, abs/2206.10789, 2022.
  42. Language model beats diffusion - tokenizer is key to visual generation. In The Twelfth International Conference on Learning Representations, ICLR. OpenReview.net, 2024.
Authors (6)
  1. Emiel Hoogeboom (26 papers)
  2. Thomas Mensink (30 papers)
  3. Jonathan Heek (13 papers)
  4. Kay Lamerigts (2 papers)
  5. Ruiqi Gao (44 papers)
  6. Tim Salimans (46 papers)