
Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models (2304.12526v2)

Published 25 Apr 2023 in cs.CV and cs.LG

Abstract: Diffusion models are powerful, but they require a lot of time and data to train. We propose Patch Diffusion, a generic patch-wise training framework, to significantly reduce the training time costs while improving data efficiency, which thus helps democratize diffusion model training to broader users. At the core of our innovations is a new conditional score function at the patch level, where the patch location in the original image is included as additional coordinate channels, while the patch size is randomized and diversified throughout training to encode the cross-region dependency at multiple scales. Sampling with our method is as easy as in the original diffusion model. Through Patch Diffusion, we could achieve $\mathbf{\ge 2\times}$ faster training, while maintaining comparable or better generation quality. Patch Diffusion meanwhile improves the performance of diffusion models trained on relatively small datasets, $e.g.$, as few as 5,000 images to train from scratch. We achieve outstanding FID scores in line with state-of-the-art benchmarks: 1.77 on CelebA-64$\times$64, 1.93 on AFHQv2-Wild-64$\times$64, and 2.72 on ImageNet-256$\times$256. We share our code and pre-trained models at https://github.com/Zhendong-Wang/Patch-Diffusion.

Authors (8)
  1. Zhendong Wang
  2. Yifan Jiang
  3. Huangjie Zheng
  4. Peihao Wang
  5. Pengcheng He
  6. Zhangyang Wang
  7. Weizhu Chen
  8. Mingyuan Zhou

Summary

  • The paper introduces a conditional score function at the patch level, significantly accelerating training and boosting data efficiency.
  • It uses randomized and progressive patch sizes with pixel-level coordinates to capture multi-scale dependencies during training.
  • Experimental results demonstrate over twofold faster training and competitive FID scores on datasets such as CelebA, FFHQ, and AFHQv2.

Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models

In the paper "Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models," the authors introduce Patch Diffusion, a generic patch-wise training framework for diffusion models, a class of generative models that are powerful but costly in both compute and data. The central innovation is a patch-based training approach that reduces the computational burden and improves data efficiency, helping to democratize the development of diffusion models across the broader research community.

Core Innovations

The key advance is a conditional score function that operates at the patch level. The patch's location in the original image is supplied as additional coordinate channels, while the patch size is randomized and diversified during training to encode cross-region dependencies at multiple scales. Training on patches rather than full images cuts computational cost substantially: the authors report more than twice the training speed while maintaining, and sometimes improving, generation quality. Models trained with Patch Diffusion also perform well on small datasets, achieving competitive results from scratch with as few as 5,000 images.
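To make the coordinate-channel idea concrete, the sketch below crops a random patch and appends two channels encoding its position in the full image. The normalization to [-1, 1] and the exact channel layout are illustrative assumptions, not the paper's precise implementation.

```python
import numpy as np

def extract_patch_with_coords(image, patch_size, rng):
    """Crop a random patch from `image` (H, W, C) and append two
    coordinate channels giving its location in the original image.
    Normalization range and layout are assumptions for illustration."""
    H, W, C = image.shape
    top = rng.integers(0, H - patch_size + 1)
    left = rng.integers(0, W - patch_size + 1)
    patch = image[top:top + patch_size, left:left + patch_size, :]
    # Pixel-level coordinates of this patch within the full image,
    # mapped to [-1, 1] over the full image extent.
    ys = np.linspace(-1.0, 1.0, H)[top:top + patch_size]
    xs = np.linspace(-1.0, 1.0, W)[left:left + patch_size]
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.concatenate([patch, yy[..., None], xx[..., None]], axis=-1)

rng = np.random.default_rng(0)
img = rng.standard_normal((64, 64, 3))
out = extract_patch_with_coords(img, 16, rng)
assert out.shape == (16, 16, 5)  # 3 color channels + 2 coordinate channels
```

Because the coordinate channels carry absolute position, the score network can learn where a patch sits in the image even though it never sees the whole image at once.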

Methodology

Patch Diffusion performs conditional score matching on image patches, with each patch's location and size serving as conditioning information for the model. Pixel-level coordinate channels support this patch-level score matching, and sampling remains as simple as in the original diffusion model: the coordinate channels for the full image are concatenated with sampled noise, and the reverse diffusion chain is traversed as usual.
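A minimal sketch of that sampling path is shown below: a full-image coordinate grid is concatenated with the noise input at every reverse step. The `score_fn` signature and the simplified Euler-style update are hypothetical stand-ins for the trained network and the actual sampler, which the paper inherits from standard diffusion models.

```python
import numpy as np

def full_coord_grid(H, W):
    """Coordinate channels spanning the full image in [-1, 1]; at sampling
    time these are concatenated with noise so whole images are generated."""
    yy, xx = np.meshgrid(np.linspace(-1.0, 1.0, H),
                         np.linspace(-1.0, 1.0, W), indexing="ij")
    return np.stack([yy, xx], axis=-1)

def sample(score_fn, H, W, C, n_steps=50, rng=None):
    """Toy reverse-diffusion loop. `score_fn(inp, t)` stands in for the
    trained patch-conditional score network (hypothetical signature)."""
    rng = rng or np.random.default_rng()
    coords = full_coord_grid(H, W)
    x = rng.standard_normal((H, W, C))  # start from pure noise
    for t in np.linspace(1.0, 1e-3, n_steps):
        inp = np.concatenate([x, coords], axis=-1)  # image + coord channels
        x = x + (t / n_steps) * score_fn(inp, t)    # simplified Euler update
    return x

# With a dummy zero score, the loop just returns the initial noise shape.
x = sample(lambda inp, t: np.zeros(inp.shape[:-1] + (3,)), 32, 32, 3, n_steps=10)
assert x.shape == (32, 32, 3)
```

The point of the sketch is only the input layout: the network always receives image channels plus coordinate channels, so no patch stitching is needed at sampling time.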

During training, patch sizes follow either a stochastic schedule, which randomly samples the patch size at each step, or a progressive schedule, which moves from small patches to larger ones and finally to full-size images. Mixing diverse patch sizes with occasional full-image training lets the model encode global structure while retaining the efficiency benefits of patch-wise training.
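The two schedules can be sketched as follows. The candidate sizes, mixing probability, and stage boundaries here are placeholders for illustration, not the paper's exact hyperparameters.

```python
import numpy as np

def stochastic_patch_size(full_size, rng, p_full=0.5, sizes=(16, 32)):
    """Randomly mix smaller patches with occasional full images.
    `p_full` and `sizes` are illustrative placeholders."""
    if rng.random() < p_full:
        return full_size
    return int(rng.choice(sizes))

def progressive_patch_size(step, total_steps, full_size, sizes=(16, 32)):
    """Move from small patches to full images as training proceeds."""
    stages = list(sizes) + [full_size]
    idx = min(int(step / total_steps * len(stages)), len(stages) - 1)
    return stages[idx]

rng = np.random.default_rng(0)
assert stochastic_patch_size(64, rng) in {16, 32, 64}
assert progressive_patch_size(0, 100, 64) == 16    # early: small patches
assert progressive_patch_size(99, 100, 64) == 64   # late: full images
```

Either way, the schedule trades the cheap gradient steps of small patches against the full-image steps needed to anchor global structure.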

Results

The paper reports strong experimental results across several datasets, showing significant gains in training speed alongside competitive generation quality. On CelebA and FFHQ, Patch Diffusion attains FID scores close to, and sometimes better than, state-of-the-art benchmarks despite substantially reduced training time. When integrated into established models such as ControlNet, it also noticeably improves fine-tuning efficiency.

Additionally, the framework has demonstrated superior performance on limited-sized datasets like AFHQv2, illustrating improved data efficiency—a vital attribute for advancing diffusion models to smaller data environments. The technique also shows potential for image extrapolation tasks, where trained models effectively extrapolate image boundaries and maintain coherence across expanded coordinate manifolds.

Implications and Future Work

The implementation of Patch Diffusion heralds promising implications for both practical and theoretical progress in artificial intelligence. Practically, it enables more researchers to leverage diffusion models without prohibitive resource expenses, creating pathways for broader innovation within Generative AI. Theoretically, it opens up new avenues for research into patch-wise training strategies and the convergence of score functions under data augmentation methodologies.

Future work could explore enhancements to the coordinate systems, such as refined positional embeddings for better integration, as well as theoretical explorations on the convergence of patch-wise score matching in general cases. Given its efficient reduction in resource use and improved generation quality, Patch Diffusion sets a compelling precedent for employing coordinate-conditioned score matching and stochastic scheduling in generative modeling frameworks.
