
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models (2402.19481v4)

Published 29 Feb 2024 in cs.CV

Abstract: Diffusion models have achieved great success in synthesizing high-quality images. However, generating high-resolution images with diffusion models is still challenging due to the enormous computational costs, resulting in a prohibitive latency for interactive applications. In this paper, we propose DistriFusion to tackle this problem by leveraging parallelism across multiple GPUs. Our method splits the model input into multiple patches and assigns each patch to a GPU. However, naively implementing such an algorithm breaks the interaction between patches and loses fidelity, while incorporating such an interaction will incur tremendous communication overhead. To overcome this dilemma, we observe the high similarity between the input from adjacent diffusion steps and propose displaced patch parallelism, which takes advantage of the sequential nature of the diffusion process by reusing the pre-computed feature maps from the previous timestep to provide context for the current step. Therefore, our method supports asynchronous communication, which can be pipelined by computation. Extensive experiments show that our method can be applied to recent Stable Diffusion XL with no quality degradation and achieve up to a 6.1$\times$ speedup on eight NVIDIA A100s compared to one. Our code is publicly available at https://github.com/mit-han-lab/distrifuser.


Summary

  • The paper introduces DistriFusion, a novel multi-GPU parallel inference method that uses displaced patch parallelism to reduce latency in high-resolution diffusion models.
  • The paper demonstrates up to 6.1x speedup on 8 NVIDIA A100 GPUs, achieving efficient synthesis of high-quality images up to 3840×3840 pixels.
  • The paper adds sparse operations and a corrected asynchronous GroupNorm to keep quality loss minimal when stale activations are reused.

Accelerating High-Resolution Diffusion Models with DistriFusion: A Multi-GPU Parallel Inference Approach

Introduction to DistriFusion

Diffusion models that synthesize high-quality images are a major achievement in AI-generated content (AIGC). They underpin applications that generate photorealistic images from textual descriptions. Their main obstacle is the computational cost of generating high-resolution images, which makes latency too high for interactive use. To address this, the paper introduces DistriFusion, a method that reduces the latency of high-resolution image generation by exploiting parallelism across multiple GPUs.

Problem Statement

Generating high-resolution images with diffusion models is computationally expensive, which makes real-time applications impractical. Existing acceleration efforts either reduce the number of sampling steps or optimize neural network inference, and both have limitations. In particular, existing multi-GPU methods either incur significant communication overhead or fail to use GPU resources efficiently, making them unsuitable for accelerating single-sample generation.

DistriFusion Approach

DistriFusion uses distributed parallel inference to address the computational cost of diffusion models. Its core technique is displaced patch parallelism, which rests on the observation that inputs to adjacent denoising steps are highly similar. Each GPU can therefore reuse feature maps computed at the previous step as context for the current one, so communication becomes asynchronous and can be overlapped with computation, reducing latency without compromising image quality.
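
To make the idea concrete, here is a minimal single-process sketch of displaced patch parallelism on a 1-D signal. It is an illustration under stated assumptions, not the authors' implementation: the function names `smooth` and `displaced_step` are hypothetical, the "layer" is a toy 3-tap moving average standing in for a convolution, and the devices are simulated by a loop.

```python
def smooth(x, lo, hi):
    """Toy 'layer': 3-tap moving average over x[lo:hi], clamping at the edges."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0
            for i in range(lo, hi)]


def displaced_step(fresh, stale, num_devices):
    """Each simulated device computes only its own patch.

    It reads fresh values inside its patch and stale (previous-step)
    values everywhere else, so no synchronous exchange is needed.
    """
    n = len(fresh)
    assert n % num_devices == 0, "toy example assumes equal-sized patches"
    size = n // num_devices
    out = []
    for d in range(num_devices):
        lo, hi = d * size, (d + 1) * size
        # Device d's view: its own fresh patch, stale context elsewhere.
        view = stale[:lo] + fresh[lo:hi] + stale[hi:]
        out.extend(smooth(view, lo, hi))
    return out


stale = [float(i) for i in range(8)]   # activations from the previous step
fresh = [v + 0.01 for v in stale]      # current step: only slightly different
exact = smooth(fresh, 0, len(fresh))   # single-device reference
approx = displaced_step(fresh, stale, num_devices=2)
max_err = max(abs(a - b) for a, b in zip(exact, approx))
```

In this toy setting the only positions that differ from the single-device result are those next to a patch boundary, and the error there is bounded by how much the input changed between steps. That is the intuition behind relying on the similarity of adjacent denoising steps.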

Key features of DistriFusion include:

  • Patch Parallelism: By dividing the model input into multiple patches and assigning each patch to a different GPU, DistriFusion allows for parallel operations across devices.
  • Activation Displacement: Slightly outdated ("stale") activations from the previous step supply inter-patch context, so GPUs do not need to exchange activations synchronously within a step.
  • Sparse Operations and Corrected Asynchronous GroupNorm: To further optimize performance, DistriFusion modifies the operation of convolutional, linear, and attention layers to operate selectively on fresh areas of each patch. It also introduces a correction term for stale GroupNorm statistics, mitigating the degradation of image quality due to asynchronous operations.
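
The GroupNorm correction in the last item can be sketched as follows. This is one plausible reading, stated as an assumption: each device takes the stale global moments and shifts them by the change in its own local moments between the previous and current step, falling back to purely local statistics if the corrected variance comes out negative. The exact form used in the paper may differ, and the function name `corrected_moments` is hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)


def corrected_moments(stale_global_mean, stale_global_sq_mean,
                      fresh_local, stale_local):
    """Approximate the current global mean and variance on one device.

    Assumed correction: stale global moment + (fresh local - stale local).
    """
    fresh_mean = _mean(fresh_local)
    fresh_sq = _mean([v * v for v in fresh_local])
    stale_mean = _mean(stale_local)
    stale_sq = _mean([v * v for v in stale_local])

    mean = stale_global_mean + (fresh_mean - stale_mean)
    sq_mean = stale_global_sq_mean + (fresh_sq - stale_sq)
    var = sq_mean - mean * mean
    if var < 0:
        # The approximation broke down; use this patch's own statistics.
        mean = fresh_mean
        var = fresh_sq - fresh_mean * fresh_mean
    return mean, var


# Previous-step activations over the whole group: [1, 2, 3, 4].
stale_global_mean = 2.5
stale_global_sq_mean = 7.5
# This device owns the first two values.
unchanged = corrected_moments(stale_global_mean, stale_global_sq_mean,
                              fresh_local=[1.0, 2.0], stale_local=[1.0, 2.0])
shifted = corrected_moments(stale_global_mean, stale_global_sq_mean,
                            fresh_local=[2.0, 3.0], stale_local=[1.0, 2.0])
```

When the local patch has not changed, the correction vanishes and the stale global statistics are returned unchanged. In the second call the corrected variance would be negative, so the sketch falls back to the local mean and variance.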

Experimental Results

DistriFusion was evaluated using the Stable Diffusion XL model across various settings. The method demonstrated the capability to generate high-quality images with no observable degradation in visual fidelity compared to the original model. Notably, DistriFusion achieved speedups of up to 6.1x on 8 NVIDIA A100 GPUs compared to single-GPU operation. Furthermore, when tested on high-resolution image synthesis (up to 3840×3840 pixels), it maintained considerable speed improvements, showcasing its scalability and efficiency.
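
As a rough way to read the headline number, the short calculation below converts the reported 6.1x speedup on 8 GPUs into a parallel efficiency. It also solves Amdahl's law for the non-parallel fraction that such a speedup would imply. Applying Amdahl's law here is an interpretive assumption, not an analysis from the paper, and the function names are hypothetical.

```python
def parallel_efficiency(speedup, num_devices):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / num_devices


def amdahl_serial_fraction(speedup, num_devices):
    """Solve S = 1 / (s + (1 - s) / N) for the serial fraction s."""
    return (num_devices / speedup - 1.0) / (num_devices - 1.0)


efficiency = parallel_efficiency(6.1, 8)        # about 0.76
serial_part = amdahl_serial_fraction(6.1, 8)    # roughly 0.044
```

Under this simplified model, 6.1x on 8 devices corresponds to about 76% of ideal scaling, consistent with a non-parallel or communication-bound share of a few percent of single-GPU runtime.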

Practical Implications and Future Directions

With its robust performance, DistriFusion presents a significant advancement in the field of AI-generated content, particularly in applications demanding high-resolution image outputs. Its ability to substantially reduce the time required for image synthesis without affecting quality makes it a promising tool for real-time interactive applications, such as advanced image editing and video generation platforms.

Looking ahead, further exploration into methods for reducing communication overhead and enhancing device utilization could yield even greater efficiencies. Additionally, exploring the integration of advanced compilation techniques and expanding support for an even broader range of diffusion models and applications represent promising avenues for future research.

Conclusion

DistriFusion represents a significant step forward in addressing the computational challenges of high-resolution image generation with diffusion models. By harnessing the power of multi-GPU parallelism and introducing specialized optimizations, it opens new possibilities for the creation and interactive manipulation of AI-generated content, pushing the boundaries of what is achievable in the field.
