DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models (2402.19481v4)
Abstract: Diffusion models have achieved great success in synthesizing high-quality images. However, generating high-resolution images with diffusion models remains challenging due to the enormous computational cost, resulting in prohibitive latency for interactive applications. In this paper, we propose DistriFusion to tackle this problem by leveraging parallelism across multiple GPUs. Our method splits the model input into multiple patches and assigns each patch to a GPU. However, naively implementing such an algorithm breaks the interaction between patches and loses fidelity, while incorporating that interaction incurs tremendous communication overhead. To overcome this dilemma, we observe the high similarity between the inputs of adjacent diffusion steps and propose displaced patch parallelism, which exploits the sequential nature of the diffusion process by reusing the pre-computed feature maps from the previous timestep to provide context for the current step. Our method therefore supports asynchronous communication, which can be pipelined with computation. Extensive experiments show that our method can be applied to the recent Stable Diffusion XL with no quality degradation and achieves up to a 6.1$\times$ speedup on eight NVIDIA A100s compared to a single GPU. Our code is publicly available at https://github.com/mit-han-lab/distrifuser.
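To make the idea of displaced patch parallelism concrete, below is a minimal, illustrative sketch (not the DistriFusion implementation) of how a single layer might reuse stale activations from the previous diffusion step while gathering fresh patches asynchronously. It assumes a `torch.distributed` process group has already been initialized (e.g., via `torchrun`), that the input is split into patches along the height dimension, and that the layer is a stride-1, "same"-padded convolution so input and output patches line up. All class and attribute names here are hypothetical.

```python
# Illustrative sketch of displaced patch parallelism, under the assumptions
# stated above; class and attribute names are hypothetical.
import torch
import torch.distributed as dist


class DisplacedPatchConv2d(torch.nn.Conv2d):
    """Convolution that pads its local patch with activations cached from the
    previous diffusion step instead of waiting for fresh neighbor patches."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stale_full = None   # full-resolution activation from the previous step
        self.pending = None      # handle of the in-flight asynchronous all-gather
        self._parts = None       # per-rank buffers being gathered

    def forward(self, local_x, rank, world_size):
        patch_h = local_x.shape[2]
        if self.stale_full is None:
            # Warm-up step: gather all patches synchronously so every rank has
            # a full activation map to cache.
            parts = [torch.empty_like(local_x) for _ in range(world_size)]
            dist.all_gather(parts, local_x.contiguous())
            self.stale_full = torch.cat(parts, dim=2)
        else:
            # Finish the all-gather launched during the previous step; it had a
            # whole denoising step of computation to hide behind.
            if self.pending is not None:
                self.pending.wait()
                self.stale_full = torch.cat(self._parts, dim=2)
            # Overwrite this rank's slice with the freshly computed patch; the
            # other slices stay one step stale, which is tolerable because the
            # inputs of adjacent diffusion steps are highly similar.
            self.stale_full[:, :, rank * patch_h:(rank + 1) * patch_h] = local_x
        out_full = super().forward(self.stale_full)
        # Launch an asynchronous all-gather of the fresh patch; it overlaps
        # with the rest of the network and is consumed at the next step.
        self._parts = [torch.empty_like(local_x) for _ in range(world_size)]
        self.pending = dist.all_gather(self._parts, local_x.contiguous(),
                                       async_op=True)
        # Return only this rank's slice of the output.
        return out_full[:, :, rank * patch_h:(rank + 1) * patch_h]
```

In this sketch, the all-gather launched at a given layer is only waited on when the same layer runs again at the next diffusion step, so communication can be fully overlapped with the remaining computation of the current step rather than blocking it.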