
xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism (2411.01738v1)

Published 4 Nov 2024 in cs.DC and cs.AI

Abstract: Diffusion models are pivotal for generating high-quality images and videos. Inspired by the success of OpenAI's Sora, the backbone of diffusion models is evolving from U-Net to Transformer, known as Diffusion Transformers (DiTs). However, generating high-quality content necessitates longer sequence lengths, exponentially increasing the computation required for the attention mechanism, and escalating DiTs inference latency. Parallel inference is essential for real-time DiTs deployments, but relying on a single parallel method is impractical due to poor scalability at large scales. This paper introduces xDiT, a comprehensive parallel inference engine for DiTs. After thoroughly investigating existing DiTs parallel approaches, xDiT chooses Sequence Parallel (SP) and PipeFusion, a novel Patch-level Pipeline Parallel method, as intra-image parallel strategies, alongside CFG parallel for inter-image parallelism. xDiT can flexibly combine these parallel approaches in a hybrid manner, offering a robust and scalable solution. Experimental results on two 8xL40 GPUs (PCIe) nodes interconnected by Ethernet and an 8xA100 (NVLink) node showcase xDiT's exceptional scalability across five state-of-the-art DiTs. Notably, we are the first to demonstrate DiTs scalability on Ethernet-connected GPU clusters. xDiT is available at https://github.com/xdit-project/xDiT.


Summary

  • The paper introduces xDiT, a hybrid parallel inference engine that addresses the quadratic growth of attention computation in Diffusion Transformers.
  • xDiT integrates Sequence Parallelism, PipeFusion, CFG parallelism, and patch-level VAE strategies to enhance scalability and reduce communication overhead.
  • Evaluations on diverse GPU clusters demonstrate xDiT’s ability to lower latency and efficiently handle high-resolution tasks, enabling real-time diffusion model applications.

An In-Depth Analysis of xDiT: A Scalable Inference Engine for Diffusion Transformers

The rapid evolution of diffusion models, particularly with the shift toward employing Diffusion Transformers (DiTs), has introduced complex challenges in computational scalability. The paper "xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism" provides a comprehensive solution designed to address these challenges. This paper delineates the development and evaluation of xDiT, a novel parallel inference engine explicitly tailored for DiTs.

Summary and Methodological Innovations

The authors identify the need for parallel inference when working with DiTs, primarily due to the quadratic scaling of the attention mechanism's computational cost with sequence length. The xDiT system addresses this by combining several parallelism strategies to improve scalability and computational efficiency.
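To make the scaling concrete: self-attention cost grows quadratically with sequence length, so doubling the spatial resolution of a latent quadruples the token count and multiplies attention FLOPs by sixteen. The sketch below uses an illustrative hidden dimension and token counts, not figures from the paper:

```python
def attention_flops(seq_len: int, hidden_dim: int) -> int:
    """Approximate FLOPs for one self-attention layer:
    the QK^T matmul and the attention-weighted V matmul each
    cost about 2 * seq_len^2 * hidden_dim."""
    return 4 * seq_len**2 * hidden_dim

base = attention_flops(seq_len=4096, hidden_dim=1152)    # e.g. a 64x64 latent grid
hires = attention_flops(seq_len=16384, hidden_dim=1152)  # 2x resolution -> 4x tokens
print(hires / base)  # 16.0 -- quadratic in sequence length
```

This quadratic blow-up, repeated over many denoising steps, is why a single-GPU deployment quickly becomes latency-bound and why the paper turns to parallel inference.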

  1. Diffusion Model Transition and Challenges: Diffusion models have traditionally employed U-Net architectures, but the transition to DiTs has been driven by DiTs' superior capacity and scalability. However, this transition requires handling much longer sequence lengths, causing attention computation and inference latency to grow rapidly. Single-method parallel approaches scale poorly under these demands.
  2. Parallel Strategies in xDiT: The system leverages a hybrid approach by integrating different parallel strategies:
    • Sequence Parallel (SP) and PipeFusion: For intra-image parallelism, the paper adapts SP to DiT blocks and introduces PipeFusion, a patch-level pipeline parallelism. PipeFusion improves on alternative methods in communication and memory efficiency by exploiting the temporal redundancy between inputs at adjacent diffusion steps.
    • CFG Parallelism: This addresses inter-image parallelism by separating the computation paths for the conditional and unconditional latents in classifier-free guidance, synchronizing them with lightweight AllGather operations.
    • Patch-Level VAE Parallelism: Deployed to mitigate GPU memory limits when generating high-resolution images, it splits VAE decoding into patches so the module's activation memory stays within a single GPU's capacity.
  3. Hybrid Parallel Approach: The innovation in xDiT lies in its flexible hybridization of these parallel approaches, which proves critical in heterogeneous network environments across different hardware configurations. This allows for the efficient distribution of computational workloads, adapting dynamically to network topologies and hardware capabilities.
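The hybrid composition above can be pictured as factoring the total GPU count into per-strategy parallel degrees. The following is a minimal sketch of that bookkeeping; the class and method names are illustrative, not xDiT's actual API:

```python
from dataclasses import dataclass

@dataclass
class HybridParallelConfig:
    """Factor the GPU count into per-strategy parallel degrees.
    CFG parallelism splits the conditional/unconditional paths,
    sequence parallelism shards tokens within an image, and
    PipeFusion pipelines transformer layers over image patches."""
    cfg_degree: int         # inter-image: typically 1 or 2
    sp_degree: int          # intra-image: sequence (token) shards
    pipefusion_degree: int  # intra-image: pipeline stages

    @property
    def world_size(self) -> int:
        return self.cfg_degree * self.sp_degree * self.pipefusion_degree

    def validate(self, num_gpus: int) -> None:
        if self.world_size != num_gpus:
            raise ValueError(
                f"degrees multiply to {self.world_size}, "
                f"but {num_gpus} GPUs are available")

# Example: 16 GPUs as 2-way CFG x 2-way SP x 4-stage PipeFusion
cfg = HybridParallelConfig(cfg_degree=2, sp_degree=2, pipefusion_degree=4)
cfg.validate(16)
print(cfg.world_size)  # 16
```

The practical value of such a factorization is that communication-heavy strategies (SP) can be confined to fast intra-node links while communication-light ones (PipeFusion, CFG parallel) span slower inter-node Ethernet.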

Performance and Scalability

The xDiT system has been meticulously evaluated on diverse GPU cluster configurations, demonstrating exceptional scalability and efficiency across different image and video generation DiTs. The authors highlight key findings:

  • The combination of PipeFusion and SP yields lower latency and enhanced scalability, particularly notable in environments constrained by communication overhead, such as multi-node GPU clusters connected via Ethernet.
  • Hybrid parallelism, incorporating CFG parallel and PipeFusion, achieves significant performance gains, especially in contexts with diverse model architectures and varying input lengths.
  • The implementation of patch-level parallelism in the VAE module circumvents out-of-memory (OOM) failures, allowing the system to decode very high image resolutions and contributing to its robustness.
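Patch-level VAE decoding can be sketched as splitting the latent into strips, decoding each independently, and concatenating the results. The toy decoder below is a stand-in (a real VAE decoder is a convolutional network and must handle overlap at patch borders, which this sketch omits):

```python
import numpy as np

def decode_patch(latent_patch: np.ndarray, scale: int = 8) -> np.ndarray:
    """Stand-in for a VAE decoder: upsample each latent pixel to a
    scale x scale block. A real decoder is a learned conv network."""
    return latent_patch.repeat(scale, axis=0).repeat(scale, axis=1)

def patched_decode(latent: np.ndarray, num_patches: int) -> np.ndarray:
    """Decode the latent in horizontal strips so each device (or each
    sequential step) materializes only a fraction of the activations."""
    strips = np.array_split(latent, num_patches, axis=0)
    return np.concatenate([decode_patch(s) for s in strips], axis=0)

latent = np.random.rand(128, 128)   # 128x128 latent -> 1024x1024 image
image = patched_decode(latent, num_patches=4)
print(image.shape)  # (1024, 1024)
```

Each strip's decode peaks at roughly 1/num_patches of the full activation memory, which is what lets the VAE stage avoid OOM at resolutions the monolithic decode cannot reach.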

Implications and Future Prospects

The development of xDiT holds substantial implications for deploying DiTs in real-time applications, ensuring scalability and efficiency across diverse hardware settings. Practically, it provides an adaptable solution for researchers and practitioners working on high-resolution image and video generation or similarly demanding workloads.

From a theoretical standpoint, xDiT exemplifies the potential of marrying various parallel methodologies to harness the benefits and flexibility required in large-scale AI deployments. This hybrid approach paves the way for future research exploring dynamic parallelism, potentially integrating more advanced scheduling techniques to further minimize latency and optimize resource allocation across heterogeneous systems.

Conclusion

In summary, the "xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism" paper offers a substantial systems contribution to efficient diffusion-model inference. The insights and results presented affirm the viability of hybrid parallel paradigms in meeting the demands of next-generation diffusion transformers and lay a foundation for future work on massive parallelism and system optimization in the AI domain.
