SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules (2407.02031v2)

Published 2 Jul 2024 in cs.DC, cs.AI, and cs.LG

Abstract: Text-to-image (T2I) generation using diffusion models has become a blockbuster service in today's AI cloud. A production T2I service typically involves a serving workflow where a base diffusion model is augmented with various "add-on" modules, notably ControlNet and LoRA, to enhance image generation control. Compared to serving the base model alone, these add-on modules introduce significant loading and computational overhead, resulting in increased latency. In this paper, we present SwiftDiffusion, a system that efficiently serves a T2I workflow through a holistic approach. SwiftDiffusion decouples ControlNet from the base model and deploys it as a separate, independently scaled service on dedicated GPUs, enabling ControlNet caching, parallelization, and sharing. To mitigate the high loading overhead of LoRA serving, SwiftDiffusion employs a bounded asynchronous LoRA loading (BAL) technique, allowing LoRA loading to overlap with the initial base model execution by up to k steps without compromising image quality. Furthermore, SwiftDiffusion optimizes base model execution with a novel latent parallelism technique. Collectively, these designs enable SwiftDiffusion to outperform the state-of-the-art T2I serving systems, achieving up to 7.8x latency reduction and 1.6x throughput improvement in serving SDXL models on H800 GPUs, without sacrificing image quality.
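To make the bounded asynchronous LoRA loading (BAL) idea concrete, here is a minimal sketch of the control flow the abstract implies: LoRA weights are fetched in a background thread while the base model executes its first denoising steps, and the merge is forced no later than step k. All names here (load_lora_weights, merge_lora, and the model/scheduler interfaces) are hypothetical stand-ins, not SwiftDiffusion's actual API.

```python
# Hedged sketch of bounded asynchronous LoRA loading (BAL).
# The helper functions below are hypothetical placeholders,
# not SwiftDiffusion's real implementation.
from concurrent.futures import ThreadPoolExecutor

def load_lora_weights(path):
    """Placeholder: deserialize LoRA matrices from disk/host memory to GPU."""
    return {"A": ..., "B": ...}

def merge_lora(base_model, lora_weights):
    """Placeholder: fold the low-rank updates into the base model's weights."""
    pass

def generate(base_model, scheduler, latents, prompt_emb, lora_path, k):
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start LoRA loading in the background instead of blocking the request.
        lora_future = pool.submit(load_lora_weights, lora_path)
        merged = False
        for step, t in enumerate(scheduler.timesteps):
            # Bounded overlap: merge as soon as the weights arrive, but never
            # later than step k, so at most the first k steps run LoRA-free.
            if not merged and (lora_future.done() or step >= k):
                merge_lora(base_model, lora_future.result())  # blocks iff late
                merged = True
            noise_pred = base_model(latents, t, prompt_emb)
            latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```

The bound k trades latency hiding against fidelity: the abstract reports that overlapping up to k steps does not compromise image quality; intuitively, the earliest denoising steps shape coarse image structure, so applying the LoRA slightly late has limited visual effect.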
