DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (2401.09670v3)

Published 18 Jan 2024 in cs.DC

Abstract: DistServe improves the performance of LLM serving by disaggregating the prefill and decoding computation. Existing LLM serving systems colocate the two phases and batch the computation of prefill and decoding across all users and requests. We find that this strategy not only leads to strong prefill-decoding interferences but also couples the resource allocation and parallelism plans for both phases. LLM applications often emphasize individual latency for each phase: time to first token (TTFT) for the prefill phase and time per output token (TPOT) of each request for the decoding phase. In the presence of stringent latency requirements, existing systems have to prioritize one latency over the other, or over-provision compute resources to meet both. DistServe assigns prefill and decoding computation to different GPUs, hence eliminating prefill-decoding interferences. Given the application's TTFT and TPOT requirements, DistServe co-optimizes the resource allocation and parallelism strategy tailored for each phase. DistServe also places the two phases according to the serving cluster's bandwidth to minimize the communication caused by disaggregation. As a result, DistServe significantly improves LLM serving performance in terms of the maximum rate that can be served within both TTFT and TPOT constraints on each GPU. Our evaluations show that on various popular LLMs, applications, and latency requirements, DistServe can serve 7.4x more requests or 12.6x tighter SLO, compared to state-of-the-art systems, while staying within latency constraints for > 90% of requests.

Overview of 'DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving'

The paper "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving" addresses inefficiencies inherent in colocating the prefill and decoding phases within traditional serving systems. It introduces DistServe, a serving system that disaggregates these phases in order to optimize goodput, defined as the maximum request rate that can be served while meeting service-level objective (SLO) attainment targets.
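To make the goodput metric concrete, the sketch below (illustrative, not the paper's code) computes SLO attainment from per-request TTFT and TPOT samples gathered at several tested request rates and reports the highest rate that still reaches a 90% attainment target; the sample latencies, rates, and SLO values are assumptions.

```python
# A minimal sketch (not the paper's code) of goodput measurement: the highest
# request rate whose per-request latencies still satisfy both SLOs for at
# least 90% of requests. The measurement dict, rates, and SLO values below
# are illustrative assumptions.

def slo_attainment(ttfts, tpots, ttft_slo, tpot_slo):
    """Fraction of requests meeting both the TTFT and TPOT SLOs."""
    met = sum(1 for t, p in zip(ttfts, tpots) if t <= ttft_slo and p <= tpot_slo)
    return met / len(ttfts)

def goodput(results_by_rate, ttft_slo, tpot_slo, target=0.9):
    """Highest tested request rate (req/s) whose SLO attainment reaches the target.

    results_by_rate maps a tested rate to (ttft_samples, tpot_samples).
    """
    feasible = [rate for rate, (ttfts, tpots) in results_by_rate.items()
                if slo_attainment(ttfts, tpots, ttft_slo, tpot_slo) >= target]
    return max(feasible, default=0.0)

# Two hypothetical load-test runs; latencies in seconds.
measurements = {
    2.0: ([0.21, 0.35, 0.30, 0.28], [0.030, 0.028, 0.033, 0.031]),
    4.0: ([0.55, 0.80, 0.47, 0.62], [0.045, 0.052, 0.041, 0.049]),
}
print(goodput(measurements, ttft_slo=0.4, tpot_slo=0.040))  # -> 2.0
```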

DistServe Architecture and Approach

DistServe separates the prefill and decoding phases, allocating them to different GPUs and thereby eliminating the prefill-decoding interference typically observed in colocated systems. This separation makes it possible to optimize the resource allocation and parallelism strategy for each phase independently, so DistServe can target each phase's latency requirement directly: time to first token (TTFT) for prefill and time per output token (TPOT) for decoding.
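The following sketch illustrates that request flow with placeholder model calls and hypothetical worker and queue names; in the actual system each worker pool runs on its own GPUs and the handoff moves real KV-cache tensors rather than a string.

```python
# Illustrative sketch (not DistServe's implementation) of the disaggregated
# request flow: a prefill worker produces the first token plus the KV cache,
# then hands the request over a queue to a separate decode worker.

import threading
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    kv_cache: object = None                      # produced by prefill
    output: list = field(default_factory=list)

def prefill_worker(inbox: Queue, handoff: Queue):
    """Process the whole prompt in one pass; TTFT is determined here."""
    while (req := inbox.get()) is not None:
        req.kv_cache = f"kv({req.prompt})"       # stand-in for real KV tensors
        req.output.append("<tok0>")              # first token
        handoff.put(req)                         # KV cache migrates to decode GPUs
    handoff.put(None)

def decode_worker(handoff: Queue, done: Queue):
    """Extend each request one token per step; each step bounds TPOT."""
    while (req := handoff.get()) is not None:
        while len(req.output) < req.max_new_tokens:
            req.output.append("<tok>")
        done.put(req)

inbox, handoff, done = Queue(), Queue(), Queue()
threading.Thread(target=prefill_worker, args=(inbox, handoff), daemon=True).start()
threading.Thread(target=decode_worker, args=(handoff, done), daemon=True).start()
inbox.put(Request("Hello, world", max_new_tokens=4))
inbox.put(None)
print(done.get().output)                         # ['<tok0>', '<tok>', '<tok>', '<tok>']
```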

DistServe's architecture takes advantage of the distinct computational characteristics of the prefill and decoding phases, adapting its strategy to both contemporary hardware and stringent application requirements. Key to its implementation is how each phase's workload is mapped across the cluster: DistServe applies model parallelism, both intra- and inter-operator, and uses a bandwidth-aware placement algorithm to keep inter-GPU communication from becoming a bottleneck.
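A hedged sketch of this kind of search space follows: it enumerates intra-operator (tensor-parallel) and inter-operator (pipeline-parallel) degrees for the prefill and decoding phases separately under a fixed GPU budget and keeps the plan that a supplied goodput estimator scores highest. The estimator here is a toy stand-in, not the paper's simulator.

```python
# Sketch of a per-phase parallelism search under a GPU budget; the scoring
# function is a hypothetical placeholder.

from itertools import product

def candidate_configs(gpus):
    """All (intra-op, inter-op) parallelism splits of a phase's GPU budget."""
    for tp, pp in product((1, 2, 4, 8), repeat=2):
        if tp * pp == gpus:
            yield tp, pp

def best_disaggregated_plan(total_gpus, estimate_goodput):
    """Search per-phase GPU splits and parallelism; return the best-scoring plan."""
    best = None
    for prefill_gpus in range(1, total_gpus):
        decode_gpus = total_gpus - prefill_gpus
        for p_cfg in candidate_configs(prefill_gpus):
            for d_cfg in candidate_configs(decode_gpus):
                score = estimate_goodput(p_cfg, d_cfg)
                if best is None or score > best[0]:
                    best = (score, p_cfg, d_cfg)
    return best

def toy_estimator(prefill_cfg, decode_cfg):
    # Purely to make the sketch runnable: favor intra-op parallelism for the
    # latency-bound prefill phase and more inter-op capacity for decoding.
    return 1.5 * prefill_cfg[0] + decode_cfg[1]

print(best_disaggregated_plan(8, toy_estimator))
# -> (10.0, (4, 1), (1, 4)) under this toy scoring
```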

Performance Evaluation

Evaluations demonstrate significant improvements in serving efficiency. Across various LLMs, applications, and latency requirements, DistServe serves up to 7.4 times more requests, or meets up to 12.6 times tighter SLO constraints, than existing state-of-the-art systems, while staying within latency requirements for over 90% of requests. This gain is attributable primarily to the elimination of prefill-decoding interference and to the per-phase resource allocation and parallelism that disaggregation enables.

Through a comprehensive analysis, the paper establishes that the communication overhead introduced by disaggregation is negligible within modern GPU infrastructure, especially when considering network architectures equipped with sufficient intra-node bandwidth. Indeed, DistServe's execution strategy emphasizes efficiency in both compute-bound prefill operations and memory-bound decoding tasks, leveraging workload characteristics to determine optimal parallelism configurations.
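A back-of-envelope estimate of that communication overhead, using assumed OPT-66B-like model shapes and nominal link bandwidths rather than figures from the paper, is sketched below.

```python
# Back-of-envelope check (assumed, OPT-66B-like shapes; numbers are
# illustrative, not taken from the paper's tables) of why the KV-cache
# handoff introduced by disaggregation can be kept cheap.

def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_elem=2):
    # Keys and values (factor 2), fp16 elements (2 bytes each by default).
    return 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_elem

size = kv_cache_bytes(num_layers=64, num_heads=72, head_dim=128, seq_len=512)
print(f"KV cache for a 512-token prompt: {size / 2**30:.2f} GiB")  # ~1.1 GiB

# Moving ~1 GiB over NVLink-class intra-node bandwidth (~300 GB/s) takes a few
# milliseconds, small next to a prefill step that itself runs for hundreds of
# milliseconds at this scale; a ~25 GB/s cross-node link is an order of
# magnitude slower, which is why bandwidth-aware placement prefers keeping a
# prefill/decode pair inside one node.
for link, bandwidth in [("intra-node (~300 GB/s)", 300e9), ("cross-node (~25 GB/s)", 25e9)]:
    print(f"{link}: {size / bandwidth * 1e3:.1f} ms")
```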

Analytical Insights and Methodology

A significant portion of the paper focuses on the algorithmic and simulation-backed development of DistServe's design. The authors employ detailed modeling of LLM inference latency, capitalizing on predictable workload patterns to accurately simulate and optimize system configurations without extensive real-world testing. This decision-making is formalized in a two-stage placement algorithm that operates within the constraints of available hardware, optimizing parallelism through rigorous simulation rather than empirical trial and error.
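The sketch below captures the flavor of that simulation-driven methodology under stated assumptions: synthetic Poisson arrivals are pushed through a simple analytical latency model for a single prefill instance, and TTFT SLO attainment is read off at several request rates. The latency coefficients and workload distribution are placeholders, not fitted values from the paper.

```python
# Hedged sketch of simulation-based evaluation: estimate TTFT SLO attainment
# for one prefill instance without touching real hardware.

import random

def simulate_prefill_ttfts(rate, num_requests, per_token=0.5e-3, overhead=20e-3, seed=0):
    """TTFT samples for a single prefill instance serving requests FCFS."""
    rng = random.Random(seed)
    now, gpu_free_at, ttfts = 0.0, 0.0, []
    for _ in range(num_requests):
        now += rng.expovariate(rate)                        # Poisson arrivals
        prompt_len = rng.randint(128, 1024)                 # synthetic prompt lengths
        start = max(now, gpu_free_at)                       # queue behind earlier prefills
        finish = start + per_token * prompt_len + overhead  # compute-bound latency model
        ttfts.append(finish - now)
        gpu_free_at = finish
    return ttfts

def attainment(samples, slo):
    return sum(s <= slo for s in samples) / len(samples)

for rate in (1.0, 2.0, 3.0):
    ttfts = simulate_prefill_ttfts(rate, num_requests=2000)
    print(f"rate={rate:.1f} req/s  TTFT attainment @ 0.6 s SLO: "
          f"{attainment(ttfts, 0.6):.1%}")
```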

The paper’s analytical framework effectively addresses potential pitfalls in scalability and execution efficiency, presenting a scalable approach that accounts for variable input lengths and heterogeneous network topologies. It provides a clear methodological advancement for deploying LLMs in environments where both computational efficiency and latency minimization are critical.

Implications and Future Work

The implications of DistServe extend beyond immediate performance gains. The flexible disaggregation strategy outlined could inform future developments in LLM service architecture, particularly as models grow in complexity and size. Moreover, this approach may inspire similar optimizations across other domains of distributed computing where task interference and resource coupling are prevalent.

The theoretical and practical insights on disaggregation and parallelism optimization could encourage further exploration into dynamic, need-based resource allocation models, potentially incorporating real-time adaptation to workload shifts. Future research might also explore integrating fault tolerance and advanced scheduling methods to mitigate the risks associated with fault propagation in disaggregated systems.

In essence, DistServe represents a significant stride toward efficient and cost-effective LLM deployment, setting a precedent for next-generation AI serving systems that face increasingly common requirements for high-throughput, low-latency service delivery.

Authors (8)
  1. Yinmin Zhong
  2. Shengyu Liu
  3. Junda Chen
  4. Jianbo Hu
  5. Yibo Zhu
  6. Xuanzhe Liu
  7. Xin Jin
  8. Hao Zhang