DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (2401.09670v3)
Abstract: DistServe improves the performance of LLM serving by disaggregating the prefill and decoding computation. Existing LLM serving systems colocate the two phases and batch prefill and decoding computation across all users and requests. We find that this strategy not only leads to strong prefill-decoding interference but also couples the resource allocation and parallelism plans of both phases. LLM applications often emphasize an individual latency target for each phase: time to first token (TTFT) for the prefill phase and time per output token (TPOT) for the decoding phase. In the presence of stringent latency requirements, existing systems have to either prioritize one latency over the other or over-provision compute resources to meet both. DistServe assigns prefill and decoding computation to different GPUs, thereby eliminating prefill-decoding interference. Given the application's TTFT and TPOT requirements, DistServe co-optimizes the resource allocation and parallelism strategy tailored to each phase. DistServe also places the two phases according to the serving cluster's bandwidth to minimize the communication introduced by disaggregation. As a result, DistServe significantly improves LLM serving performance, measured as the maximum request rate that can be served per GPU while meeting both TTFT and TPOT constraints. Our evaluation shows that across popular LLMs, applications, and latency requirements, DistServe can serve 7.4x more requests or meet a 12.6x tighter SLO than state-of-the-art systems, while keeping more than 90% of requests within their latency constraints.
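To make the goodput objective concrete, below is a minimal Python sketch of the search structure the abstract describes: split a fixed GPU budget between prefill and decoding instances, and pick the split that maximizes the per-GPU request rate at which both the TTFT and TPOT SLOs are met. The latency models (`prefill_ttft`, `decode_tpot`) and all constants are hypothetical placeholders for illustration, not DistServe's actual simulator or algorithm.

```python
# Sketch of goodput-driven prefill/decode disaggregation, assuming
# made-up latency models. "Goodput" here: the highest per-GPU request
# rate at which BOTH the TTFT and TPOT SLOs are satisfied.

TTFT_SLO = 0.4   # seconds: time-to-first-token target (prefill phase)
TPOT_SLO = 0.04  # seconds: time-per-output-token target (decoding phase)

def prefill_ttft(rate_per_gpu: float) -> float:
    """Hypothetical TTFT model: service time plus M/M/1-style queueing."""
    service = 0.15                       # assumed prefill time per request
    util = min(rate_per_gpu * service, 0.99)
    return service / (1.0 - util)        # blows up as utilization nears 1

def decode_tpot(rate_per_gpu: float) -> float:
    """Hypothetical TPOT model: per-token time grows with batch pressure."""
    base = 0.02                          # assumed per-token time at low load
    return base * (1.0 + rate_per_gpu / 20.0)

def best_split(total_gpus: int) -> tuple[float, int, int]:
    """Try every split of GPUs between phases; return (goodput-per-GPU,
    prefill GPUs, decode GPUs) for the best one."""
    best = (0.0, 0, 0)
    for p_gpus in range(1, total_gpus):
        d_gpus = total_gpus - p_gpus
        # Binary-search the highest cluster-wide rate meeting both SLOs.
        lo, hi = 0.0, 1000.0
        for _ in range(40):
            rate = (lo + hi) / 2
            ok = (prefill_ttft(rate / p_gpus) <= TTFT_SLO
                  and decode_tpot(rate / d_gpus) <= TPOT_SLO)
            lo, hi = (rate, hi) if ok else (lo, rate)
        per_gpu = lo / total_gpus
        if per_gpu > best[0]:
            best = (per_gpu, p_gpus, d_gpus)
    return best

if __name__ == "__main__":
    gp, p, d = best_split(8)
    print(f"best split: {p} prefill / {d} decode GPUs, "
          f"goodput ~ {gp:.2f} req/s per GPU")
```

Because the two phases stress the hardware differently (compute-bound prefill vs. memory-bound decoding), the optimal split is rarely even; under the placeholder models above, prefill queueing dominates and the search allocates most GPUs to prefill. The real system additionally searches parallelism strategies per phase and constrains placement by inter-GPU bandwidth.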
Authors: Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang