DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (2401.09670v3)
Abstract: DistServe improves the performance of LLM serving by disaggregating the prefill and decoding computation. Existing LLM serving systems colocate the two phases and batch prefill and decoding computation across all users and requests. We find that this strategy not only leads to strong prefill-decoding interference but also couples the resource allocation and parallelism plans of both phases. LLM applications often emphasize an individual latency target for each phase: time to first token (TTFT) for the prefill phase and time per output token (TPOT) for the decoding phase. In the presence of stringent latency requirements, existing systems have to either prioritize one latency over the other or over-provision compute resources to meet both. DistServe assigns prefill and decoding computation to different GPUs, thereby eliminating prefill-decoding interference. Given the application's TTFT and TPOT requirements, DistServe co-optimizes the resource allocation and parallelism strategy tailored to each phase. DistServe also places the two phases according to the serving cluster's bandwidth to minimize the communication introduced by disaggregation. As a result, DistServe significantly improves LLM serving performance, measured as the maximum request rate that can be served per GPU while meeting both TTFT and TPOT constraints. Our evaluation shows that across popular LLMs, applications, and latency requirements, DistServe can serve 7.4x more requests or meet a 12.6x tighter SLO than state-of-the-art systems, while keeping more than 90% of requests within their latency constraints.
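To make the goodput objective concrete, below is a minimal Python sketch of the search structure the abstract describes: split a fixed GPU budget between prefill and decoding instances, and pick the split that maximizes the per-GPU request rate at which both the TTFT and TPOT SLOs are met. The latency models (`prefill_ttft`, `decode_tpot`) and all constants are hypothetical placeholders for illustration, not DistServe's actual simulator or algorithm.

```python
# Sketch of goodput-driven prefill/decode disaggregation, assuming
# made-up latency models. "Goodput" here: the highest per-GPU request
# rate at which BOTH the TTFT and TPOT SLOs are satisfied.

TTFT_SLO = 0.4   # seconds: time-to-first-token target (prefill phase)
TPOT_SLO = 0.04  # seconds: time-per-output-token target (decoding phase)

def prefill_ttft(rate_per_gpu: float) -> float:
    """Hypothetical TTFT model: service time plus M/M/1-style queueing."""
    service = 0.15                       # assumed prefill time per request
    util = min(rate_per_gpu * service, 0.99)
    return service / (1.0 - util)        # blows up as utilization nears 1

def decode_tpot(rate_per_gpu: float) -> float:
    """Hypothetical TPOT model: per-token time grows with batch pressure."""
    base = 0.02                          # assumed per-token time at low load
    return base * (1.0 + rate_per_gpu / 20.0)

def best_split(total_gpus: int) -> tuple[float, int, int]:
    """Try every split of GPUs between phases; return (goodput-per-GPU,
    prefill GPUs, decode GPUs) for the best one."""
    best = (0.0, 0, 0)
    for p_gpus in range(1, total_gpus):
        d_gpus = total_gpus - p_gpus
        # Binary-search the highest cluster-wide rate meeting both SLOs.
        lo, hi = 0.0, 1000.0
        for _ in range(40):
            rate = (lo + hi) / 2
            ok = (prefill_ttft(rate / p_gpus) <= TTFT_SLO
                  and decode_tpot(rate / d_gpus) <= TPOT_SLO)
            lo, hi = (rate, hi) if ok else (lo, rate)
        per_gpu = lo / total_gpus
        if per_gpu > best[0]:
            best = (per_gpu, p_gpus, d_gpus)
    return best

if __name__ == "__main__":
    gp, p, d = best_split(8)
    print(f"best split: {p} prefill / {d} decode GPUs, "
          f"goodput ~ {gp:.2f} req/s per GPU")
```

Because the two phases stress the hardware differently (compute-bound prefill vs. memory-bound decoding), the optimal split is rarely even; under the placeholder models above, prefill queueing dominates and the search allocates most GPUs to prefill. The real system additionally searches parallelism strategies per phase and constrains placement by inter-GPU bandwidth.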
Authors: Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang