Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve (2403.02310v3)

Published 4 Mar 2024 in cs.LG and cs.DC

Abstract: Each LLM serving request goes through two phases. The first is prefill which processes the entire input prompt and produces the first output token and the second is decode which generates the rest of output tokens, one-at-a-time. Prefill iterations have high latency but saturate GPU compute due to parallel processing of the input prompt. In contrast, decode iterations have low latency but also low compute utilization because a decode iteration processes only a single token per request. This makes batching highly effective for decodes and consequently for overall throughput. However, batching multiple requests leads to an interleaving of prefill and decode iterations which makes it challenging to achieve both high throughput and low latency. We introduce an efficient LLM inference scheduler, Sarathi-Serve, to address this throughput-latency tradeoff. Sarathi-Serve introduces chunked-prefills which splits a prefill request into near equal sized chunks and creates stall-free schedules that adds new requests in a batch without pausing ongoing decodes. Stall-free scheduling unlocks the opportunity to improve throughput with large batch sizes while minimizing the effect of batching on latency. Furthermore, uniform batches in Sarathi-Serve ameliorate the imbalance between iterations resulting in minimal pipeline bubbles. Our techniques yield significant improvements in inference performance across models and hardware under tail latency constraints. For Mistral-7B on single A100 GPUs, we achieve 2.6x higher serving capacity and up to 3.7x higher serving capacity for the Yi-34B model on two A100 GPUs as compared to vLLM. When used with pipeline parallelism on Falcon-180B, Sarathi-Serve provides up to 5.6x gain in the end-to-end serving capacity. The source code for Sarathi-Serve is available at https://github.com/microsoft/sarathi-serve.

Analyzing the Trade-offs in LLM Inference Through Sarathi-Serve

The paper "Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve" presents an advanced scheduling approach for optimizing the serving of LLMs, focusing on the dual aspects of throughput and latency. The methodologies employed by the authors build upon prior work encapsulated in Sarathi, and they introduce innovations aimed at reducing the existing trade-offs faced in contemporary LLM inference systems. This discussion intends to examine and convey the foundational principles, notable results, and potential developments for the AI research community.

Core Mechanisms and Contributions

The authors introduce Sarathi-Serve, an inference scheduler that capitalizes on two key strategies: chunked-prefills and stall-free batching. The concept of chunked-prefills dissects lengthy prompt-prefill operations into manageable segments, distributing them across inference iterations. By doing so, the system alleviates the latency spikes typically introduced by full-length prefill operations in iteration-level batching systems like vLLM and Orca.
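
To make the mechanism concrete, the following minimal Python sketch (not the authors' implementation; the function name `chunk_prefill` and its interface are illustrative) splits a long prompt's prefill into near-equal chunks, each bounded by a per-iteration token budget:

```python
def chunk_prefill(prompt_len: int, token_budget: int) -> list[tuple[int, int]]:
    """Split a prefill of `prompt_len` tokens into near-equal chunks,
    each no larger than `token_budget` tokens.

    Returns (start, end) token ranges, one per scheduled iteration.
    """
    if prompt_len <= token_budget:
        return [(0, prompt_len)]
    num_chunks = -(-prompt_len // token_budget)  # ceiling division
    base, rem = divmod(prompt_len, num_chunks)   # near-equal chunk sizes
    sizes = [base + (1 if i < rem else 0) for i in range(num_chunks)]
    ranges, start = [], 0
    for size in sizes:
        ranges.append((start, start + size))
        start += size
    return ranges

# Example: a 4096-token prompt under a 1024-token budget is processed over 4 iterations.
print(chunk_prefill(4096, 1024))  # [(0, 1024), (1024, 2048), (2048, 3072), (3072, 4096)]
```

Because each chunk stays within the token budget, no single iteration's latency balloons the way a monolithic prefill of a long prompt would.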

Stall-free batching complements this by allowing decode operations to proceed uninterrupted even as new requests are admitted. Unlike decode-prioritizing schedulers, which sacrifice throughput by holding back new prefills until ongoing decodes complete, Sarathi-Serve forms mixed batches of prefill chunks and decode tokens without blocking either, maintaining both low time-between-tokens (TBT) and high throughput.
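
The admission logic can be sketched as follows. This is a hedged simplification, not the paper's scheduler: the `Request` class and field names are invented for illustration, and memory management, request admission, and preemption are omitted. It assumes each ongoing decode consumes one token of the budget and prefill chunks fill whatever budget remains:

```python
from dataclasses import dataclass

@dataclass
class Request:
    id: int
    remaining_prefill: int  # prompt tokens not yet prefilled; 0 means the request is decoding

def build_batch(running: list[Request], waiting: list[Request], token_budget: int):
    """Form one hybrid iteration: every ongoing decode first (never stalled),
    then as many prefill-chunk tokens as the leftover budget allows."""
    batch, budget = [], token_budget

    # 1. Admit every ongoing decode: one token per request, so decodes never pause.
    for req in running:
        if req.remaining_prefill == 0 and budget > 0:
            batch.append((req.id, "decode", 1))
            budget -= 1

    # 2. Spend the leftover budget on prefill chunks (running requests first, then waiting).
    for req in running + waiting:
        if budget == 0:
            break
        if req.remaining_prefill > 0:
            chunk = min(req.remaining_prefill, budget)
            batch.append((req.id, "prefill", chunk))
            req.remaining_prefill -= chunk
            budget -= chunk
    return batch

# One ongoing decode plus a newly arrived 600-token prompt, with a 512-token budget:
print(build_batch([Request(0, 0)], [Request(1, 600)], token_budget=512))
# [(0, 'decode', 1), (1, 'prefill', 511)] -- the new prefill is chunked; the decode is not delayed
```

Because step 1 always runs every ongoing decode, no request's token generation is ever paused; step 2 simply piggybacks as much prefill work as the remaining budget allows.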

Quantitative Evaluation

The evaluations carried out across different models, including Mistral-7B and Falcon-180B, and on diverse hardware configurations, demonstrate the efficacy of Sarathi-Serve. The paper reports a significant improvement in serving capacity, with up to 2.6x enhancement on a single A100 GPU for Mistral-7B and up to 6.9x on eight A100 GPUs for the Falcon-180B model, compared to existing systems such as Orca and vLLM.

The results also show that Sarathi-Serve holds steady under stringent P99 TBT SLOs, effectively mitigating the generation stalls that plague prefill-prioritizing systems during inference. These gains come from combining prefills and decodes within a per-iteration token budget, which is determined offline by profiling iteration latency against the TBT SLO.
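
The token budget itself can be selected by profiling per-iteration latency as a function of the number of batched tokens and picking the largest budget whose latency fits the TBT SLO. A minimal sketch, with entirely illustrative profile numbers (not taken from the paper):

```python
def pick_token_budget(profile: dict[int, float], tbt_slo_ms: float) -> int:
    """Given profiled iteration latencies (tokens per iteration -> milliseconds),
    return the largest token budget that still meets the TBT SLO."""
    feasible = [tokens for tokens, latency in profile.items() if latency <= tbt_slo_ms]
    if not feasible:
        raise ValueError("No token budget satisfies the TBT SLO on this hardware")
    return max(feasible)

# Hypothetical profile for a 7B model on one A100 (illustrative numbers only).
profile_ms = {256: 18.0, 512: 27.0, 1024: 45.0, 2048: 80.0}
print(pick_token_budget(profile_ms, tbt_slo_ms=50.0))  # -> 1024
```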

Theoretical and Practical Implications

From a theoretical perspective, the paper highlights the intricacies of harmonizing throughput and latency and shows how scheduling alone can relax what appears to be an innate trade-off. The chunked-prefill method rests on GPU execution characteristics: decode iterations are memory-bound and leave compute slack, which prefill chunks can occupy without substantially increasing iteration latency or sacrificing prefill efficiency.
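
A back-of-the-envelope roofline calculation illustrates that slack. The numbers below are illustrative assumptions (a 7B-parameter fp16 model on a single A100, ignoring attention FLOPs and KV-cache reads), not figures from the paper:

```python
# Why decode iterations have compute slack: a simplified roofline estimate.
params     = 7e9        # model parameters (assumed 7B)
bytes_fp16 = 2          # bytes per parameter in fp16
peak_flops = 312e12     # A100 fp16 tensor-core peak, FLOP/s
peak_bw    = 2.0e12     # A100 HBM bandwidth, bytes/s (~2 TB/s)

def iteration_time_s(num_tokens: int) -> float:
    """An iteration is bounded by the slower of (a) streaming all weights from
    HBM and (b) the linear-layer FLOPs (~2 * params per token) for the batch."""
    t_memory  = params * bytes_fp16 / peak_bw
    t_compute = 2 * params * num_tokens / peak_flops
    return max(t_memory, t_compute)

# t_compute overtakes t_memory only once the batch has roughly
# bytes_fp16 * peak_flops / (2 * peak_bw) ~ 156 tokens.
print(round(iteration_time_s(32) * 1e3, 1))         # ~7.0 ms: a 32-token decode batch is memory-bound
print(round(iteration_time_s(32 + 120) * 1e3, 1))   # ~7.0 ms: a modest piggybacked prefill chunk is nearly free
print(round(iteration_time_s(32 + 1024) * 1e3, 1))  # ~47.4 ms: an oversized chunk becomes compute-bound
```

In this simplified model, iteration cost is dominated by streaming the weights until the batched token count approaches the hardware's FLOPs-to-bandwidth ratio, which is why a profiled token budget caps how much prefill is piggybacked onto each decode iteration.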

Practically, the implementation and open-source availability of Sarathi-Serve carry substantial implications for LLM deployment, especially for latency-sensitive applications such as conversational agents and real-time interactive systems.

Future Directions in AI Research

Sarathi-Serve sets a precedent in the context of handling LLM inference workloads, yet it opens up several avenues for further exploration. Future research can explore dynamic token budgeting that better adapts to workload shifts, reducing overhead while maintaining seamless batching operations. Additionally, merging Sarathi-Serve’s mechanisms with distributed LLM architectures could potentially enhance its scalability across different network configurations and LLMs of varying complexity.

Moreover, considerations around integrating Sarathi-Serve with multi-modal models and distributed serving frameworks may well form the next frontier, where adaptation to data diversity and distribution constraints will require further innovation.

In conclusion, the development of Sarathi-Serve marks a methodical advancement in the field of LLM serving frameworks, providing robust solutions to persistent challenges related to throughput and latency in AI infrastructure. This work not only contributes to efficiency improvements but also enables more nuanced and advanced applications of AI in real-world systems.

Authors (8)
  1. Amey Agrawal
  2. Nitin Kedia
  3. Ashish Panwar
  4. Jayashree Mohan
  5. Nipun Kwatra
  6. Bhargav S. Gulavani
  7. Alexey Tumanov
  8. Ramachandran Ramjee