Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve (2403.02310v3)

Published 4 Mar 2024 in cs.LG and cs.DC

Abstract: Each LLM serving request goes through two phases. The first is prefill which processes the entire input prompt and produces the first output token and the second is decode which generates the rest of output tokens, one-at-a-time. Prefill iterations have high latency but saturate GPU compute due to parallel processing of the input prompt. In contrast, decode iterations have low latency but also low compute utilization because a decode iteration processes only a single token per request. This makes batching highly effective for decodes and consequently for overall throughput. However, batching multiple requests leads to an interleaving of prefill and decode iterations which makes it challenging to achieve both high throughput and low latency. We introduce an efficient LLM inference scheduler, Sarathi-Serve, to address this throughput-latency tradeoff. Sarathi-Serve introduces chunked-prefills which splits a prefill request into near equal sized chunks and creates stall-free schedules that adds new requests in a batch without pausing ongoing decodes. Stall-free scheduling unlocks the opportunity to improve throughput with large batch sizes while minimizing the effect of batching on latency. Furthermore, uniform batches in Sarathi-Serve ameliorate the imbalance between iterations resulting in minimal pipeline bubbles. Our techniques yield significant improvements in inference performance across models and hardware under tail latency constraints. For Mistral-7B on single A100 GPUs, we achieve 2.6x higher serving capacity and up to 3.7x higher serving capacity for the Yi-34B model on two A100 GPUs as compared to vLLM. When used with pipeline parallelism on Falcon-180B, Sarathi-Serve provides up to 5.6x gain in the end-to-end serving capacity. The source code for Sarathi-Serve is available at https://github.com/microsoft/sarathi-serve.

Analyzing the Trade-offs in LLM Inference Through Sarathi-Serve

The paper "Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve" presents an advanced scheduling approach for optimizing the serving of LLMs, focusing on the dual aspects of throughput and latency. The methodologies employed by the authors build upon prior work encapsulated in Sarathi, and they introduce innovations aimed at reducing the existing trade-offs faced in contemporary LLM inference systems. This discussion intends to examine and convey the foundational principles, notable results, and potential developments for the AI research community.

Core Mechanisms and Contributions

The authors introduce Sarathi-Serve, an inference scheduler that capitalizes on two key strategies: chunked-prefills and stall-free batching. The concept of chunked-prefills dissects lengthy prompt-prefill operations into manageable segments, distributing them across inference iterations. By doing so, the system alleviates the latency spikes typically introduced by full-length prefill operations in iteration-level batching systems like vLLM and Orca.
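
To make the mechanism concrete, the following minimal Python sketch (not the authors' implementation; the function name `chunk_prefill` and its interface are illustrative) splits a long prompt's prefill into near-equal chunks, each bounded by a per-iteration token budget:

```python
def chunk_prefill(prompt_len: int, token_budget: int) -> list[tuple[int, int]]:
    """Split a prefill of `prompt_len` tokens into near-equal chunks,
    each no larger than `token_budget` tokens.

    Returns (start, end) token ranges, one per scheduled iteration.
    """
    if prompt_len <= token_budget:
        return [(0, prompt_len)]
    num_chunks = -(-prompt_len // token_budget)  # ceiling division
    base, rem = divmod(prompt_len, num_chunks)   # near-equal chunk sizes
    sizes = [base + (1 if i < rem else 0) for i in range(num_chunks)]
    ranges, start = [], 0
    for size in sizes:
        ranges.append((start, start + size))
        start += size
    return ranges

# Example: a 4096-token prompt under a 1024-token budget is processed over 4 iterations.
print(chunk_prefill(4096, 1024))  # [(0, 1024), (1024, 2048), (2048, 3072), (3072, 4096)]
```

Because each chunk stays within the token budget, no single iteration's latency balloons the way a monolithic prefill of a long prompt would.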

Stall-free batching complements this by allowing decode operations to proceed uninterrupted even as new requests are admitted. Unlike decode-prioritizing schedulers, which sacrifice throughput by holding back new prefills until ongoing decodes complete, Sarathi-Serve forms mixed batches of prefill chunks and decode tokens without blocking either, maintaining both low time-between-tokens (TBT) and high throughput.
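
The admission logic can be sketched as follows. This is a hedged simplification, not the paper's scheduler: the `Request` class and field names are invented for illustration, and memory management, request admission, and preemption are omitted. It assumes each ongoing decode consumes one token of the budget and prefill chunks fill whatever budget remains:

```python
from dataclasses import dataclass

@dataclass
class Request:
    id: int
    remaining_prefill: int  # prompt tokens not yet prefilled; 0 means the request is decoding

def build_batch(running: list[Request], waiting: list[Request], token_budget: int):
    """Form one hybrid iteration: every ongoing decode first (never stalled),
    then as many prefill-chunk tokens as the leftover budget allows."""
    batch, budget = [], token_budget

    # 1. Admit every ongoing decode: one token per request, so decodes never pause.
    for req in running:
        if req.remaining_prefill == 0 and budget > 0:
            batch.append((req.id, "decode", 1))
            budget -= 1

    # 2. Spend the leftover budget on prefill chunks (running requests first, then waiting).
    for req in running + waiting:
        if budget == 0:
            break
        if req.remaining_prefill > 0:
            chunk = min(req.remaining_prefill, budget)
            batch.append((req.id, "prefill", chunk))
            req.remaining_prefill -= chunk
            budget -= chunk
    return batch

# One ongoing decode plus a newly arrived 600-token prompt, with a 512-token budget:
print(build_batch([Request(0, 0)], [Request(1, 600)], token_budget=512))
# [(0, 'decode', 1), (1, 'prefill', 511)] -- the new prefill is chunked; the decode is not delayed
```

Because step 1 always runs every ongoing decode, no request's token generation is ever paused; step 2 simply piggybacks as much prefill work as the remaining budget allows.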

Quantitative Evaluation

The evaluations carried out across different models, including Mistral-7B and Falcon-180B, and on diverse hardware configurations, demonstrate the efficacy of Sarathi-Serve. The paper reports a significant improvement in serving capacity, with up to 2.6x enhancement on a single A100 GPU for Mistral-7B and up to 6.9x on eight A100 GPUs for the Falcon-180B model, compared to existing systems such as Orca and vLLM.

The results also show that Sarathi-Serve holds steady under stringent P99 TBT SLOs, effectively mitigating the generation stalls that plague prefill-prioritizing systems during inference. These gains come from combining prefills and decodes within a per-iteration token budget, which is determined offline by profiling iteration latency against the TBT SLO.
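
The token budget itself can be selected by profiling per-iteration latency as a function of the number of batched tokens and picking the largest budget whose latency fits the TBT SLO. A minimal sketch, with entirely illustrative profile numbers (not taken from the paper):

```python
def pick_token_budget(profile: dict[int, float], tbt_slo_ms: float) -> int:
    """Given profiled iteration latencies (tokens per iteration -> milliseconds),
    return the largest token budget that still meets the TBT SLO."""
    feasible = [tokens for tokens, latency in profile.items() if latency <= tbt_slo_ms]
    if not feasible:
        raise ValueError("No token budget satisfies the TBT SLO on this hardware")
    return max(feasible)

# Hypothetical profile for a 7B model on one A100 (illustrative numbers only).
profile_ms = {256: 18.0, 512: 27.0, 1024: 45.0, 2048: 80.0}
print(pick_token_budget(profile_ms, tbt_slo_ms=50.0))  # -> 1024
```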

Theoretical and Practical Implications

From a theoretical perspective, the paper highlights the intricacies of harmonizing throughput and latency and shows how scheduling alone can relax what appears to be an innate trade-off. The chunked-prefill method rests on GPU execution characteristics: decode iterations are memory-bound and leave compute slack, which prefill chunks can occupy without substantially increasing iteration latency or sacrificing prefill efficiency.
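
A back-of-the-envelope roofline calculation illustrates that slack. The numbers below are illustrative assumptions (a 7B-parameter fp16 model on a single A100, ignoring attention FLOPs and KV-cache reads), not figures from the paper:

```python
# Why decode iterations have compute slack: a simplified roofline estimate.
params     = 7e9        # model parameters (assumed 7B)
bytes_fp16 = 2          # bytes per parameter in fp16
peak_flops = 312e12     # A100 fp16 tensor-core peak, FLOP/s
peak_bw    = 2.0e12     # A100 HBM bandwidth, bytes/s (~2 TB/s)

def iteration_time_s(num_tokens: int) -> float:
    """An iteration is bounded by the slower of (a) streaming all weights from
    HBM and (b) the linear-layer FLOPs (~2 * params per token) for the batch."""
    t_memory  = params * bytes_fp16 / peak_bw
    t_compute = 2 * params * num_tokens / peak_flops
    return max(t_memory, t_compute)

# t_compute overtakes t_memory only once the batch has roughly
# bytes_fp16 * peak_flops / (2 * peak_bw) ~ 156 tokens.
print(round(iteration_time_s(32) * 1e3, 1))         # ~7.0 ms: a 32-token decode batch is memory-bound
print(round(iteration_time_s(32 + 120) * 1e3, 1))   # ~7.0 ms: a modest piggybacked prefill chunk is nearly free
print(round(iteration_time_s(32 + 1024) * 1e3, 1))  # ~47.4 ms: an oversized chunk becomes compute-bound
```

In this simplified model, iteration cost is dominated by streaming the weights until the batched token count approaches the hardware's FLOPs-to-bandwidth ratio, which is why a profiled token budget caps how much prefill is piggybacked onto each decode iteration.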

Practically, the implementation and open-source availability of Sarathi-Serve carry substantial implications for LLM deployment, especially for latency-sensitive applications such as conversational agents and real-time interactive systems.

Future Directions in AI Research

Sarathi-Serve sets a precedent in the context of handling LLM inference workloads, yet it opens up several avenues for further exploration. Future research can explore dynamic token budgeting that better adapts to workload shifts, reducing overhead while maintaining seamless batching operations. Additionally, merging Sarathi-Serve’s mechanisms with distributed LLM architectures could potentially enhance its scalability across different network configurations and LLMs of varying complexity.

Moreover, considerations around integrating Sarathi-Serve with multi-modal models and distributed serving frameworks may well form the next frontier, where adaptation to data diversity and distribution constraints will require further innovation.

In conclusion, the development of Sarathi-Serve marks a methodical advancement in the field of LLM serving frameworks, providing robust solutions to persistent challenges related to throughput and latency in AI infrastructure. This work not only contributes to efficiency improvements but also enables more nuanced and advanced applications of AI in real-world systems.

Authors (8)
  1. Amey Agrawal
  2. Nitin Kedia
  3. Ashish Panwar
  4. Jayashree Mohan
  5. Nipun Kwatra
  6. Bhargav S. Gulavani
  7. Alexey Tumanov
  8. Ramachandran Ramjee