
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference (2401.08671v1)

Published 9 Jan 2024 in cs.PF and cs.LG

Abstract: The deployment and scaling of LLMs have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, especially for workloads with long prompts. This paper introduces DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and generation composition strategy, to deliver up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency, compared to state-of-the-art systems like vLLM. We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced implementation supports a range of models and offers both non-persistent and persistent deployment options, catering to diverse user scenarios from interactive sessions to long-running applications. We present a detailed benchmarking methodology, analyze the performance through latency-throughput curves, and investigate scalability via load balancing. Our evaluations demonstrate substantial improvements in throughput and latency across various models and hardware configurations. We discuss our roadmap for future enhancements, including broader model support and new hardware backends. The DeepSpeed-FastGen code is readily available for community engagement and contribution.

References (13)
  1. OpenAI. GPT-4. https://openai.com/gpt-4, 2023.
  2. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  3. PyTorch: An imperative style, high-performance deep learning library, 2019.
  4. Efficient memory management for large language model serving with PagedAttention. arXiv preprint arXiv:2309.06180, 2023.
  5. Orca: A distributed serving system for Transformer-based generative models. 2022.
  6. MosaicML. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs. https://www.mosaicml.com/blog/mpt-7b, 2023.
  7. DeepSpeed Ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023.
  8. HuggingFace. Text Generation Inference. https://huggingface.co/text-generation-inference.
  9. NVIDIA. NVIDIA TensorRT-LLM: A TensorRT toolbox for large language models. https://github.com/NVIDIA/TensorRT-LLM.
  10. SARATHI: Efficient LLM inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369, 2023.
  11. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.
  12. Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. 2023.
  13. NVIDIA. FasterTransformer. https://github.com/NVIDIA/FasterTransformer.
Authors (11)
  1. Connor Holmes (20 papers)
  2. Masahiro Tanaka (39 papers)
  3. Michael Wyatt (6 papers)
  4. Ammar Ahmad Awan (15 papers)
  5. Jeff Rasley (10 papers)
  6. Samyam Rajbhandari (21 papers)
  7. Reza Yazdani Aminabadi (10 papers)
  8. Heyang Qin (6 papers)
  9. Arash Bakhtiari (5 papers)
  10. Lev Kurilenko (4 papers)
  11. Yuxiong He (59 papers)
Citations (31)

Summary

Introduction

Efficient deployment and scaling of LLMs has become imperative as they are integrated into applications that demand both high throughput and low latency. While training frameworks achieve good hardware utilization, inference remains a bottleneck, particularly for workloads with long prompts. Existing systems such as vLLM, built around innovations like PagedAttention, have improved LLM inference performance, yet still struggle to provide consistent quality of service for long-prompt workloads, an increasingly important case as model context windows grow.

Advancements in LLM Serving

LLM serving frameworks must reconcile two phases with very different characteristics: prompt processing, which consumes many tokens in a single forward pass, and token generation, which produces one token per pass. Treating the phases in isolation lets long prompts stall in-flight generation and risks breaching service-level agreements. Continuous batching improves GPU utilization by admitting and retiring requests at iteration boundaries rather than waiting for a whole static batch to finish, though earlier schemes still suffered from input padding and batching delays. vLLM's blocked KV caching (PagedAttention) and Orca's iteration-level scheduling mark significant strides against these hurdles, yet remain incomplete solutions to the serving problem.
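To make the contrast with static batching concrete, here is a toy, self-contained sketch of iteration-level (continuous) batching under simplified assumptions: each request only tracks how many tokens it still needs, and one "forward pass" advances every running request by one token. The names (`Req`, `serve_loop`) are hypothetical and not tied to any real serving framework.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Req:
    req_id: int
    remaining_tokens: int   # tokens still to generate for this request

def serve_loop(waiting: deque, max_batch_size: int) -> None:
    """Admit and retire requests at every iteration boundary (continuous
    batching) instead of waiting for an entire static batch to finish."""
    running = []
    while waiting or running:
        # Fill free slots immediately; no padding to a fixed batch shape.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One forward pass generates one token for every running request.
        for r in running:
            r.remaining_tokens -= 1
        # Retire finished requests right away so queued ones can join next pass.
        done = [r.req_id for r in running if r.remaining_tokens == 0]
        running = [r for r in running if r.remaining_tokens > 0]
        if done:
            print("finished:", done)

serve_loop(deque([Req(0, 3), Req(1, 1), Req(2, 5), Req(3, 2)]), max_batch_size=2)
```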

DeepSpeed-FastGen's Dynamic SplitFuse

DeepSpeed-FastGen introduces Dynamic SplitFuse, a strategy that dynamically decomposes and composes prompt and generation tokens to improve serving throughput and latency. The technique keeps the system at high occupancy and responsiveness, which is crucial for data-center deployments of LLMs. Its design follows from three performance questions: which factors determine the cost of an LLM forward pass, how throughput responds to the number of tokens in a pass, and how tokens should be scheduled across passes. By splitting long prompts into chunks processed over several forward passes (generation starts only after the final chunk) and fusing short prompts with generation tokens to fill a fixed token budget, Dynamic SplitFuse keeps every forward pass in a high-throughput regime while reducing both latency and its variance.
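The scheduling idea can be illustrated with a minimal sketch. It assumes a fixed per-pass token budget, one token slot per decode-phase request, and prompt chunks filling whatever budget remains; `Request`, `build_forward_batch`, and `TOKEN_BUDGET` are hypothetical names, not part of the DeepSpeed-FastGen code.

```python
from dataclasses import dataclass

TOKEN_BUDGET = 512  # fixed number of tokens composed into each forward pass

@dataclass
class Request:
    req_id: int
    prompt_tokens: int        # prompt tokens not yet processed (prefill)
    in_decode: bool = False   # True once the whole prompt has been consumed

def build_forward_batch(requests):
    """Compose one forward pass of at most TOKEN_BUDGET tokens: every
    decode-phase request contributes one token, and the remaining budget
    is filled with (possibly partial) chunks of pending prompts."""
    batch, budget = [], TOKEN_BUDGET

    # 1) In-flight generations each get exactly one token slot.
    for r in requests:
        if r.in_decode and budget > 0:
            batch.append((r.req_id, 1))
            budget -= 1

    # 2) Split long prompts / fuse short ones to fill the rest of the budget.
    #    (A real scheduler would also balance fairness across waiting prompts.)
    for r in requests:
        if budget == 0:
            break
        if not r.in_decode and r.prompt_tokens > 0:
            chunk = min(r.prompt_tokens, budget)
            batch.append((r.req_id, chunk))
            r.prompt_tokens -= chunk
            budget -= chunk
            if r.prompt_tokens == 0:
                r.in_decode = True  # generation begins after the final chunk

    return batch  # list of (request_id, tokens_in_this_forward_pass)

# Example: one long prompt, one short prompt, one request already generating.
reqs = [Request(0, 1500), Request(1, 40), Request(2, 0, in_decode=True)]
print(build_forward_batch(reqs))   # [(2, 1), (0, 511)]
```

Because every pass is packed to the same token budget, long prompts no longer monopolize a forward pass, and ongoing generations keep receiving a token every iteration, which is what reduces token-level latency variance.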

Evaluation and System Performance

Comprehensive benchmarks against state-of-the-art systems such as vLLM show up to 2.3x higher effective throughput, roughly 2x lower latency on average, and up to 3.7x lower token-level tail latency. The evaluation quantifies these gains through latency-throughput curves and an effective-throughput metric (requests served per second while meeting latency targets), and the system's built-in load balancing enables scaling across multiple replicas and nodes. Implementation-wise, DeepSpeed-FastGen combines DeepSpeed-MII and DeepSpeed-Inference into an easy-to-use serving system with detailed instructions, advanced installation options, and both non-persistent and persistent deployment modes.
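A usage sketch of the two deployment modes, following the DeepSpeed-MII examples published with the project: the model name is an arbitrary placeholder, argument names may differ across MII versions, and in practice you would pick one mode rather than run both in the same process.

```python
import mii

# Non-persistent deployment: the pipeline lives only inside this process,
# which is convenient for interactive sessions and quick experiments.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
print(pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=64))

# Persistent deployment: a long-running server that clients connect to,
# suited to production services; terminate it explicitly when done.
client = mii.serve("mistralai/Mistral-7B-v0.1")
print(client.generate("DeepSpeed is", max_new_tokens=64))
client.terminate_server()
```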

Conclusion

In summary, DeepSpeed-FastGen advances LLM serving by pairing Dynamic SplitFuse scheduling with the DeepSpeed-MII and DeepSpeed-Inference stack, delivering lower latency, more consistent token-level responsiveness, and higher effective throughput than prior systems. Together with the roadmap for broader model support and new hardware backends, these improvements position it as a practical foundation for deploying LLM services across industry and research.
