
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference (2401.08671v1)

Published 9 Jan 2024 in cs.PF and cs.LG

Abstract: The deployment and scaling of LLMs have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, especially for workloads with long prompts. This paper introduces DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and generation composition strategy, to deliver up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency, compared to state-of-the-art systems like vLLM. We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced implementation supports a range of models and offers both non-persistent and persistent deployment options, catering to diverse user scenarios from interactive sessions to long-running applications. We present a detailed benchmarking methodology, analyze the performance through latency-throughput curves, and investigate scalability via load balancing. Our evaluations demonstrate substantial improvements in throughput and latency across various models and hardware configurations. We discuss our roadmap for future enhancements, including broader model support and new hardware backends. The DeepSpeed-FastGen code is readily available for community engagement and contribution.

References (13)
  1. OpenAI. GPT-4. https://openai.com/gpt-4, 2023.
  2. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  3. PyTorch: An imperative style, high-performance deep learning library, 2019.
  4. Efficient memory management for large language model serving with PagedAttention. arXiv preprint arXiv:2309.06180, 2023.
  5. Orca: A distributed serving system for Transformer-based generative models. 2022.
  6. MosaicML. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs. https://www.mosaicml.com/blog/mpt-7b, 2023.
  7. DeepSpeed Ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023.
  8. HuggingFace. Text Generation Inference. https://huggingface.co/text-generation-inference.
  9. NVIDIA. NVIDIA TensorRT-LLM: A TensorRT toolbox for large language models. https://github.com/NVIDIA/TensorRT-LLM.
  10. SARATHI: Efficient LLM inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369, 2023.
  11. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.
  12. Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. 2023.
  13. NVIDIA. FasterTransformer. https://github.com/NVIDIA/FasterTransformer.
Authors (11)
  1. Connor Holmes (20 papers)
  2. Masahiro Tanaka (39 papers)
  3. Michael Wyatt (6 papers)
  4. Ammar Ahmad Awan (15 papers)
  5. Jeff Rasley (10 papers)
  6. Samyam Rajbhandari (21 papers)
  7. Reza Yazdani Aminabadi (10 papers)
  8. Heyang Qin (6 papers)
  9. Arash Bakhtiari (5 papers)
  10. Lev Kurilenko (4 papers)
  11. Yuxiong He (59 papers)
Citations (31)

Summary

Introduction

Efficient deployment and scaling of LLMs has become imperative as they are integrated into applications that demand both high throughput and low latency. While training frameworks achieve good hardware utilization, inference remains a bottleneck, particularly for workloads with long prompts. Existing systems such as vLLM, built around innovations like PagedAttention, have improved LLM inference performance, yet still struggle to provide consistent quality of service for long-prompt workloads, an increasingly important case as model context windows grow.

Advancements in LLM Serving

LLM serving frameworks must reconcile two phases with very different characteristics: prompt processing, which consumes many tokens in a single forward pass, and token generation, which produces one token per pass. Treating the phases in isolation lets long prompts stall in-flight generation and risks breaching service-level agreements. Continuous batching improves GPU utilization by admitting and retiring requests at iteration boundaries rather than waiting for a whole static batch to finish, though earlier schemes still suffered from input padding and batching delays. vLLM's blocked KV caching (PagedAttention) and Orca's iteration-level scheduling mark significant strides against these hurdles, yet remain incomplete solutions to the serving problem.
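To make the contrast with static batching concrete, here is a toy, self-contained sketch of iteration-level (continuous) batching under simplified assumptions: each request only tracks how many tokens it still needs, and one "forward pass" advances every running request by one token. The names (`Req`, `serve_loop`) are hypothetical and not tied to any real serving framework.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Req:
    req_id: int
    remaining_tokens: int   # tokens still to generate for this request

def serve_loop(waiting: deque, max_batch_size: int) -> None:
    """Admit and retire requests at every iteration boundary (continuous
    batching) instead of waiting for an entire static batch to finish."""
    running = []
    while waiting or running:
        # Fill free slots immediately; no padding to a fixed batch shape.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One forward pass generates one token for every running request.
        for r in running:
            r.remaining_tokens -= 1
        # Retire finished requests right away so queued ones can join next pass.
        done = [r.req_id for r in running if r.remaining_tokens == 0]
        running = [r for r in running if r.remaining_tokens > 0]
        if done:
            print("finished:", done)

serve_loop(deque([Req(0, 3), Req(1, 1), Req(2, 5), Req(3, 2)]), max_batch_size=2)
```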

DeepSpeed-FastGen's Dynamic SplitFuse

DeepSpeed-FastGen introduces Dynamic SplitFuse, a strategy that dynamically decomposes and composes prompt and generation tokens to improve serving throughput and latency. The technique keeps the system at high occupancy and responsiveness, which is crucial for data-center deployments of LLMs. Its design follows from three performance questions: which factors determine the cost of an LLM forward pass, how throughput responds to the number of tokens in a pass, and how tokens should be scheduled across passes. By splitting long prompts into chunks processed over several forward passes (generation starts only after the final chunk) and fusing short prompts with generation tokens to fill a fixed token budget, Dynamic SplitFuse keeps every forward pass in a high-throughput regime while reducing both latency and its variance.
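The scheduling idea can be illustrated with a minimal sketch. It assumes a fixed per-pass token budget, one token slot per decode-phase request, and prompt chunks filling whatever budget remains; `Request`, `build_forward_batch`, and `TOKEN_BUDGET` are hypothetical names, not part of the DeepSpeed-FastGen code.

```python
from dataclasses import dataclass

TOKEN_BUDGET = 512  # fixed number of tokens composed into each forward pass

@dataclass
class Request:
    req_id: int
    prompt_tokens: int        # prompt tokens not yet processed (prefill)
    in_decode: bool = False   # True once the whole prompt has been consumed

def build_forward_batch(requests):
    """Compose one forward pass of at most TOKEN_BUDGET tokens: every
    decode-phase request contributes one token, and the remaining budget
    is filled with (possibly partial) chunks of pending prompts."""
    batch, budget = [], TOKEN_BUDGET

    # 1) In-flight generations each get exactly one token slot.
    for r in requests:
        if r.in_decode and budget > 0:
            batch.append((r.req_id, 1))
            budget -= 1

    # 2) Split long prompts / fuse short ones to fill the rest of the budget.
    #    (A real scheduler would also balance fairness across waiting prompts.)
    for r in requests:
        if budget == 0:
            break
        if not r.in_decode and r.prompt_tokens > 0:
            chunk = min(r.prompt_tokens, budget)
            batch.append((r.req_id, chunk))
            r.prompt_tokens -= chunk
            budget -= chunk
            if r.prompt_tokens == 0:
                r.in_decode = True  # generation begins after the final chunk

    return batch  # list of (request_id, tokens_in_this_forward_pass)

# Example: one long prompt, one short prompt, one request already generating.
reqs = [Request(0, 1500), Request(1, 40), Request(2, 0, in_decode=True)]
print(build_forward_batch(reqs))   # [(2, 1), (0, 511)]
```

Because every pass is packed to the same token budget, long prompts no longer monopolize a forward pass, and ongoing generations keep receiving a token every iteration, which is what reduces token-level latency variance.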

Evaluation and System Performance

Comprehensive benchmarks against state-of-the-art systems such as vLLM show up to 2.3x higher effective throughput, roughly 2x lower latency on average, and up to 3.7x lower token-level tail latency. The evaluation quantifies these gains through latency-throughput curves and an effective-throughput metric (requests served per second while meeting latency targets), and the system's built-in load balancing enables scaling across multiple replicas and nodes. Implementation-wise, DeepSpeed-FastGen combines DeepSpeed-MII and DeepSpeed-Inference into an easy-to-use serving system with detailed instructions, advanced installation options, and both non-persistent and persistent deployment modes.
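A usage sketch of the two deployment modes, following the DeepSpeed-MII examples published with the project: the model name is an arbitrary placeholder, argument names may differ across MII versions, and in practice you would pick one mode rather than run both in the same process.

```python
import mii

# Non-persistent deployment: the pipeline lives only inside this process,
# which is convenient for interactive sessions and quick experiments.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
print(pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=64))

# Persistent deployment: a long-running server that clients connect to,
# suited to production services; terminate it explicitly when done.
client = mii.serve("mistralai/Mistral-7B-v0.1")
print(client.generate("DeepSpeed is", max_new_tokens=64))
client.terminate_server()
```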

Conclusion

In summary, DeepSpeed-FastGen advances LLM serving by pairing Dynamic SplitFuse scheduling with the DeepSpeed-MII and DeepSpeed-Inference stack, delivering lower latency, more consistent token-level responsiveness, and higher effective throughput than prior systems. Together with the roadmap for broader model support and new hardware backends, these improvements position it as a practical foundation for deploying LLM services across industry and research.
