DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference (arXiv:2401.08671v1)
Abstract: The deployment and scaling of LLMs have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, especially for workloads with long prompts. This paper introduces DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and generation composition strategy, to deliver up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency, compared to state-of-the-art systems like vLLM. We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced implementation supports a range of models and offers both non-persistent and persistent deployment options, catering to diverse user scenarios from interactive sessions to long-running applications. We present a detailed benchmarking methodology, analyze the performance through latency-throughput curves, and investigate scalability via load balancing. Our evaluations demonstrate substantial improvements in throughput and latency across various models and hardware configurations. We discuss our roadmap for future enhancements, including broader model support and new hardware backends. The DeepSpeed-FastGen code is readily available for community engagement and contribution.
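As a concrete picture of the Dynamic SplitFuse idea, the sketch below schedules each forward pass against a fixed token budget: long prompts are split into chunks that span several passes, while short prompts and single decode tokens from in-flight generations are fused into the leftover budget. This is a minimal illustration only, not the DeepSpeed-FastGen implementation; `Request`, `schedule_pass`, and `TOKEN_BUDGET` are hypothetical names and the budget value is assumed.

```python
# Illustrative sketch of Dynamic SplitFuse-style scheduling. All names
# and values are hypothetical; the real DeepSpeed-FastGen scheduler is
# more sophisticated.
from dataclasses import dataclass

TOKEN_BUDGET = 2048  # assumed fixed per-pass token budget


@dataclass
class Request:
    rid: int
    remaining_prompt: int      # prompt tokens not yet prefilled
    generating: bool = False   # True once the prompt is fully consumed


def schedule_pass(queue: list[Request]) -> list[tuple[int, int]]:
    """Pick (request id, token count) pairs for one forward pass."""
    batch: list[tuple[int, int]] = []
    budget = TOKEN_BUDGET

    # Each in-flight generation contributes exactly one decode token,
    # which keeps per-token generation latency consistent.
    for req in queue:
        if req.generating and budget > 0:
            batch.append((req.rid, 1))
            budget -= 1

    # Fill the remaining budget with prompt chunks: a long prompt is
    # split across several passes (the "split"), and short prompts ride
    # along with decode tokens in the same pass (the "fuse").
    for req in queue:
        if req.remaining_prompt > 0 and budget > 0:
            chunk = min(req.remaining_prompt, budget)
            batch.append((req.rid, chunk))
            req.remaining_prompt -= chunk
            req.generating = req.remaining_prompt == 0
            budget -= chunk

    return batch


if __name__ == "__main__":
    # One long prompt (5000 tokens) and one short prompt (30 tokens):
    # the long prompt is chunked over multiple passes, and the short
    # prompt is fused into the pass with leftover budget.
    queue = [Request(0, 5000), Request(1, 30)]
    while any(r.remaining_prompt for r in queue):
        print(schedule_pass(queue))
```

The non-persistent and persistent deployment options mentioned in the abstract map onto DeepSpeed-MII's two entry points, roughly as below. This follows the MII examples published around the FastGen release; the exact arguments and the example model are assumptions and may differ across MII versions.

```python
import mii

# Non-persistent: the pipeline lives only inside the current process,
# suited to interactive sessions and experiments.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
print(pipe(["DeepSpeed is"], max_new_tokens=64))

# Persistent: mii.serve launches a long-running deployment that clients
# can connect to from other processes until it is shut down explicitly.
client = mii.serve("mistralai/Mistral-7B-v0.1")
print(client.generate(["DeepSpeed is"], max_new_tokens=64))
client.terminate_server()
```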
Authors: Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He