
Splitwise: Efficient generative LLM inference using phase splitting (2311.18677v2)

Published 30 Nov 2023 in cs.AR and cs.DC

Abstract: Recent innovations in generative LLMs have made their applications and use-cases ubiquitous. This has led to large-scale deployments of these models, using complex, expensive, and power-hungry AI accelerators, most commonly GPUs. These developments make LLM inference efficiency an important challenge. Based on our extensive characterization, we find that there are two main phases during an LLM inference request: a compute-intensive prompt computation, and a memory-intensive token generation, each with distinct latency, throughput, memory, and power characteristics. Despite state-of-the-art batching and scheduling, the token generation phase underutilizes compute resources. Specifically, unlike compute-intensive prompt computation phases, token generation phases do not require the compute capability of the latest GPUs, and can be run with lower power and cost. With Splitwise, we propose splitting the two phases of an LLM inference request onto separate machines. This allows us to use hardware that is well-suited for each phase, and provision resources independently per phase. However, splitting an inference request across machines requires state transfer from the machine running prompt computation over to the machine generating tokens. We implement and optimize this state transfer using the fast back-plane interconnects available in today's GPU clusters. We use the Splitwise technique to design LLM inference clusters using the same or different types of machines for the prompt computation and token generation phases. Our clusters are optimized for three key objectives: throughput, cost, and power. In particular, we show that we can achieve 1.4x higher throughput at 20% lower cost than current designs. Alternatively, we can achieve 2.35x more throughput with the same cost and power budgets.

Introduction

Generative LLMs have rapidly become a foundational part of modern artificial intelligence research and deployment. Serving these models effectively and efficiently at inference time is a challenge that has garnered significant attention, particularly given its implications for computational resources. The paper "Splitwise: Efficient Generative LLM Inference Using Phase Splitting" presents a novel approach to these challenges: a technique called Splitwise that optimizes LLM inference clusters for throughput, cost, and power objectives.

Phase Splitting in LLM Inference

One of the core insights of the paper is the identification of two distinct phases during LLM inference: prompt computation and token generation. Prompt computation is a compute-intensive phase requiring high floating-point throughput. Token generation, by contrast, is a memory-bound phase in which computation is serialized: each new token depends on the previously generated tokens and the cached context. Even with state-of-the-art batching and scheduling, serving systems underutilize compute resources during the token generation phase, and this is where Splitwise comes into play. It proposes running prompt computation and token generation on separate machines, allowing each phase to use hardware that is well-suited to its computational profile. This separation requires efficient transfer of state, specifically the model's key-value (KV) cache, from the prompt machine to the token machine, which the authors optimize using the high-speed networking available in GPU clusters.
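To make the phase split concrete, the following is a minimal, self-contained Python sketch of the request flow: a prompt machine runs the compute-heavy prefill and produces the KV cache, the cache is handed off, and a token machine continues the memory-bound decoding. All class and function names here are hypothetical illustrations; the paper's actual implementation transfers the KV cache between GPU machines over fast back-plane interconnects, which this single-process example only imitates.

```python
# Sketch of Splitwise-style phase splitting. Names are hypothetical and the
# "transfer" is an in-process copy standing in for the cluster interconnect.

from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Per-request key/value cache produced during prompt computation."""
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)


class PromptMachine:
    """Compute-intensive phase: processes the full prompt, builds the KV cache."""

    def compute_prompt(self, prompt_tokens):
        cache = KVCache()
        for tok in prompt_tokens:
            # Placeholder for the per-layer attention key/value projections.
            cache.keys.append(("k", tok))
            cache.values.append(("v", tok))
        first_token = f"<gen-after-{prompt_tokens[-1]}>"
        return first_token, cache


class TokenMachine:
    """Memory-bound phase: generates tokens one at a time using the cache."""

    def generate(self, cache: KVCache, first_token, max_new_tokens=4):
        out = [first_token]
        for i in range(max_new_tokens - 1):
            # Each decode step appends to the KV cache and reads all prior entries.
            cache.keys.append(("k", out[-1]))
            cache.values.append(("v", out[-1]))
            out.append(f"<tok-{i}>")
        return out


def transfer_cache(cache: KVCache) -> KVCache:
    """Stand-in for the KV-cache state transfer over the fast back-plane."""
    return KVCache(keys=list(cache.keys), values=list(cache.values))


if __name__ == "__main__":
    prompt = ["How", "does", "phase", "splitting", "work", "?"]
    first, kv = PromptMachine().compute_prompt(prompt)   # prompt phase
    kv_remote = transfer_cache(kv)                       # state handoff
    print(TokenMachine().generate(kv_remote, first))     # token generation phase
```

The point of the split is visible even in this toy: the prompt step touches every input token at once (compute-bound), while each generation step does little arithmetic but must read and grow the cache (memory-bound), so the two phases benefit from different hardware.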

Splitwise: Design and Optimization

The paper introduces LLM inference cluster designs built on Splitwise's phase-splitting technique, optimized for throughput, cost, and power consumption. Three key designs are proposed, covering homogeneous and heterogeneous clusters with different GPU combinations; in the heterogeneous case, the prompt phase runs on high-performance GPUs while the token generation phase runs on hardware optimized for memory bandwidth and capacity. Benchmark results indicate that the Splitwise-based designs can attain up to 1.4 times higher throughput at 20% lower cost compared to conventional designs, or alternatively a 2.35 times increase in throughput within the same cost and power envelope. These gains come from targeting the distinct requirements of each inference phase, pushing the efficiency boundaries of LLM deployment.

Cluster Provisioning and Scalability

The provisioning methodology behind Splitwise is outlined in detail, accommodating different models, workloads, and service level objectives (SLOs). The paper examines numerous cluster configurations, ensuring SLO compliance at the measured latency percentiles for end-to-end latency, time to first token (TTFT), and time between tokens (TBT). Splitwise's design is shown to be adaptable to the workload and robust to model variations and load fluctuations, suggesting significant potential for real-world applicability. The discussion and related work further point to opportunities for innovation both in hardware tailored to the prompt and token phases and in scheduling strategies for heterogeneous platforms.
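To illustrate the flavor of such phase-aware provisioning, the sketch below exhaustively searches over counts of prompt and token machines to maximize throughput under cost and power budgets. The per-machine figures and budgets are hypothetical placeholders, not numbers from the paper, and the paper's provisioning framework additionally accounts for the latency SLOs (TTFT, TBT, end-to-end) discussed above.

```python
# Toy provisioning search for a split cluster. All machine specs and budgets
# below are made-up placeholders used only to show the shape of the search.

from itertools import product

# Hypothetical per-machine characteristics (requests/s, cost units, power units).
PROMPT_MACHINE = {"rps": 10.0, "cost": 5.0, "power": 8.0}  # compute-optimized
TOKEN_MACHINE = {"rps": 6.0, "cost": 3.0, "power": 4.0}    # memory-optimized

COST_BUDGET = 100.0
POWER_BUDGET = 120.0


def cluster_throughput(n_prompt: int, n_token: int) -> float:
    """End-to-end throughput is capped by the slower of the two machine pools."""
    return min(n_prompt * PROMPT_MACHINE["rps"], n_token * TOKEN_MACHINE["rps"])


def provision(max_machines: int = 40):
    """Pick machine counts per phase that maximize throughput within budgets."""
    best = (0.0, 0, 0)
    for n_prompt, n_token in product(range(max_machines + 1), repeat=2):
        cost = n_prompt * PROMPT_MACHINE["cost"] + n_token * TOKEN_MACHINE["cost"]
        power = n_prompt * PROMPT_MACHINE["power"] + n_token * TOKEN_MACHINE["power"]
        if cost > COST_BUDGET or power > POWER_BUDGET:
            continue  # configuration violates a budget
        tp = cluster_throughput(n_prompt, n_token)
        if tp > best[0]:
            best = (tp, n_prompt, n_token)
    return best


if __name__ == "__main__":
    tp, n_p, n_t = provision()
    print(f"best: {tp:.1f} req/s with {n_p} prompt machines and {n_t} token machines")
```

The same skeleton can be re-run with throughput, cost, or power as the objective and the others as constraints, mirroring the three optimization targets the paper considers.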

In summary, "Splitwise: Efficient Generative LLM Inference Using Phase Splitting" offers a practical approach to optimizing the deployment of generative LLMs, achieving higher efficiency and throughput while balancing cost and power constraints. The technique and findings are likely to be valuable to the AI community as it moves toward more scalable and efficient use of LLMs across applications.

Authors (7)
  1. Pratyush Patel (8 papers)
  2. Esha Choukse (15 papers)
  3. Chaojie Zhang (28 papers)
  4. Íñigo Goiri (18 papers)
  5. Aashaka Shah (7 papers)
  6. Saeed Maleki (19 papers)
  7. Ricardo Bianchini (13 papers)
Citations (98)