SpotServe: Serving Generative Large Language Models on Preemptible Instances (2311.15566v1)

Published 27 Nov 2023 in cs.DC, cs.CL, and cs.LG

Abstract: The high computational and memory requirements of generative LLMs make it challenging to serve them cheaply. This paper aims to reduce the monetary cost for serving LLMs by leveraging preemptible GPU instances on modern clouds, which offer access to spare GPUs at a much cheaper price than regular instances but may be preempted by the cloud at any time. Serving LLMs on preemptible instances requires addressing challenges induced by frequent instance preemptions and the necessity of migrating instances to handle these preemptions. This paper presents SpotServe, the first distributed LLM serving system on preemptible instances. Several key techniques in SpotServe realize fast and reliable serving of generative LLMs on cheap preemptible instances. First, SpotServe dynamically adapts the LLM parallelization configuration for dynamic instance availability and fluctuating workload, while balancing the trade-off among the overall throughput, inference latency and monetary costs. Second, to minimize the cost of migrating instances for dynamic reparallelization, the task of migrating instances is formulated as a bipartite graph matching problem, which uses the Kuhn-Munkres algorithm to identify an optimal migration plan that minimizes communications. Finally, to take advantage of the grace period offered by modern clouds, we introduce stateful inference recovery, a new inference mechanism that commits inference progress at a much finer granularity and allows SpotServe to cheaply resume inference upon preemption. We evaluate on real spot instance preemption traces and various popular LLMs and show that SpotServe can reduce the P99 tail latency by 2.4 - 9.1x compared with the best existing LLM serving systems. We also show that SpotServe can leverage the price advantage of preemptive instances, saving 54% monetary cost compared with only using on-demand instances.

Serving Generative LLMs on Preemptible Instances: An Examination of SpotServe

This essay provides an expert overview of the paper "SpotServe: Serving Generative LLMs on Preemptible Instances." The paper presents SpotServe, a pioneering system designed to address the computational and cost challenges of serving generative LLMs using preemptible GPU instances on cloud platforms.

Context and Challenges

Generative LLMs, such as GPT-4 and ChatGPT, have gained prominence due to their advanced language understanding and generation capabilities. However, their substantial computational requirements make deployment costly, especially for organizations with budget constraints. The paper targets this cost by serving LLMs on preemptible GPU instances, which are offered at a much lower price than on-demand instances but can be preempted by the cloud provider at any time, usually with a brief grace period.

SpotServe is introduced as the first system to serve distributed generative LLMs on preemptible instances, addressing three primary challenges:

  1. Dynamic Reparallelization: As instance availability and workload change, SpotServe dynamically adjusts its parallelization configuration to keep throughput and inference latency near optimal while balancing monetary cost.
  2. Instance Migration: Each reparallelization requires migrating model state between instances, so migration overhead must be kept small; SpotServe formulates migration as a bipartite graph matching problem and applies the Kuhn-Munkres algorithm to find a plan that minimizes communication cost (see the matching sketch after this list).
  3. Stateful Inference Recovery during the Grace Period: Exploiting the autoregressive nature of LLMs, SpotServe commits inference progress at token granularity before a preemption takes effect, so inference can resume on the new configuration without recomputing previously generated tokens.
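To make the bipartite-matching step concrete, the minimal sketch below (an illustration, not SpotServe's actual implementation) builds a hypothetical transfer-cost matrix between surviving instances and positions in the new parallel configuration, then solves the assignment with the Kuhn-Munkres (Hungarian) algorithm via SciPy's `linear_sum_assignment`. The cost model, the `reuse` matrix, and all numbers are assumptions made for the example.

```python
# Illustrative sketch (not SpotServe's actual implementation) of casting
# instance migration as minimum-cost bipartite matching and solving it with
# the Kuhn-Munkres (Hungarian) algorithm via SciPy's linear_sum_assignment.
# The cost model and the `reuse` matrix below are hypothetical.
import numpy as np
from scipy.optimize import linear_sum_assignment

def plan_migration(new_state_bytes, reuse):
    """Map surviving instances (rows) to positions in the new parallel
    configuration (columns), minimizing bytes that must be transferred.

    new_state_bytes[j]: data the new position j must hold (params + KV cache).
    reuse[i][j]: bytes already resident on old instance i that position j
                 can keep in place if instance i is assigned to it.
    """
    n_old, n_new = len(reuse), len(new_state_bytes)
    cost = np.zeros((n_old, n_new))
    for i in range(n_old):
        for j in range(n_new):
            cost[i, j] = max(new_state_bytes[j] - reuse[i][j], 0)
    rows, cols = linear_sum_assignment(cost)      # Kuhn-Munkres assignment
    plan = {int(i): int(j) for i, j in zip(rows, cols)}
    return plan, cost[rows, cols].sum()

# Toy example: 3 surviving instances, 3 positions in the new configuration.
new_need = [10, 10, 10]                           # GB each position must hold
reuse = [[8, 2, 0], [1, 9, 1], [0, 3, 7]]         # GB reusable if i -> j
plan, moved = plan_migration(new_need, reuse)
print(plan, moved)                                # {0: 0, 1: 1, 2: 2} 6.0
```

The same formulation extends to non-square cases (fewer surviving instances than positions, or vice versa), since the assignment solver simply matches as many rows and columns as possible.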

Technical Contributions

SpotServe offers several innovative approaches:

  • Parallelization Controller: An adaptive configuration optimizer selects how to combine data, tensor, and pipeline parallelism in response to instance availability and workload fluctuations, trading off system throughput, latency, and monetary cost (see the configuration-search sketch after this list).
  • Efficient Context Migration: SpotServe reduces migration overhead by opportunistically reusing model parameters and inference state already resident on surviving instances; bipartite graph matching yields a device mapping that minimizes the data transferred during context migration.
  • Interruption Handling: An interruption arranger manages when inference is suspended and resumed as instances are preempted or acquired, using just-in-time scheduling to complete as much in-flight inference as possible within the grace period.
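As a rough illustration of what such a configuration search might look like, the sketch below (a simplification under assumed constants, not the paper's actual optimizer) enumerates (data, tensor, pipeline) parallel degrees that fit the currently available GPUs, scores each with a toy analytic cost model, and keeps the cheapest configuration that still meets a latency SLO and request rate. Every constant in the cost model is a hypothetical placeholder.

```python
# Minimal illustrative sketch of an adaptive configuration optimizer
# (a simplification, not SpotServe's actual algorithm). It enumerates
# (data, tensor, pipeline) parallel degrees that fit the available GPUs,
# scores each with a toy analytic cost model, and keeps the cheapest
# configuration that still satisfies a latency SLO and request rate.
from itertools import product

def candidate_configs(n_gpus):
    """Yield (dp, tp, pp) triples whose product fits on n_gpus."""
    for dp, tp, pp in product(range(1, n_gpus + 1), repeat=3):
        if dp * tp * pp <= n_gpus:
            yield dp, tp, pp

def estimate(dp, tp, pp, model_flops=1e12, gpu_flops=1e14, comm_penalty=0.002):
    """Toy cost model: compute time shrinks as the model is split across
    tp * pp GPUs, communication grows with the number of partitions, and
    throughput scales with the number of data-parallel replicas."""
    compute = model_flops / (gpu_flops * tp * pp)
    comm = comm_penalty * ((tp - 1) + (pp - 1))
    latency = compute + comm            # seconds per request (hypothetical)
    throughput = dp / latency           # requests per second
    return latency, throughput

def pick_config(n_gpus, latency_slo=0.05, request_rate=8.0):
    best = None
    for dp, tp, pp in candidate_configs(n_gpus):
        latency, throughput = estimate(dp, tp, pp)
        if latency > latency_slo or throughput < request_rate:
            continue                    # violates the SLO or demand
        gpus_used = dp * tp * pp
        if best is None or gpus_used < best[0]:
            best = (gpus_used, (dp, tp, pp), latency)
    return best

# Example: choose a configuration for 8 currently available GPUs.
print(pick_config(n_gpus=8))
```

In practice the objective and cost model would be far richer (memory limits per GPU, batching effects, migration cost from the current configuration), but the basic structure of enumerating feasible parallelism layouts and scoring them is the same.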

Results and Implications

The evaluation, driven by real spot-instance preemption traces and several popular LLMs, shows that SpotServe significantly outperforms existing systems, reducing P99 tail latency by 2.4x to 9.1x and cutting monetary cost by 54% compared with using only on-demand instances. These results suggest that SpotServe offers both practical and conceptual advances in the efficient deployment of large-scale LLMs.

The SpotServe framework introduces a novel paradigm for leveraging preemptible cloud resources for high-performance ML workloads, suggesting potential applications beyond LLMs to other domains demanding cost-effective distributed computation.

Future Research Directions

The deployment of SpotServe points to new avenues for research, including integrating heterogeneous resource types, exploring broader parallelization configurations, and optimizing dynamically for objectives beyond latency minimization under varying workloads. As cloud providers expand their offerings around preemptible resources, SpotServe may also serve as a foundational architecture for future systems.

By addressing critical challenges in LLM serving on preemptible instances, SpotServe sets a precedent for future innovation in distributed AI systems, revealing opportunities for maximizing cost efficiency while maintaining computational performance.

Authors (7)
  1. Xupeng Miao
  2. Chunan Shi
  3. Jiangfei Duan
  4. Xiaoli Xi
  5. Dahua Lin
  6. Bin Cui
  7. Zhihao Jia