Fairness in Serving Large Language Models (2401.00588v2)
Abstract: High-demand LLM inference services (e.g., ChatGPT and BARD) support a wide range of requests, from short chat conversations to long document reading. To ensure that all client requests are processed fairly, most major LLM inference services impose request rate limits so that no single client can dominate the request queue. However, this rudimentary notion of fairness also results in under-utilization of resources and poor client experience when there is spare capacity. While there is a rich literature on fair scheduling, serving LLMs presents new challenges due to their unpredictable request lengths and their unique batching characteristics on parallel accelerators. This paper introduces a definition of fairness for LLM serving based on a cost function that accounts for the number of input and output tokens processed. To achieve fairness in serving, we propose the Virtual Token Counter (VTC), a fair scheduler built on the continuous batching mechanism. We prove a tight 2x upper bound on the service difference between two backlogged clients while adhering to the work-conserving requirement. Through extensive experiments, we demonstrate the superior performance of VTC in ensuring fairness, in contrast to baseline methods that exhibit shortcomings under various conditions. The reproducible code is available at https://github.com/Ying1123/VTC-artifact
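To make the abstract's description concrete, below is a minimal sketch of a VTC-style scheduler: each client keeps a virtual token counter of weighted service received, the scheduler is work-conserving and always serves the backlogged client with the smallest counter, and a client that becomes active again has its counter lifted so it cannot bank unused service. This is an illustrative simplification based only on the abstract, not the released artifact; the class and method names, the input/output token weights, and the exact counter-lift rule are assumptions.

```python
# Sketch of a Virtual Token Counter (VTC)-style fair scheduler.
# Assumptions (not from the paper's artifact): class/method names, the
# default weights w_input/w_output, and the counter-lift rule on arrival.
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    client_id: str
    prompt_tokens: int  # output length is unknown ahead of time


class VTCScheduler:
    def __init__(self, w_input: float = 1.0, w_output: float = 2.0):
        # Relative cost of processing one input vs. one output token.
        self.w_input = w_input
        self.w_output = w_output
        self.counters = defaultdict(float)  # virtual token counter per client
        self.queues = defaultdict(deque)    # backlogged requests per client

    def arrive(self, req: Request) -> None:
        q = self.queues[req.client_id]
        if not q:
            # A client that rejoins should not profit from service it never
            # used: lift its counter to the minimum among backlogged clients.
            active = [self.counters[c] for c, q2 in self.queues.items() if q2]
            if active:
                self.counters[req.client_id] = max(
                    self.counters[req.client_id], min(active)
                )
        q.append(req)

    def pick_next(self) -> Optional[Request]:
        # Work-conserving: among clients with queued requests, admit a request
        # from the one that has received the least weighted service so far.
        backlogged = [c for c, q in self.queues.items() if q]
        if not backlogged:
            return None
        client = min(backlogged, key=lambda c: self.counters[c])
        req = self.queues[client].popleft()
        # Charge the prompt (input) tokens when the request is admitted.
        self.counters[client] += self.w_input * req.prompt_tokens
        return req

    def charge_output(self, client_id: str, new_tokens: int) -> None:
        # Called by the continuous-batching engine as tokens are decoded, so
        # long outputs keep accruing cost against the client's counter.
        self.counters[client_id] += self.w_output * new_tokens
```

In a continuous-batching loop, `pick_next` would be called whenever a slot frees up in the running batch, and `charge_output` after each decoding step, so fairness is tracked at token granularity rather than per request.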
Authors: Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica