
Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving (2405.06856v1)

Published 11 May 2024 in cs.DC

Abstract: The demand for LLM inference is coming to dominate artificial intelligence workloads, so there is an urgent need for cost-efficient inference serving. Existing work focuses on single-worker optimization and lacks cluster-level management of both inference queries and computing resources. However, placing requests and managing resources without considering query features easily causes SLO violations or resource underutilization, forcing providers to allocate extra computing resources to guarantee user experience and incurring additional serving cost. In this paper, we introduce Aladdin, a scheduler that co-adaptively places queries and scales computing resources with SLO awareness. For a stream of inference queries, Aladdin first predicts the minimal computing resources and the corresponding serving-worker configuration required to fulfill the SLOs for all queries. It then places the queries on each serving worker according to the prefill and decode latency models of batched LLM inference to maximize each worker's utilization. Results show that Aladdin reduces the serving cost of a single model by up to 71% for the same SLO level compared with the baselines, which can amount to millions of dollars per year.
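
To make the two-stage idea in the abstract concrete, here is a minimal sketch of SLO-aware placement driven by simple latency models. It is only illustrative: the linear latency models and their coefficients (PREFILL_MS_PER_TOKEN, DECODE_MS_BASE, DECODE_MS_PER_REQ), the TTFT/TPOT-style SLO parameters, and the greedy first-fit placement are assumptions for this sketch, not the paper's actual models or algorithm.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical linear latency models for batched LLM inference:
# prefill latency grows with the total prompt tokens in a batch, and
# per-token decode latency grows with the batch size. The coefficients
# below are illustrative placeholders, not measurements from the paper.
PREFILL_MS_PER_TOKEN = 0.5
DECODE_MS_BASE = 8.0
DECODE_MS_PER_REQ = 1.5

@dataclass
class Query:
    prompt_tokens: int

@dataclass
class Worker:
    queries: List[Query] = field(default_factory=list)

    def prefill_latency_ms(self) -> float:
        total_prompt = sum(q.prompt_tokens for q in self.queries)
        return PREFILL_MS_PER_TOKEN * total_prompt

    def decode_latency_ms_per_token(self) -> float:
        return DECODE_MS_BASE + DECODE_MS_PER_REQ * len(self.queries)

def fits(worker: Worker, q: Query, ttft_slo_ms: float, tpot_slo_ms: float) -> bool:
    """Check whether adding q keeps the worker's predicted latencies within the SLOs."""
    trial = Worker(worker.queries + [q])
    return (trial.prefill_latency_ms() <= ttft_slo_ms
            and trial.decode_latency_ms_per_token() <= tpot_slo_ms)

def place(queries: List[Query], ttft_slo_ms: float, tpot_slo_ms: float) -> List[Worker]:
    """Greedy first-fit placement: reuse an existing worker when the latency
    models predict the SLOs still hold, otherwise scale out by one worker.
    The number of workers returned is the predicted minimum for this stream."""
    workers: List[Worker] = []
    for q in queries:
        for w in workers:
            if fits(w, q, ttft_slo_ms, tpot_slo_ms):
                w.queries.append(q)
                break
        else:
            workers.append(Worker([q]))
    return workers

if __name__ == "__main__":
    stream = [Query(prompt_tokens=800) for _ in range(10)]
    workers = place(stream, ttft_slo_ms=1000.0, tpot_slo_ms=20.0)
    print(f"{len(workers)} workers needed for {len(stream)} queries")
```

In this toy version, the SLO check plays the role of the paper's latency models: placement packs queries onto as few workers as the predicted prefill and decode latencies allow, which is how tighter SLOs translate into more provisioned workers.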

Authors (3)
  1. Chengyi Nie (3 papers)
  2. Rodrigo Fonseca (23 papers)
  3. Zhenhua Liu (47 papers)
Citations (1)