Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction (2404.08509v2)

Published 12 Apr 2024 in cs.DC, cs.CL, and cs.LG

Abstract: LLMs have been driving a new wave of interactive AI applications across numerous domains. However, efficiently serving LLM inference requests is challenging due to their unpredictable execution times originating from the autoregressive nature of generative models. Existing LLM serving systems exploit first-come-first-serve (FCFS) scheduling, suffering from head-of-line blocking issues. To address the non-deterministic nature of LLMs and enable efficient interactive LLM serving, we present a speculative shortest-job-first (SSJF) scheduler that uses a light proxy model to predict LLM output sequence lengths. Our open-source SSJF implementation does not require changes to memory management or batching strategies. Evaluations on real-world datasets and production workload traces show that SSJF reduces average job completion times by 30.5-39.6% and increases throughput by 2.2-3.6x compared to FCFS schedulers, across no batching, dynamic batching, and continuous batching settings.
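The core idea in the abstract lends itself to a short illustration: instead of dequeuing requests in arrival order (FCFS), dequeue the request with the smallest output-length prediction from a lightweight proxy model. The sketch below is only a minimal illustration of that queueing policy; the names (`predict_output_len`, `SSJFQueue`) and the toy word-count predictor are assumptions made for the example and do not reflect the authors' actual implementation, batching, or memory management.

```python
# Minimal sketch of the scheduling idea from the abstract: order the request
# queue by a proxy model's *predicted* output length (speculative
# shortest-job-first) rather than by arrival order (FCFS). The predictor
# below is a toy stand-in, not the paper's proxy model.
import heapq
import itertools


def predict_output_len(prompt: str) -> int:
    """Stand-in for a lightweight proxy model that estimates the output
    sequence length from the prompt; here a toy word-count heuristic."""
    return 8 * len(prompt.split())


class SSJFQueue:
    """Serve the request with the smallest predicted output length first,
    so short jobs are not blocked behind long-running generations."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._seq = itertools.count()  # tie-breaker: FCFS among equal predictions

    def submit(self, prompt: str) -> None:
        est = predict_output_len(prompt)
        heapq.heappush(self._heap, (est, next(self._seq), prompt))

    def next_request(self) -> tuple[str, int]:
        est, _, prompt = heapq.heappop(self._heap)
        return prompt, est

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    q = SSJFQueue()
    q.submit("Write a 2000-word essay on distributed systems scheduling.")
    q.submit("Translate 'hello' to French.")
    q.submit("Summarize this paragraph in one sentence: ...")
    while q:
        prompt, est = q.next_request()
        print(f"serving (predicted {est} tokens): {prompt[:45]}")
```

In a real serving system the prediction would come from a small learned model rather than a heuristic, and the scheduler would also handle mispredictions and batching; the sketch only shows the ordering policy that distinguishes SSJF from FCFS.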

Authors (10)
  1. Haoran Qiu (10 papers)
  2. Weichao Mao (11 papers)
  3. Archit Patke (7 papers)
  4. Shengkun Cui (6 papers)
  5. Saurabh Jha (55 papers)
  6. Chen Wang (599 papers)
  7. Hubertus Franke (15 papers)
  8. Zbigniew T. Kalbarczyk (12 papers)
  9. Tamer Başar (200 papers)
  10. Ravishankar K. Iyer (22 papers)
Citations (13)