ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency (2408.00008v2)

Published 23 Jul 2024 in cs.DC and cs.LG

Abstract: LLMs have surged in popularity and are extensively used in commercial applications, where the efficiency of model serving is crucial for the user experience. Most current research focuses on optimizing individual sub-procedures, e.g., local inference and communication; however, there is no comprehensive framework that provides a holistic system view for optimizing LLM serving in an end-to-end manner. In this work, we conduct a detailed analysis to identify the major bottlenecks that impact end-to-end latency in LLM serving systems. Our analysis reveals that a comprehensive LLM serving endpoint must address a series of efficiency bottlenecks that extend beyond LLM inference. We then propose ScaleLLM, an optimized system for resource-efficient LLM serving. Our extensive experiments reveal that, with 64 concurrent requests, ScaleLLM achieves a 4.3x speedup over vLLM and outperforms state-of-the-art systems with 1.5x higher throughput.
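The headline numbers in the abstract come from an end-to-end concurrency benchmark (64 simultaneous requests per serving endpoint, measuring latency and throughput at the client rather than inside the inference engine). As an illustration only, the sketch below shows one way such a measurement could be set up against an OpenAI-compatible HTTP endpoint; the endpoint URL, model name, prompt, and token budget are placeholder assumptions, and this is not the authors' benchmarking harness.

```python
# Hypothetical end-to-end throughput probe: fire N concurrent chat-completion
# requests at an OpenAI-compatible serving endpoint and report mean request
# latency and aggregate tokens/sec. Endpoint, model, and prompt are
# placeholders, not taken from the paper.
import asyncio
import time

import aiohttp

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # placeholder URL
MODEL = "meta-llama/Llama-2-7b-chat-hf"                 # placeholder model
CONCURRENCY = 64  # matches the 64-concurrent-request setting in the abstract


async def one_request(session: aiohttp.ClientSession) -> tuple[float, int]:
    """Send a single request; return (latency_seconds, completion_tokens)."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Summarize LLM serving bottlenecks."}],
        "max_tokens": 256,
    }
    start = time.perf_counter()
    async with session.post(ENDPOINT, json=payload) as resp:
        body = await resp.json()
    latency = time.perf_counter() - start
    tokens = body.get("usage", {}).get("completion_tokens", 0)
    return latency, tokens


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        wall_start = time.perf_counter()
        results = await asyncio.gather(*(one_request(session) for _ in range(CONCURRENCY)))
        wall = time.perf_counter() - wall_start
    latencies = [latency for latency, _ in results]
    total_tokens = sum(tokens for _, tokens in results)
    print(f"mean latency: {sum(latencies) / len(latencies):.2f}s")
    print(f"aggregate throughput: {total_tokens / wall:.1f} tokens/sec")


if __name__ == "__main__":
    asyncio.run(main())
```

Measuring at the client like this captures gateway, routing, and communication overhead in addition to raw inference time, which is the "end-to-end" view the abstract argues prior work overlooks.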

References (20)
  1. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
  2. Fireworks AI. Fireworks AI. http://fireworks.ai. Accessed: 2024-07-16.
  3. Lucio Franco. Tonic: 0.1 has arrived! https://luciofran.co/tonic-0-1-release/, 2020.
  4. Mixtral of Experts. arXiv preprint arXiv:2401.04088, 2024.
  5. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
  6. Marina Lammertyn. 60+ ChatGPT statistics and facts you need to know in 2024. https://blog.invgate.com/chatgpt-statistics, 2024.
  7. Announcing Tokio 0.1. https://tokio.rs/blog/2017-01-tokio-0-1, 2017.
  8. OpenOrca: An open dataset of GPT-augmented FLAN reasoning traces. https://huggingface.co/Open-Orca/OpenOrca, 2023.
  9. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100, 2024.
  10. KIVI: A tuning-free asymmetric 2-bit quantization for KV cache. arXiv preprint arXiv:2402.02750, 2024.
  11. NVIDIA. TensorRT-LLM. https://github.com/NVIDIA/TensorRT-LLM. Accessed: 2024-07-16.
  12. OpenAI. OpenAI API. https://platform.openai.com/docs/api-reference/introduction, 2024.
  13. David Pedersen. Announcing Axum. https://tokio.rs/blog/2021-07-announcing-axum, 2021.
  14. Together AI. Together AI. http://together.ai. Accessed: 2024-07-16.
  15. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  16. vLLM. http://vllm.ai. Accessed: 2024-07-16.
  17. Hugging Face's Transformers: State-of-the-art natural language processing, 2019. URL https://github.com/huggingface/transformers.
  18. Soft prompt recovers compressed LLMs, transferably. In International Conference on Machine Learning. PMLR, 2024.
  19. KV cache compression, but what must we give in return? A comprehensive benchmark of long-context-capable approaches. arXiv preprint arXiv:2407.01527, 2024.
  20. KV cache is 1 bit per channel: Efficient large language model inference with coupled quantization. arXiv preprint arXiv:2405.03917, 2024.
Citations (1)
