ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency (2408.00008v2)
Abstract: LLMs have surged in popularity and are extensively used in commercial applications, where the efficiency of model serving is crucial for the user experience. Most current research focuses on optimizing individual sub-procedures, e.g., local inference and communication; however, no comprehensive framework provides a holistic system view for optimizing LLM serving in an end-to-end manner. In this work, we conduct a detailed analysis to identify the major bottlenecks that affect end-to-end latency in LLM serving systems. Our analysis reveals that a comprehensive LLM serving endpoint must address a series of efficiency bottlenecks that extend beyond LLM inference itself. We then propose ScaleLLM, an optimized system for resource-efficient LLM serving. Our extensive experiments show that, with 64 concurrent requests, ScaleLLM achieves a 4.3x speedup over vLLM and outperforms state-of-the-art systems with 1.5x higher throughput.
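To make the measurement setting in the abstract concrete, the following is a minimal sketch of a client-side benchmark that issues concurrent requests to an OpenAI-compatible serving endpoint and reports wall-clock throughput and mean end-to-end latency. The endpoint URL, model name, prompt, and generation parameters are illustrative assumptions, not details taken from the paper.

```python
"""Sketch: measure end-to-end latency and throughput under concurrent load
against an OpenAI-compatible chat-completions endpoint (hypothetical setup)."""
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # assumed local server
MODEL = "meta-llama/Llama-2-7b-chat-hf"                 # example model id
CONCURRENCY = 64                                        # mirrors the 64-request setting


def one_request(prompt: str) -> float:
    """Send one blocking request and return its end-to-end latency in seconds."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    start = time.perf_counter()
    resp = requests.post(ENDPOINT, json=payload, timeout=300)
    resp.raise_for_status()
    return time.perf_counter() - start


if __name__ == "__main__":
    prompts = [f"Summarize request {i} in one sentence." for i in range(CONCURRENCY)]
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        latencies = list(pool.map(one_request, prompts))
    wall = time.perf_counter() - wall_start
    print(f"requests: {CONCURRENCY}, wall time: {wall:.2f}s")
    print(f"throughput: {CONCURRENCY / wall:.2f} req/s")
    print(f"mean end-to-end latency: {sum(latencies) / len(latencies):.2f}s")
```

Such a client-side probe captures everything the serving endpoint does (queuing, scheduling, inference, and response streaming), which is the end-to-end view the paper argues for, rather than isolated kernel or inference timings.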
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
- Fireworks AI. http://fireworks.ai. Accessed: 2024-07-16.
- Lucio Franco. Tonic: 0.1 has arrived! https://luciofran.co/tonic-0-1-release/, 2020.
- Mixtral of Experts. arXiv preprint arXiv:2401.04088, 2024.
- Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
- Marina Lammertyn. 60+ ChatGPT statistics and facts you need to know in 2024. https://blog.invgate.com/chatgpt-statistics, 2024.
- Announcing Tokio 0.1. https://tokio.rs/blog/2017-01-tokio-0-1, 2017.
- OpenOrca: An open dataset of GPT-augmented FLAN reasoning traces. https://huggingface.co/Open-Orca/OpenOrca, 2023.
- AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100, 2024.
- KIVI: A tuning-free asymmetric 2-bit quantization for KV cache. arXiv preprint arXiv:2402.02750, 2024.
- NVIDIA. TensorRT-LLM. https://github.com/NVIDIA/TensorRT-LLM. Accessed: 2024-07-16.
- OpenAI. OpenAI API. https://platform.openai.com/docs/api-reference/introduction, 2024.
- David Pedersen. Announcing axum. https://tokio.rs/blog/2021-07-announcing-axum, 2021.
- Together AI. http://together.ai. Accessed: 2024-07-16.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- vLLM. http://vllm.ai. Accessed: 2024-07-16.
- HuggingFace's Transformers: State-of-the-art natural language processing. https://github.com/huggingface/transformers, 2019.
- Soft prompt recovers compressed LLMs, transferably. In International Conference on Machine Learning. PMLR, 2024.
- KV cache compression, but what must we give in return? A comprehensive benchmark of long context capable approaches. arXiv preprint arXiv:2407.01527, 2024.
- KV cache is 1 bit per channel: Efficient large language model inference with coupled quantization. arXiv preprint arXiv:2405.03917, 2024.