Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention (2403.19708v3)
Abstract: Interacting with humans through multi-turn conversations is a fundamental feature of LLMs. However, existing LLM serving engines are inefficient when executing multi-turn conversations because they repeatedly recompute the key-value (KV) caches of historical tokens, incurring high serving costs. To address this problem, this paper proposes CachedAttention, a new attention mechanism that enables reuse of KV caches across multi-turn conversations, significantly reducing repetitive computation overhead. CachedAttention maintains a hierarchical KV caching system that leverages cost-effective memory and storage media to save the KV caches of all requests. To reduce the overhead of accessing KV caches from slow media, CachedAttention employs layer-wise pre-loading and asynchronous saving schemes that overlap KV cache access with GPU computation. To ensure that the KV caches to be accessed reside in the fastest tier of the hierarchy, CachedAttention employs scheduler-aware fetching and eviction schemes that deliberately place KV caches across tiers based on hints from the inference job scheduler. To avoid invalidation of saved KV caches caused by context window overflow, CachedAttention keeps the saved KV caches valid by decoupling the positional encoding and effectively truncating the KV caches. Extensive experimental results demonstrate that CachedAttention decreases the time to first token (TTFT) by up to 87%, improves prompt prefilling throughput by up to 7.8$\times$ for multi-turn conversations, and reduces the end-to-end inference cost by up to 70%.
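The abstract describes layer-wise pre-loading and asynchronous saving, i.e., fetching the next layer's KV cache from a slower tier while the current layer is being computed, and writing updated caches back without blocking the compute loop. Below is a minimal, illustrative Python sketch of that overlap pattern using a thread pool; the names (`KVStore`, `compute_layer`, `prefill_with_cached_attention`) are hypothetical placeholders, not the paper's actual API, and real transfers would be DMA copies between GPU memory, host memory, and disk rather than dictionary lookups.

```python
# Sketch of layer-wise KV cache pre-loading and asynchronous saving:
# while layer i is computed, the KV cache for layer i+1 is prefetched,
# and the updated cache for layer i is written back in the background.
from concurrent.futures import ThreadPoolExecutor, Future
from typing import Dict, List


class KVStore:
    """Stand-in for the hierarchical KV caching system (host DRAM / disk)."""

    def __init__(self, saved_kv: Dict[int, bytes]):
        self._saved_kv = saved_kv  # layer index -> serialized KV cache

    def load(self, layer: int) -> bytes:
        # In a real system this would copy the KV cache into GPU memory.
        return self._saved_kv[layer]

    def save_async(self, pool: ThreadPoolExecutor, layer: int, kv: bytes) -> Future:
        # Asynchronous saving: write back without blocking the compute loop.
        return pool.submit(self._saved_kv.__setitem__, layer, kv)


def compute_layer(layer: int, kv: bytes) -> bytes:
    """Placeholder for prefill/decode computation of one transformer layer."""
    return kv + b"|updated"


def prefill_with_cached_attention(store: KVStore, num_layers: int) -> List[bytes]:
    outputs: List[bytes] = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Start loading layer 0 before any computation begins.
        pending = pool.submit(store.load, 0)
        save_futures = []
        for layer in range(num_layers):
            kv = pending.result()  # waits only if the prefetch hasn't finished
            if layer + 1 < num_layers:
                # Overlap: fetch the next layer's KV while this layer computes.
                pending = pool.submit(store.load, layer + 1)
            new_kv = compute_layer(layer, kv)
            save_futures.append(store.save_async(pool, layer, new_kv))
            outputs.append(new_kv)
        for f in save_futures:
            f.result()  # ensure all write-backs complete before returning
    return outputs


if __name__ == "__main__":
    store = KVStore({i: f"kv{i}".encode() for i in range(4)})
    print(prefill_with_cached_attention(store, num_layers=4))
```

The same pattern extends to the scheduler-aware placement the abstract mentions: the inference job scheduler knows which conversations will run next, so their KV caches can be promoted to the fastest tier ahead of time while idle conversations are evicted to slower tiers.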
Authors: Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo