
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving (2310.07240v6)

Published 11 Oct 2023 in cs.NI and cs.LG

Abstract: As LLMs take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging, as nothing can be generated until the whole context is processed by the LLM. While the context-processing delay can be reduced by reusing the KV cache of a context across different inputs, fetching the KV cache, which contains large tensors, over the network can cause high extra network delays. CacheGen is a fast context-loading module for LLM systems. First, CacheGen uses a custom tensor encoder, leveraging KV cache's distributional properties to encode a KV cache into more compact bitstream representations with negligible decoding overhead, to save bandwidth usage. Second, CacheGen adapts the compression level of different parts of a KV cache to cope with changes in available bandwidth, in order to maintain low context-loading delay and high generation quality. When available bandwidth drops, CacheGen may raise the compression level for a part of the context or recompute its KV cache on the fly. We test CacheGen on popular LLMs and datasets. Compared to the recent systems that reuse the KV cache, CacheGen reduces the KV cache size by 3.5-4.3x and the total delay in fetching and processing contexts by 3.2-3.7x with negligible impact on the LLM response quality. Our code is at: https://github.com/UChi-JCL/CacheGen.

CacheGen: An Approach to KV Cache Compression and Streaming for Efficient LLM Serving

The paper "CacheGen: KV Cache Compression and Streaming for Fast LLM Serving" introduces an innovative approach to address the latency issues in LLM serving systems, particularly focusing on the delays incurred by processing long-context inputs. As LLMs increasingly engage in complex tasks, the requirement to process longer contexts introduces significant latency in generating outputs. This latency challenge prompted the authors to develop CacheGen, a solution designed to enhance the efficiency of context loading in LLM systems by facilitating faster fetching and processing of contexts through optimized KV cache management.

Key Concepts and Methodologies

  1. KV Cache Encoding:
    • CacheGen employs a novel KV cache encoding scheme that mitigates the network delay of transferring large tensor-based KV caches. The scheme combines custom quantization with arithmetic coding, exploiting two distributional properties of KV caches: token-wise locality (adjacent tokens' K/V values differ only slightly) and layer-wise sensitivity to data loss (shallower layers are more sensitive, so deeper layers tolerate coarser quantization). An illustrative sketch of this encoding idea follows this list.
    • By encoding KV caches into compact bitstream representations, CacheGen substantially reduces bandwidth usage, addressing one of the primary bottlenecks in context-loading latency, while keeping decoding overhead negligible.
  2. Adaptation to Bandwidth Variations:
    • The streaming module in CacheGen adapts to fluctuations in available network bandwidth: when bandwidth drops, the system raises the compression level for parts of the context or falls back to recomputing their KV cache from text on the fly (a second sketch after this list illustrates the decision logic).
    • This adaptability keeps context-loading delay within service-level objectives without materially compromising the accuracy or quality of the generated responses.
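
To make the encoding concrete, below is a minimal, hedged sketch rather than CacheGen's actual encoder: it delta-encodes each token's K/V values against an anchor token (exploiting token-wise locality), quantizes the deltas with a layer-dependent number of bins (exploiting layer-wise sensitivity to loss), and entropy-codes the result. Here zlib merely stands in for the arithmetic coder described above, and every name, shape, and parameter is an illustrative assumption.

```python
# Illustrative sketch only -- not CacheGen's actual encoder. zlib stands in
# for the paper's arithmetic coder; shapes and bin counts are assumptions.
import zlib
import numpy as np

def encode_kv_layer(kv, num_bins):
    """Encode one layer's K (or V) tensor of shape (tokens, hidden).

    Shallower, more loss-sensitive layers would get a larger num_bins;
    deeper layers tolerate coarser quantization (smaller num_bins).
    """
    anchor = kv[0].astype(np.float32)            # anchor token kept at full precision
    deltas = kv[1:].astype(np.float32) - anchor  # token-wise locality: deltas are small
    scale = float(np.abs(deltas).max()) + 1e-8
    q = np.round(deltas / scale * (num_bins // 2)).astype(np.int8)  # uniform quantization
    header = anchor.tobytes() + np.float32(scale).tobytes()
    return header + zlib.compress(q.tobytes(), level=9)            # entropy-code the symbols

def decode_kv_layer(blob, shape, num_bins):
    tokens, hidden = shape
    anchor = np.frombuffer(blob[:4 * hidden], dtype=np.float32)
    scale = np.frombuffer(blob[4 * hidden:4 * hidden + 4], dtype=np.float32)[0]
    q = np.frombuffer(zlib.decompress(blob[4 * hidden + 4:]), dtype=np.int8)
    deltas = q.reshape(tokens - 1, hidden).astype(np.float32) / (num_bins // 2) * scale
    return np.concatenate([anchor[None, :], anchor + deltas], axis=0)

# Example: encode one layer with fairly coarse quantization.
kv = np.random.randn(256, 1024).astype(np.float32)
blob = encode_kv_layer(kv, num_bins=32)
kv_hat = decode_kv_layer(blob, kv.shape, num_bins=32)
print("compressed fraction:", round(len(blob) / kv.nbytes, 3))
print("max abs error:      ", round(float(np.abs(kv - kv_hat).max()), 3))
```

CacheGen itself operates on groups of tokens and drives a custom arithmetic coder with KV-specific statistics; the general-purpose compressor above is only a placeholder for that step.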

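The bandwidth-adaptation logic can likewise be sketched as a per-chunk decision loop. The snippet below is an assumed simplification, not the paper's actual streaming controller: for each context chunk it picks the highest-quality pre-encoded compression level whose estimated transfer time fits a per-chunk delay budget, and it falls back to recomputing that chunk's KV cache from text when even the smallest bitstream is too slow. The class, function, sizes, and bandwidth values are all illustrative.

```python
# Simplified, assumed sketch of per-chunk bandwidth adaptation;
# this is not the actual CacheGen streaming controller.
from dataclasses import dataclass

@dataclass
class ChunkVersions:
    # Pre-encoded bitstream sizes (bytes) for one context chunk, ordered
    # from highest quality (largest) to lowest quality (smallest).
    sizes: list[int]
    recompute_seconds: float  # time to rebuild this chunk's KV cache from text

def pick_action(chunk: ChunkVersions, bandwidth_bps: float, budget_s: float):
    """Return ("level", i) to stream compression level i, or ("recompute", None)."""
    for level, size in enumerate(chunk.sizes):
        if size * 8 / bandwidth_bps <= budget_s:
            return ("level", level)            # highest quality that fits the budget
    if chunk.recompute_seconds <= budget_s:
        return ("recompute", None)             # cheaper to redo prefill for this chunk
    return ("level", len(chunk.sizes) - 1)     # nothing fits: send the smallest anyway

# Example: as bandwidth drops, chunks are compressed harder or recomputed.
chunks = [ChunkVersions(sizes=[8_000_000, 4_000_000, 2_000_000],
                        recompute_seconds=0.2) for _ in range(4)]
for bw in (200e6, 50e6, 10e6):                 # bits per second, illustrative
    print(int(bw), [pick_action(c, bw, budget_s=0.25) for c in chunks])
```

Because the decision is made per chunk, a mid-transfer drop in bandwidth only degrades the chunks that have not yet been sent, which is how context-loading delay stays within the service-level objective.
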
Experimental Evaluation and Results

The experimental evaluations showcased CacheGen's performance across various LLMs and datasets, demonstrating substantial improvements in time-to-first-token (TTFT) metrics:

  • Performance Metrics: CacheGen reduced KV cache sizes by 3.5-4.3× and cut the total delay of fetching and processing contexts by 3.2-3.7× compared with recent systems that reuse KV caches. Critically, these improvements were achieved without significant degradation in response quality. (A back-of-envelope sketch of why fetching a compressed KV cache can beat recomputation follows this list.)
  • Comparison with Baselines: Against both transmitting the raw text context and applying basic quantization, CacheGen maintained a superior trade-off between transmission delay and LLM accuracy across diverse workloads and network conditions.
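
As a rough way to read these numbers, consider the back-of-envelope comparison below. Every value in it (context length, prefill throughput, raw KV cache size, link bandwidth) is an assumption chosen purely for illustration, not a measurement from the paper; it only shows the regime in which fetching a compressed KV cache beats both prefilling from text and fetching an uncompressed cache.

```python
# Back-of-envelope TTFT comparison; all numbers below are illustrative
# assumptions, not measurements from the CacheGen paper.
def ttft_prefill(context_tokens, prefill_tok_per_s):
    # Recompute the KV cache from text: dominated by prefill compute time.
    return context_tokens / prefill_tok_per_s

def ttft_fetch(kv_bytes, bandwidth_bps, decode_s=0.0):
    # Load a precomputed KV cache over the network, then decode it.
    return kv_bytes * 8 / bandwidth_bps + decode_s

ctx_tokens   = 8_192       # assumed context length
prefill_rate = 2_000       # assumed prefill throughput, tokens/s
kv_bytes     = 4e9         # assumed raw (uncompressed) KV cache size, bytes
bandwidth    = 10e9        # assumed link speed, bits/s

print("prefill from text :", ttft_prefill(ctx_tokens, prefill_rate), "s")    # ~4.1 s
print("fetch raw KV cache:", ttft_fetch(kv_bytes, bandwidth), "s")           # ~3.2 s
print("fetch ~4x smaller :", ttft_fetch(kv_bytes / 4, bandwidth, 0.05), "s") # ~0.85 s
```

Under slower links the comparison tilts back toward recomputation, which is precisely the case CacheGen's bandwidth adaptation is designed to handle.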

Implications and Future Directions

CacheGen offers considerable practical advantages by optimizing KV cache management, thus facilitating more efficient use of bandwidth and computational resources in LLM serving environments. By addressing the latency challenges associated with long contexts, CacheGen has the potential to enhance user experience by enabling faster and more responsive LLM applications.

The theoretical implications of this work suggest new avenues for engineering KV caches in LLMs, particularly in environments where network bandwidth is variable or constrained. The observations about KV cache characteristics may guide future research in designing even more effective compression algorithms tailored specifically for tensor-based data structures in neural networks.

Looking ahead, potential developments could include integrating CacheGen within broader frameworks for distributed inference, enabling more seamless and cost-effective deployment of LLMs across computational infrastructures. Additionally, exploring CacheGen's compatibility with emergent memory-efficient architectures or investigating its application within multi-tenant LLM platforms could extend its utility and impact.

In conclusion, CacheGen is a significant advance in LLM serving, offering a robust solution to the persistent challenge of context-induced latency. By grounding KV cache compression and streaming in empirical observations about KV cache structure, it underscores the value of targeted systems engineering that accounts for the interplay between network dynamics and computational efficiency in modern AI applications.

Authors (14)
  1. Yuhan Liu (103 papers)
  2. Hanchen Li (11 papers)
  3. Kuntai Du (14 papers)
  4. Jiayi Yao (15 papers)
  5. Yihua Cheng (28 papers)
  6. Yuyang Huang (14 papers)
  7. Shan Lu (31 papers)
  8. Michael Maire (40 papers)
  9. Henry Hoffmann (21 papers)
  10. Ari Holtzman (39 papers)
  11. Ganesh Ananthanarayanan (14 papers)
  12. Junchen Jiang (39 papers)
  13. Siddhant Ray (6 papers)
  14. Qizheng Zhang (8 papers)
Citations (11)