ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference (2410.21465v1)

Published 28 Oct 2024 in cs.LG and cs.CL

Abstract: With the widespread deployment of long-context LLMs, there has been a growing demand for efficient support of high-throughput inference. However, as the key-value (KV) cache expands with the sequence length, the increasing memory footprint and the need to access it for each token generation both result in low throughput when serving long-context LLMs. While various dynamic sparse attention methods have been proposed to speed up inference while maintaining generation quality, they either fail to sufficiently reduce GPU memory consumption or introduce significant decoding latency by offloading the KV cache to the CPU. We present ShadowKV, a high-throughput long-context LLM inference system that stores the low-rank key cache and offloads the value cache to reduce the memory footprint for larger batch sizes and longer sequences. To minimize decoding latency, ShadowKV employs an accurate KV selection strategy that reconstructs minimal sparse KV pairs on-the-fly. By evaluating ShadowKV on a broad range of benchmarks, including RULER, LongBench, and Needle In A Haystack, and models like Llama-3.1-8B, Llama-3-8B-1M, GLM-4-9B-1M, Yi-9B-200K, Phi-3-Mini-128K, and Qwen2-7B-128K, we demonstrate that it can support up to 6$\times$ larger batch sizes and boost throughput by up to 3.04$\times$ on an A100 GPU without sacrificing accuracy, even surpassing the performance achievable with infinite batch size under the assumption of infinite GPU memory. The code is available at https://github.com/bytedance/ShadowKV.

Authors (9)
  1. Hanshi Sun
  2. Li-Wen Chang
  3. Wenlei Bao
  4. Size Zheng
  5. Ningxin Zheng
  6. Xin Liu
  7. Harry Dong
  8. Yuejie Chi
  9. Beidi Chen

Summary

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

The paper "ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference" presents an innovative system designed to enhance the throughput of long-context LLM inference by optimizing key-value (KV) cache storage and access mechanisms. The increasing deployment of LLMs that processes extensive contexts necessitates efficient high-throughput inference strategies to address inherent limitations in existing GPU architectures, especially considering memory constraints and inference latency.

Key Contributions

  1. Efficient Memory Management:
    • ShadowKV splits the KV cache: it keeps a low-rank representation of the key cache on the GPU and offloads the value cache to CPU memory. This separation curbs the memory footprint that grows with sequence length, allowing larger batch sizes and reducing GPU memory pressure (see the first sketch after this list).
  2. Accurate and Minimal KV Pair Selection:
    • The paper introduces a KV selection strategy that reconstructs a minimal set of sparse KV pairs on the fly, reducing decoding latency without sacrificing accuracy. The approach builds on the observation that pre-Rotary Position Embedding (pre-RoPE) keys are inherently low-rank compared with their post-RoPE counterparts, which enables aggressive compression (see the sketch after this list).
  3. Empirical Evaluation and Performance Gains:
    • ShadowKV is evaluated extensively across a range of benchmarks, including RULER, LongBench, and Needle In A Haystack. The results demonstrate that it can support batch sizes up to six times larger and boost throughput by up to 3.04 times on an NVIDIA A100 GPU while preserving accuracy on par with full-cache attention.
  4. Advanced Throughput and Latency Reductions:
    • By issuing work on multiple CUDA streams, ShadowKV overlaps CPU-to-GPU value-cache transfers with on-the-fly key reconstruction and approximate attention computation, hiding much of the data-movement latency (see the second sketch after this list).
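
The first two contributions can be illustrated with a minimal PyTorch sketch: a truncated SVD keeps a low-rank factorization of the pre-RoPE key cache on the GPU, and chunk-level landmark scores decide which value chunks to fetch from the CPU. This is not the authors' implementation; the rank, chunk size, and the helper names compress_keys_low_rank, build_landmarks, and select_chunks are illustrative assumptions, and the real system works per attention head with fused kernels.

```python
# Minimal sketch of low-rank key caching and chunk-level KV selection.
# Assumptions: rank, chunk size, and shapes are illustrative; the "landmark"
# here is simply the mean key of each chunk.
import torch

def compress_keys_low_rank(pre_rope_keys: torch.Tensor, rank: int = 160):
    """pre_rope_keys: [seq_len, hidden] key cache captured before RoPE."""
    # Truncated SVD; pre-RoPE keys are observed to be low-rank, so a small
    # rank reconstructs them with little error.
    U, S, Vh = torch.linalg.svd(pre_rope_keys.float(), full_matrices=False)
    A = U[:, :rank] * S[:rank]   # [seq_len, rank], kept on the GPU
    B = Vh[:rank, :]             # [rank, hidden],  kept on the GPU
    return A, B                  # key row i is approximately A[i] @ B

def build_landmarks(keys: torch.Tensor, chunk_size: int = 8):
    """One landmark per chunk: the mean key of that chunk."""
    seq_len, hidden = keys.shape
    usable = seq_len - seq_len % chunk_size
    return keys[:usable].reshape(-1, chunk_size, hidden).mean(dim=1)

def select_chunks(query: torch.Tensor, landmarks: torch.Tensor, top_k: int = 8):
    """Chunk-level approximate attention: score landmarks against the query
    and keep only the top-k chunks, whose values are then fetched from CPU."""
    scores = landmarks @ query            # [num_chunks]
    return torch.topk(scores, k=top_k).indices
```

At decode time, only the selected chunks' keys are rebuilt from (A, B) and only those chunks' values cross the CPU-GPU link, which is what keeps per-token latency and memory traffic low.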
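
The multi-stream overlap of the fourth contribution can be sketched with PyTorch's stream API. This is a simplified illustration under stated assumptions: decode_step, the chunk layout, and the plain gather-and-copy are hypothetical, and a production system would rely on pinned host memory and fused CUDA kernels to make the transfer truly asynchronous.

```python
# Sketch of overlapping the CPU->GPU value transfer with on-GPU key
# reconstruction using two CUDA streams.
import torch

copy_stream = torch.cuda.Stream()

def decode_step(A, B, chunk_ids, value_cache_cpu, chunk_size=8):
    # A: [seq_len, rank] and B: [rank, hidden] are the low-rank key factors on
    # the GPU; value_cache_cpu: [seq_len, hidden] lives in (ideally pinned) CPU memory.
    device = A.device
    rows_cpu = (chunk_ids.cpu()[:, None] * chunk_size +
                torch.arange(chunk_size)).reshape(-1)
    # Kick off the host-to-device copy of the selected value chunks.
    with torch.cuda.stream(copy_stream):
        values = value_cache_cpu[rows_cpu].to(device, non_blocking=True)
    # Meanwhile, on the default stream, rebuild the selected keys from the
    # low-rank factors that never left the GPU.
    keys = A[rows_cpu.to(device)] @ B
    # Make sure the copy has finished before attention consumes keys and values.
    torch.cuda.current_stream().wait_stream(copy_stream)
    return keys, values
```

The wait_stream call is the only synchronization point, so key reconstruction and the host-to-device copy proceed concurrently and the transfer cost is largely hidden.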

Implications and Future Directions

ShadowKV demonstrates a practical way to ease the trade-off between memory usage and inference latency, with direct benefits for scalable LLM deployment. As LLMs handle ever longer contexts, it offers a modular starting point for further work on cache optimization. Future work could explore adaptive cache management that reallocates resources based on real-time computational load, develop more sophisticated multi-stream scheduling to increase parallelism, or integrate quantization into ShadowKV for additional memory savings without compromising accuracy.

The techniques in this paper address pressing issues in deploying long-context models in real-world applications, and the demonstrated efficiency suggests broad applicability across domains that require rapid processing of large amounts of context. In conclusion, ShadowKV makes a well-evidenced case for restructuring KV-cache storage and access, alleviating a primary bottleneck in LLM serving infrastructure and providing a scalable foundation for future work on high-throughput inference.
