ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference (2410.21465v1)
Abstract: With the widespread deployment of long-context LLMs, there has been a growing demand for efficient support of high-throughput inference. However, as the key-value (KV) cache expands with the sequence length, the increasing memory footprint and the need to access it for each token generation both result in low throughput when serving long-context LLMs. While various dynamic sparse attention methods have been proposed to speed up inference while maintaining generation quality, they either fail to sufficiently reduce GPU memory consumption or introduce significant decoding latency by offloading the KV cache to the CPU. We present ShadowKV, a high-throughput long-context LLM inference system that stores the low-rank key cache and offloads the value cache to reduce the memory footprint for larger batch sizes and longer sequences. To minimize decoding latency, ShadowKV employs an accurate KV selection strategy that reconstructs minimal sparse KV pairs on-the-fly. By evaluating ShadowKV on a broad range of benchmarks, including RULER, LongBench, and Needle In A Haystack, and models like Llama-3.1-8B, Llama-3-8B-1M, GLM-4-9B-1M, Yi-9B-200K, Phi-3-Mini-128K, and Qwen2-7B-128K, we demonstrate that it can support up to 6× larger batch sizes and boost throughput by up to 3.04× on an A100 GPU without sacrificing accuracy, even surpassing the performance achievable with infinite batch size under the assumption of infinite GPU memory. The code is available at https://github.com/bytedance/ShadowKV.
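The abstract describes two mechanisms: a low-rank key cache kept on the GPU plus a value cache offloaded to the CPU, and a chunk-level KV selection step that reconstructs only a small set of sparse KV pairs during decoding. The sketch below illustrates that general idea for a single attention head in PyTorch. It is a minimal sketch, not the authors' implementation: the function names, the RANK/CHUNK/TOPK constants, and the landmark-based chunk scoring are illustrative assumptions (the real system, per the linked repository, operates per layer on pre-RoPE keys with custom CUDA kernels).

```python
# Minimal single-head PyTorch sketch of the idea summarized in the abstract.
# NOT the authors' implementation: names, shapes, and the RANK/CHUNK/TOPK
# constants are illustrative assumptions.
import torch

RANK, CHUNK, TOPK = 32, 8, 16  # assumed rank, chunk size, and #selected chunks


def prefill(keys: torch.Tensor, values: torch.Tensor):
    """keys, values: [seq_len, head_dim] on GPU; seq_len divisible by CHUNK."""
    # Low-rank key cache kept on GPU: keys ≈ U_r @ V_r.
    U, S, Vh = torch.linalg.svd(keys, full_matrices=False)
    U_r = (U[:, :RANK] * S[:RANK]).contiguous()   # [seq_len, RANK]
    V_r = Vh[:RANK, :].contiguous()               # [RANK, head_dim]
    # Chunk-level "landmark" keys (mean key per chunk) used for KV selection.
    landmarks = keys.reshape(-1, CHUNK, keys.shape[-1]).mean(dim=1)
    # Value cache offloaded to pinned CPU memory to free GPU space.
    values_cpu = values.cpu().pin_memory()
    return U_r, V_r, landmarks, values_cpu


def decode_step(query, U_r, V_r, landmarks, values_cpu):
    """query: [head_dim] on GPU; returns the attention output for one token."""
    # 1. Pick the chunks most relevant to the current query.
    top_chunks = (landmarks @ query).topk(min(TOPK, landmarks.shape[0])).indices
    token_idx = (top_chunks[:, None] * CHUNK +
                 torch.arange(CHUNK, device=query.device)).reshape(-1)
    # 2. Reconstruct only the selected keys from the low-rank factors (GPU-only).
    keys_sel = U_r[token_idx] @ V_r
    # 3. Fetch only the selected values back from the CPU (a real system would
    #    use pinned staging buffers and overlapped copies here).
    values_sel = values_cpu[token_idx.cpu()].to(query.device, non_blocking=True)
    # 4. Sparse attention over the small reconstructed KV set.
    attn = torch.softmax(keys_sel @ query / keys_sel.shape[-1] ** 0.5, dim=0)
    return attn @ values_sel
```

In this toy setting the GPU holds only seq_len×RANK + RANK×head_dim numbers for the keys instead of seq_len×head_dim, the values live in CPU memory, and each decoding step moves only TOPK×CHUNK value vectors across PCIe, which is the intuition behind fitting larger batches without inflating decoding latency.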
- Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.
- GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Yi: Open foundation models by 01.ai, 2024.
- LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023.
- Palu: Compressing KV-cache with low-rank projection. arXiv preprint arXiv:2407.21118, 2024.
- Internet of agents: Weaving a web of heterogeneous agents for collaborative intelligence. arXiv preprint arXiv:2407.07061, 2024.
- Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
- DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model, 2024.
- Prompt-prompted adaptive structured pruning for efficient LLM generation. In First Conference on Language Modeling, 2024a.
- Get more with less: Synthesizing recurrence with KV cache compression for efficient LLM inference. arXiv preprint arXiv:2402.09398, 2024b.
- The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
- Model tells you what to discard: Adaptive KV cache compression for LLMs. arXiv preprint arXiv:2310.01801, 2023.
- ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools, 2024.
- Gradient. Llama-3-8B-Instruct-Gradient-1048k (v0.1), 2024. URL https://huggingface.co/gradientai/Llama-3-8B-Instruct-Gradient-1048k.
- FastDecode: High-throughput GPU-efficient LLM serving using heterogeneous pipelines. arXiv preprint arXiv:2403.11421, 2024.
- TeacherLM: Teaching to fish rather than giving the fish, language modeling likewise. arXiv preprint arXiv:2310.19019, 2023.
- FlashDecoding++: Faster large language model inference on GPUs. arXiv preprint arXiv:2311.01282, 2023.
- KVQuant: Towards 10 million context length LLM inference with KV cache quantization. arXiv preprint arXiv:2401.18079, 2024.
- RULER: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
- MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024.
- MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. arXiv preprint arXiv:2407.02490, 2024.
- Hydragen: High-throughput LLM inference with shared prefixes. arXiv preprint arXiv:2402.05099, 2024.
- Greg Kamradt. Needle in a haystack - pressure testing LLMs. 2023.
- Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
- InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), 2024.
- Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023.
- Compressing context to enhance inference efficiency of large language models. arXiv preprint arXiv:2310.06201, 2023.
- SnapKV: LLM knows what you are looking for before generation. arXiv preprint arXiv:2404.14469, 2024.
- World model on million-length video and language with RingAttention. arXiv preprint arXiv:2402.08268, 2024a.
- Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. Advances in Neural Information Processing Systems, 36, 2024b.
- KIVI: A tuning-free asymmetric 2-bit quantization for KV cache. arXiv preprint arXiv:2402.02750, 2024c.
- Meta AI. Introducing Llama 3.1, 2024. URL https://ai.meta.com/blog/meta-llama-3-1/. Accessed: 2024-08-21.
- Microsoft. Microsoft Bing Chat, 2024. URL https://www.bing.com/chat.
- NVIDIA. CUDA Toolkit, 2024. URL https://developer.nvidia.com/cuda-toolkit. Accessed: 2024-09-25.
- PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
- ICD-LM: Configuring vision-language in-context demonstrations by language modeling. arXiv preprint arXiv:2312.10104, 2023.
- Learnable in-context vector for visual question answering. arXiv preprint arXiv:2406.13185, 2024.
- QwenTeam. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/.
- Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.
- SparQ Attention: Bandwidth-efficient LLM inference. arXiv preprint arXiv:2312.04985, 2023.
- FlexGen: High-throughput generative inference of large language models with a single GPU. In International Conference on Machine Learning, pages 31094–31116. PMLR, 2023.
- Loki: Low-rank keys for efficient sparse attention. arXiv preprint arXiv:2406.02542, 2024.
- FMint: Bridging human designed and data pretrained models for differential equation foundation model. arXiv preprint arXiv:2404.14688, 2024.
- RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- TriForce: Lossless acceleration of long sequence generation with hierarchical speculative decoding. arXiv preprint arXiv:2404.11912, 2024.
- RazorAttention: Efficient KV cache compression through retrieval heads. arXiv preprint arXiv:2407.15891, 2024a.
- Quest: Query-aware sparsity for efficient long-context LLM inference. arXiv preprint arXiv:2406.10774, 2024b.
- Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- CUTLASS, January 2023. URL https://github.com/NVIDIA/cutlass.
- Prompt2Model: Generating deployable models from natural language instructions. arXiv preprint arXiv:2308.12261, 2023.
- Parameter-efficient tuning of large-scale multimodal foundation model. Advances in Neural Information Processing Systems, 36, 2024a.
- Leave no document behind: Benchmarking long-context LLMs with extended multi-doc QA. arXiv preprint arXiv:2406.17419, 2024b.
- T. Wolf. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
- Configurable foundation models: Building LLMs from a modular perspective. arXiv preprint arXiv:2409.02877, 2024.
- SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR, 2023a.
- Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2023b.
- ThinK: Thinner key cache by query-driven pruning. arXiv preprint arXiv:2407.21018, 2024.
- Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024a.
- No token left behind: Reliable KV cache compression via importance-aware mixed precision quantization. arXiv preprint arXiv:2402.18096, 2024b.
- Cascade inference: Memory bandwidth efficient shared prefix batch decoding, 2024. URL https://flashinfer.ai/2024/02/02/cascade-inference.html. Accessed: 2024-09-25.
- Rhyme-aware Chinese lyric generator based on GPT. arXiv preprint arXiv:2408.10130, 2024.
- WKVQuant: Quantizing weight and key/value cache for large language models gains more. arXiv preprint arXiv:2402.12065, 2024.
- PQCache: Product quantization-based KVCache for long context LLM inference. arXiv preprint arXiv:2407.12820, 2024a.
- ∞Bench: Extending long context evaluation beyond 100K tokens, 2024b.
- PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069, 2024c.
- H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36, 2024d.
- SELF-GUIDE: Better task-specific instruction following via self-synthetic finetuning. arXiv preprint arXiv:2407.12874, 2024.
- Hanshi Sun
- Li-Wen Chang
- Wenlei Bao
- Size Zheng
- Ningxin Zheng
- Xin Liu
- Harry Dong
- Yuejie Chi
- Beidi Chen