ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference (2410.21465v1)
Abstract: With the widespread deployment of long-context LLMs, there has been a growing demand for efficient support of high-throughput inference. However, as the key-value (KV) cache expands with the sequence length, the increasing memory footprint and the need to access it for each token generation both result in low throughput when serving long-context LLMs. While various dynamic sparse attention methods have been proposed to speed up inference while maintaining generation quality, they either fail to sufficiently reduce GPU memory consumption or introduce significant decoding latency by offloading the KV cache to the CPU. We present ShadowKV, a high-throughput long-context LLM inference system that stores the low-rank key cache and offloads the value cache to reduce the memory footprint for larger batch sizes and longer sequences. To minimize decoding latency, ShadowKV employs an accurate KV selection strategy that reconstructs minimal sparse KV pairs on-the-fly. By evaluating ShadowKV on a broad range of benchmarks, including RULER, LongBench, and Needle In A Haystack, and models like Llama-3.1-8B, Llama-3-8B-1M, GLM-4-9B-1M, Yi-9B-200K, Phi-3-Mini-128K, and Qwen2-7B-128K, we demonstrate that it can support up to 6× larger batch sizes and boost throughput by up to 3.04× on an A100 GPU without sacrificing accuracy, even surpassing the performance achievable with infinite batch size under the assumption of infinite GPU memory. The code is available at https://github.com/bytedance/ShadowKV.
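The abstract describes two mechanisms: a low-rank key cache kept on the GPU plus a value cache offloaded to the CPU, and a chunk-level KV selection step that reconstructs only a small set of sparse KV pairs during decoding. The sketch below illustrates that general idea for a single attention head in PyTorch. It is a minimal sketch, not the authors' implementation: the function names, the RANK/CHUNK/TOPK constants, and the landmark-based chunk scoring are illustrative assumptions (the real system, per the linked repository, operates per layer on pre-RoPE keys with custom CUDA kernels).

```python
# Minimal single-head PyTorch sketch of the idea summarized in the abstract.
# NOT the authors' implementation: names, shapes, and the RANK/CHUNK/TOPK
# constants are illustrative assumptions.
import torch

RANK, CHUNK, TOPK = 32, 8, 16  # assumed rank, chunk size, and #selected chunks


def prefill(keys: torch.Tensor, values: torch.Tensor):
    """keys, values: [seq_len, head_dim] on GPU; seq_len divisible by CHUNK."""
    # Low-rank key cache kept on GPU: keys ≈ U_r @ V_r.
    U, S, Vh = torch.linalg.svd(keys, full_matrices=False)
    U_r = (U[:, :RANK] * S[:RANK]).contiguous()   # [seq_len, RANK]
    V_r = Vh[:RANK, :].contiguous()               # [RANK, head_dim]
    # Chunk-level "landmark" keys (mean key per chunk) used for KV selection.
    landmarks = keys.reshape(-1, CHUNK, keys.shape[-1]).mean(dim=1)
    # Value cache offloaded to pinned CPU memory to free GPU space.
    values_cpu = values.cpu().pin_memory()
    return U_r, V_r, landmarks, values_cpu


def decode_step(query, U_r, V_r, landmarks, values_cpu):
    """query: [head_dim] on GPU; returns the attention output for one token."""
    # 1. Pick the chunks most relevant to the current query.
    top_chunks = (landmarks @ query).topk(min(TOPK, landmarks.shape[0])).indices
    token_idx = (top_chunks[:, None] * CHUNK +
                 torch.arange(CHUNK, device=query.device)).reshape(-1)
    # 2. Reconstruct only the selected keys from the low-rank factors (GPU-only).
    keys_sel = U_r[token_idx] @ V_r
    # 3. Fetch only the selected values back from the CPU (a real system would
    #    use pinned staging buffers and overlapped copies here).
    values_sel = values_cpu[token_idx.cpu()].to(query.device, non_blocking=True)
    # 4. Sparse attention over the small reconstructed KV set.
    attn = torch.softmax(keys_sel @ query / keys_sel.shape[-1] ** 0.5, dim=0)
    return attn @ values_sel
```

In this toy setting the GPU holds only seq_len×RANK + RANK×head_dim numbers for the keys instead of seq_len×head_dim, the values live in CPU memory, and each decoding step moves only TOPK×CHUNK value vectors across PCIe, which is the intuition behind fitting larger batches without inflating decoding latency.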
- Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.
- GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Yi: Open foundation models by 01.ai, 2024.
- LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023.
- Palu: Compressing KV-cache with low-rank projection. arXiv preprint arXiv:2407.21118, 2024.
- Internet of agents: Weaving a web of heterogeneous agents for collaborative intelligence. arXiv preprint arXiv:2407.07061, 2024.
- Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
- DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model, 2024.
- Prompt-prompted adaptive structured pruning for efficient LLM generation. In First Conference on Language Modeling, 2024a.
- Get more with less: Synthesizing recurrence with KV cache compression for efficient LLM inference. arXiv preprint arXiv:2402.09398, 2024b.
- The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
- Model tells you what to discard: Adaptive KV cache compression for LLMs. arXiv preprint arXiv:2310.01801, 2023.
- ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools, 2024.
- Gradient. Llama-3-8B-Instruct-Gradient-1048k (v0.1), 2024. URL https://huggingface.co/gradientai/Llama-3-8B-Instruct-Gradient-1048k.
- FastDecode: High-throughput GPU-efficient LLM serving using heterogeneous pipelines. arXiv preprint arXiv:2403.11421, 2024.
- TeacherLM: Teaching to fish rather than giving the fish, language modeling likewise. arXiv preprint arXiv:2310.19019, 2023.
- FlashDecoding++: Faster large language model inference on GPUs. arXiv preprint arXiv:2311.01282, 2023.
- KVQuant: Towards 10 million context length LLM inference with KV cache quantization. arXiv preprint arXiv:2401.18079, 2024.
- RULER: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
- MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024.
- MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. arXiv preprint arXiv:2407.02490, 2024.
- Hydragen: High-throughput LLM inference with shared prefixes. arXiv preprint arXiv:2402.05099, 2024.
- Greg Kamradt. Needle in a haystack - pressure testing LLMs. 2023.
- Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
- InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), 2024.
- Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023.
- Compressing context to enhance inference efficiency of large language models. arXiv preprint arXiv:2310.06201, 2023.
- SnapKV: LLM knows what you are looking for before generation. arXiv preprint arXiv:2404.14469, 2024.
- World model on million-length video and language with RingAttention. arXiv preprint arXiv:2402.08268, 2024a.
- Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. Advances in Neural Information Processing Systems, 36, 2024b.
- KIVI: A tuning-free asymmetric 2-bit quantization for KV cache. arXiv preprint arXiv:2402.02750, 2024c.
- Meta AI. Introducing Llama 3.1, 2024. URL https://ai.meta.com/blog/meta-llama-3-1/. Accessed: 2024-08-21.
- Microsoft. Microsoft Bing Chat, 2024. URL https://www.bing.com/chat.
- NVIDIA. CUDA Toolkit, 2024. URL https://developer.nvidia.com/cuda-toolkit. Accessed: 2024-09-25.
- PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
- ICD-LM: Configuring vision-language in-context demonstrations by language modeling. arXiv preprint arXiv:2312.10104, 2023.
- Learnable in-context vector for visual question answering. arXiv preprint arXiv:2406.13185, 2024.
- QwenTeam. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/.
- Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.
- SparQ Attention: Bandwidth-efficient LLM inference. arXiv preprint arXiv:2312.04985, 2023.
- FlexGen: High-throughput generative inference of large language models with a single GPU. In International Conference on Machine Learning, pages 31094–31116. PMLR, 2023.
- Loki: Low-rank keys for efficient sparse attention. arXiv preprint arXiv:2406.02542, 2024.
- FMint: Bridging human designed and data pretrained models for differential equation foundation model. arXiv preprint arXiv:2404.14688, 2024.
- RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- TriForce: Lossless acceleration of long sequence generation with hierarchical speculative decoding. arXiv preprint arXiv:2404.11912, 2024.
- RazorAttention: Efficient KV cache compression through retrieval heads. arXiv preprint arXiv:2407.15891, 2024a.
- Quest: Query-aware sparsity for efficient long-context LLM inference. arXiv preprint arXiv:2406.10774, 2024b.
- Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- CUTLASS, January 2023. URL https://github.com/NVIDIA/cutlass.
- Prompt2Model: Generating deployable models from natural language instructions. arXiv preprint arXiv:2308.12261, 2023.
- Parameter-efficient tuning of large-scale multimodal foundation model. Advances in Neural Information Processing Systems, 36, 2024a.
- Leave no document behind: Benchmarking long-context LLMs with extended multi-doc QA. arXiv preprint arXiv:2406.17419, 2024b.
- T. Wolf. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
- Configurable foundation models: Building LLMs from a modular perspective. arXiv preprint arXiv:2409.02877, 2024.
- SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR, 2023a.
- Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2023b.
- ThinK: Thinner key cache by query-driven pruning. arXiv preprint arXiv:2407.21018, 2024.
- Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024a.
- No token left behind: Reliable KV cache compression via importance-aware mixed precision quantization. arXiv preprint arXiv:2402.18096, 2024b.
- Cascade inference: Memory bandwidth efficient shared prefix batch decoding, 2024. URL https://flashinfer.ai/2024/02/02/cascade-inference.html. Accessed: 2024-09-25.
- Rhyme-aware Chinese lyric generator based on GPT. arXiv preprint arXiv:2408.10130, 2024.
- WKVQuant: Quantizing weight and key/value cache for large language models gains more. arXiv preprint arXiv:2402.12065, 2024.
- PQCache: Product quantization-based KVCache for long context LLM inference. arXiv preprint arXiv:2407.12820, 2024a.
- ∞Bench: Extending long context evaluation beyond 100K tokens, 2024b.
- PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069, 2024c.
- H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36, 2024d.
- SELF-GUIDE: Better task-specific instruction following via self-synthetic finetuning. arXiv preprint arXiv:2407.12874, 2024.
- Hanshi Sun
- Li-Wen Chang
- Wenlei Bao
- Size Zheng
- Ningxin Zheng
- Xin Liu
- Harry Dong
- Yuejie Chi
- Beidi Chen