FDC: Fast KV Dimensionality Compression for Efficient LLM Inference (2408.04107v3)
Abstract: In large language models (LLMs), the limited memory of the Key-Value Cache (KVC) poses a challenge during inference. In this work, we propose FDC, a fast KV dimensionality compression system that eliminates the decompression overhead incurred by the existing KV dimensionality compression system, Palu, and reduces attention time. FDC further employs adaptive compression, tailoring KV compression rates across heads and layers according to their contributions to inference, so as to maximize overall compression while satisfying an accuracy-loss constraint. In addition, FDC enhances the attention kernel to balance the uneven workloads introduced by adaptive compression, further reducing attention latency. Comprehensive experiments show that, compared to Palu, FDC reduces Job Completion Time (JCT) by up to 64% and delivers up to 1.97x the throughput at the same latency, while retaining 99% of the accuracy of the uncompressed model. When state-of-the-art eviction and quantization methods are combined with FDC, they exhibit similar improvements over the same methods combined with Palu. We have open-sourced the code.
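To make the decompression-overhead point concrete, below is a minimal NumPy sketch, our illustration rather than the paper's code, of the standard matrix-absorption identity that lets attention scores be computed directly over a low-rank-compressed key cache instead of reconstructing full-dimensional keys first; the names `K_c`, `B`, `q`, and all shapes are hypothetical.

```python
import numpy as np

# Sketch of decompression-free attention over a low-rank KV cache.
# Shapes (hypothetical): d = head dim, r = compressed rank (r < d),
# T = number of cached tokens.
rng = np.random.default_rng(0)
d, r, T = 64, 16, 8

K_c = rng.standard_normal((T, r))  # compressed keys stored in the cache
B = rng.standard_normal((r, d))    # reconstruction (up-projection) matrix
q = rng.standard_normal(d)         # one query vector

# Naive path: reconstruct full keys, then take dot products.
# Costs O(T*r*d) for the reconstruction alone.
scores_decompress = (K_c @ B) @ q

# Decompression-free path: absorb B into the query once (O(r*d)),
# then attend directly over the compressed cache in O(T*r).
q_c = B @ q
scores_direct = K_c @ q_c

# Both paths compute identical scores by associativity: (K_c B) q = K_c (B q).
assert np.allclose(scores_decompress, scores_direct)
```

The saving comes from moving the projection off the per-token critical path: the reconstruction matrix is applied once to the query rather than to every cached key, which is the general mechanism by which per-step decompression cost can be avoided.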