Papers
Topics
Authors
Recent
Search
2000 character limit reached

HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference

Published 5 Feb 2025 in cs.DC and cs.LG | (2502.03589v1)

Abstract: Disaggregated LLM inference has gained popularity as it separates the computation-intensive prefill stage from the memory-intensive decode stage, avoiding the prefill-decode interference and improving resource utilization. However, transmitting Key-Value (KV) data between the two stages can be a bottleneck, especially for long prompts. Additionally, the computation time overhead for prefill and decode is key for optimizing Job Completion Time (JCT), and KV data size can become prohibitive for long prompts and sequences. Existing KV quantization methods can alleviate the transmission bottleneck and reduce memory requirements, but they introduce significant dequantization overhead, exacerbating the computation time. We propose Homomorphic Acceleration via Compression of the KV cache (HACK) for disaggregated LLM inference. HACK eliminates the heavy KV dequantization step, and directly performs computations on quantized KV data to approximate and reduce the cost of the expensive matrix-multiplication step. Extensive trace-driven experiments show that HACK reduces JCT by up to 70.9% compared to disaggregated LLM inference baseline and by up to 52.3% compared to state-of-the-art KV quantization methods.

Summary

  • The paper demonstrates HACK's main contribution is eliminating dequantization overhead by computing directly on quantized KV data during LLM inference.
  • It employs an asymmetric 2-bit stochastic quantization with partitioning to reduce communication delays, achieving up to 70.9% reduction in job completion time.
  • The method integrates with FlashAttention-2 and open-source frameworks to reduce memory usage by up to 33.6%, enhancing inference efficiency.

"HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference" (2502.03589)

Abstract and Motivation

The paper presents HACK, a method aimed at improving the efficiency of disaggregated LLM inference. Traditionally, disaggregated inference separates the computation-heavy prefill stage and the memory-heavy decode stage, potentially causing bottlenecks in KV transmission due to the increased communication overhead associated with long prompts. Existing quantization methods help alleviate these bottlenecks but introduce significant dequantization overhead.

As observed, communication can consume up to 42.2% of the total job completion time (JCT), making it imperative to optimize this process. The key motivation behind HACK is to eliminate the dequantization overhead by operating directly on quantized KV data, thus enabling faster and more efficient matrix multiplication during inference.

Homomorphic Quantization Method

HACK leverages homomorphic quantization principles to facilitate computation directly on quantized KV data. This involves using an asymmetric 2-bit stochastic quantization to reduce quantization error, partitioning matrix elements into segments called partitions to apply a complex quantization system without needing intensive dequantization later.

Practical Application

The homomorphic scheme incorporates a specialized method to approximate true matrix multiplication results (Equation [3][3]), allowing computations using low-precision data without dequantization. This reduces KV data transmission latency, computation time during inference, and memory demand, effectively mitigating the primary bottleneck—network overhead in disaggregated LLM inference. Figure 1

Figure 1

Figure 1: Employing KV quantization across datasets.

Implementation Details

HACK integrates with FlashAttention-2, a memory-efficient attention kernel, modified to employ OpenAI's Triton for executing fused kernels optimized for attention computation during inference. These kernels enable computation using INT8 format to achieve considerable speed-ups.

Kernel Fusion and Data Management

Kernel fusion employed in HACK ensures efficient computation by combining quantization, matrix multiplication, and other operations into singular, streamlined processes, also accommodating the management of quantized KV data structures within the existing vLLM framework. Figure 2

Figure 2: Overview of HACK in disaggregated LLM inference.

Performance Evaluation

End-to-End Time and Scaling

Experiments reveal that HACK can reduce the average JCT by up to 70.9% compared to both baseline disaggregated methods and state-of-the-art KV quantization strategies, demonstrating substantial improvements across different models and datasets. Figure 3

Figure 3: Average JCT across requests for Llama-3.1 70B with varying datasets.

Memory Efficiency

Peak memory usage is reduced by up to 33.6% in long-sequence datasets, showing that HACK effectively manages memory more efficiently than existing quantization techniques, largely due to eliminating unnecessary recomputation and increasing computational accuracy.

Comparison to Existing Methods

HACK is distinguished from other solutions like CacheGen and KVQuant primarily due to its unique handling of quantized data without incurring heavy dequantization costs. While FP4/6/8 methods provide hardware-based acceleration, they fall short in compressing KV size to comparably efficient levels.

Limitations and Future Work

HACK's focus on severe quantization for transmission efficiency reduces accuracy marginally, providing potential for further exploration into lower-bit quantization schemes that might balance accuracy with latency improvements. Additionally, TurboAttention's symmetric quantization presents an interesting comparison in terms of error rates and overhead.

Scalability

Future work could explore integrating CUDA implementations directly to leverage INT4 computations for further performance gains beyond Triton's current INT8 support. Figure 4

Figure 4: Average JCT across requests for different models with Cocktail or arXiv.

Conclusion

Overall, HACK presents a compelling approach to handling disaggregated LLM inference, offering significant reductions in latency and computational overhead through innovative uses of homomorphic quantization. It highlights potential directions for future enhancements in performance and implementation strategies for large-scale models. Figure 5

Figure 5: Average JCT across requests for individual methods with Llama-3.1 70B.

As LLMs continue to grow in size and complexity, methods like HACK will be instrumental in enabling cost-effective inference at scale with the robust computational capabilities provided by direct quantized data manipulation.

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 2 tweets with 15 likes about this paper.