Quantize What Counts: Bit Allocation Insights Informed by Spectral Gaps in Keys and Values (2502.15075v2)
Abstract: LLMs have substantially advanced the capabilities of NLP in recent years. However, as these models continue to grow in size, memory constraints pose a substantial challenge. Key and Value cache (KV cache) quantization is a well-documented and promising solution to this limitation. In this work, we present two novel theorems aimed at enhancing KV quantization methods. Our first theorem, termed Key-Value Norm Disparity, states that key weight matrices inherently carry richer information than value weight matrices, as evidenced by their higher spectral and Frobenius norms across most layers. Our second theorem, Key-Driven Quantization, posits that prioritizing the quantization precision of keys over that of values significantly improves overall quantization performance. In particular, assigning greater precision to keys than to values allows a larger overall reduction in precision with minimal impact on model accuracy. We validate these theorems theoretically and through extensive experiments on several state-of-the-art LLM architectures and benchmarks. These findings offer practical guidelines for improving KV cache quantization strategies, enabling more efficient memory use without compromising model performance across diverse NLP tasks. Source code is available at https://github.com/mohsenhariri/spectral-kv.
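The two claims above can be illustrated with a small, dependency-free sketch. It is not the authors' code (see the repository for that). It computes the spectral and Frobenius norms used to compare key and value weight matrices, and applies a key-favoring bit allocation to a toy KV cache. All function names are hypothetical. The per-token asymmetric min-max uniform quantizer and the default of 4 bits for keys and 2 for values are illustrative assumptions, not necessarily the quantizer or bit widths evaluated in the paper. The random matrices in the demo are stand-ins and are not expected to reproduce the norm disparity; on a real model one would substitute each layer's key and value projection weights.

```python
import math
import random

# Hypothetical helpers illustrating the abstract's two claims; not the authors' implementation.


def frobenius_norm(m):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in m for x in row))


def spectral_norm(m, iters=200, seed=0):
    """Largest singular value of m, estimated by power iteration on M^T M."""
    rng = random.Random(seed)
    n_rows, n_cols = len(m), len(m[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    for _ in range(iters):
        u = [sum(a * b for a, b in zip(row, v)) for row in m]  # u = M v
        w = [sum(m[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # w = M^T u
        w_norm = math.sqrt(sum(x * x for x in w))
        if w_norm == 0.0:
            return 0.0  # degenerate case, e.g. M is all zeros
        v = [x / w_norm for x in w]
    mv = [sum(a * b for a, b in zip(row, v)) for row in m]
    return math.sqrt(sum(x * x for x in mv))


def quantize_dequantize(row, bits):
    """Assumed quantizer: asymmetric min-max uniform quantization of one vector, then dequantization."""
    levels = 2 ** bits - 1
    lo, hi = min(row), max(row)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in row]  # integer codes in [0, levels]
    return [lo + c * scale for c in codes]


def quantize_kv(keys, values, key_bits=4, value_bits=2):
    """Per-token quantization of a KV cache, giving keys more bits than values."""
    k_hat = [quantize_dequantize(row, key_bits) for row in keys]
    v_hat = [quantize_dequantize(row, value_bits) for row in values]
    return k_hat, v_hat


def mean_abs_error(a, b):
    """Mean absolute reconstruction error between two equally shaped matrices."""
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    rng = random.Random(0)
    # Random stand-ins for W_K and W_V (32 x 32); real projection weights would go here.
    w_k = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(32)]
    w_v = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(32)]
    print("spectral ratio K/V:", spectral_norm(w_k) / spectral_norm(w_v))
    print("Frobenius ratio K/V:", frobenius_norm(w_k) / frobenius_norm(w_v))

    # Toy cache: 16 tokens, 32 dims each for keys and values.
    keys = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(16)]
    values = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(16)]
    k_hat, v_hat = quantize_kv(keys, values, key_bits=4, value_bits=2)
    print("key error (4-bit):", mean_abs_error(keys, k_hat))
    print("value error (2-bit):", mean_abs_error(values, v_hat))
```

Quantizing each token's vector with its own minimum and maximum keeps the sketch short; the point it illustrates is only the asymmetric allocation, where the key budget (`key_bits`) exceeds the value budget (`value_bits`) so that the tensor the abstract identifies as more information-rich keeps finer resolution.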