VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization

Published 7 Oct 2025 in cs.CL (arXiv:2510.06175v1)

Abstract: The Key-Value (KV) cache introduces substantial memory overhead during LLM inference. Although existing vector quantization (VQ) methods reduce KV cache usage and provide flexible representational capacity across bit-widths, they suffer severe performance degradation at ultra-low bit-widths due to key cache outliers that hinder effective codebook utilization. To address this challenge, we propose VecInfer, a novel VQ method for aggressive KV cache compression while enabling efficient inference. By applying smooth and Hadamard transformations, VecInfer suppresses outliers in the key cache, enabling the codebook to comprehensively cover the original data distribution and thereby reducing quantization difficulty. To facilitate efficient deployment, we design an optimized CUDA kernel that fuses computation with dequantization to minimize memory access overhead. Extensive evaluations demonstrate that VecInfer consistently outperforms existing quantization baselines across both long-context understanding and mathematical reasoning tasks. With only 2-bit quantization, VecInfer achieves performance comparable to full precision, while delivering up to $\mathbf{2.7\times}$ speedup in large-batch self-attention computation and $\mathbf{8.3\times}$ reduction in single-batch end-to-end latency on Llama-3.1-8B with a 196k sequence length.

Summary

  • The paper introduces VecInfer, which suppresses key cache outliers using smooth and Hadamard transformations to enable effective low-bit KV cache compression.
  • It integrates an offline-trained vector quantization codebook, achieving up to a 2.7× speedup in large-batch self-attention and an 8.3× reduction in single-batch end-to-end latency.
  • The optimized fused CUDA kernel minimizes memory overhead by combining computation and dequantization, enhancing performance in long-context LLMs.

Introduction

The paper "VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization" introduces a novel vector quantization (VQ) method known as VecInfer, specifically designed to enhance the efficiency of LLM inference by compressing the key-value (KV) cache without significant performance degradation. The primary challenge addressed by VecInfer is the high memory overhead associated with KV caches during LLM inference, particularly when operating at ultra-low bit-widths. Existing VQ methods suffer from reduced performance due to key cache outliers, which inhibit effective codebook utilization.

VecInfer addresses these challenges by applying smooth and Hadamard transformations to suppress key cache outliers, thereby facilitating a more comprehensive coverage by the codebook. This paper further optimizes inference by deploying a customized CUDA kernel that fuses computation and dequantization to reduce memory access overhead.

Methodology

Transformation Techniques

VecInfer employs dual transformations to refine the distribution of the key cache, thereby minimizing the impact of outliers:

  • Smooth Transformation: Channel-wise scaling of the keys paired with an inverse scaling of the queries, which preserves the query-key dot products exactly.
  • Hadamard Transformation: An orthogonal transformation using a Hadamard matrix that further redistributes the data across channels, evening out the impact of outliers.

Figure 1: Channel-wise distribution of the key cache.

Together, these transformations reshape the key cache into a distribution that is more amenable to quantization, lowering quantization difficulty and improving codebook utilization.
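The dual transformation above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the per-channel scale `s` shown here (the maximum key magnitude per channel) is one plausible choice, and the paper may derive its scaling differently. The key property the sketch demonstrates is that the attention scores are left exactly unchanged.

```python
import numpy as np

def hadamard(n):
    """Build an orthonormal n x n Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def dual_transform(q, k, eps=1e-5):
    """Illustrative smooth + Hadamard transform on query/key matrices.

    q, k: arrays of shape (seq_len, head_dim). The channel-wise scale s
    is an assumed choice for demonstration purposes.
    """
    d = k.shape[-1]
    s = np.abs(k).max(axis=0) + eps   # per-channel magnitude of the keys
    k_smooth = k / s                  # suppress key outlier channels
    q_smooth = q * s                  # inverse scaling keeps q @ k.T unchanged
    H = hadamard(d)
    # An orthogonal H preserves inner products: (qH)(kH)^T = q k^T.
    return q_smooth @ H, k_smooth @ H

# Equivalence check: attention scores are unchanged by the dual transform.
rng = np.random.default_rng(0)
q, k = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
qt, kt = dual_transform(q, k)
assert np.allclose(q @ k.T, qt @ kt.T)
```

Because both steps are exact reparameterizations of the dot product, only the representation of the keys changes; the model's attention output is mathematically identical before quantization error is introduced.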

Vector Quantization

Post-transformation, VecInfer employs a VQ technique that maps the transformed key vectors to a predefined codebook, which is trained offline using K-Means clustering. By partitioning high-dimensional vectors into sub-vectors and encoding them using a limited bit-width, VecInfer effectively compresses the KV cache. The two-stage process involves:

  1. Prefilling: Applying transformations and quantizing keys and values using the pre-trained codebooks.
  2. Decoding: Online transformation of incoming keys, followed by concatenation with the existing quantized entries.

Figure 2: Typical vector quantization pipeline.

This method significantly reduces memory requirements without compromising the integrity of the data, supporting efficient large-batch self-attention computation.
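The sub-vector codebook scheme described above can be sketched as follows. This is an illustrative toy, not the paper's training recipe: the sub-vector dimension, bit-width, and plain Lloyd-style K-Means loop are all assumed for demonstration, and a production codebook would be trained offline on representative key/value statistics.

```python
import numpy as np

def train_codebook(data, sub_dim=2, bits_per_elem=2, iters=20, seed=0):
    """Offline K-Means codebook over sub-vectors (illustrative sketch).

    Each sub-vector of sub_dim elements is encoded with
    bits_per_elem * sub_dim index bits, so the codebook holds
    2 ** (bits_per_elem * sub_dim) centroids.
    """
    rng = np.random.default_rng(seed)
    subs = data.reshape(-1, sub_dim)        # partition vectors into sub-vectors
    k = 2 ** (bits_per_elem * sub_dim)
    centers = subs[rng.choice(len(subs), k, replace=False)].copy()
    for _ in range(iters):                  # plain Lloyd iterations
        codes = np.argmin(((subs[:, None] - centers) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(codes == c):
                centers[c] = subs[codes == c].mean(axis=0)
    return centers

def vq_encode(x, centers, sub_dim=2):
    """Map each sub-vector to the index of its nearest centroid."""
    subs = x.reshape(-1, sub_dim)
    return np.argmin(((subs[:, None] - centers) ** 2).sum(-1), axis=1)

def vq_decode(codes, centers, shape):
    """Look up centroids and reassemble the original vector shape."""
    return centers[codes].reshape(shape)
```

With `sub_dim=2` and `bits_per_elem=2`, each pair of floating-point values is replaced by a 4-bit index into a 16-entry codebook, which is where the memory savings of VQ over the full-precision cache come from.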

Hardware Optimization

VecInfer implements a fused dequantization-computation CUDA kernel, which enhances computational efficiency through:

  • Fine-Grained Tiled Computation: Segmentation of the attention computation into manageable tiles to mitigate memory bottlenecks.
  • Asynchronous Pipeline Execution: Using CUDA's memcpy_async API to overlap memory transfers with computation, further reducing latency.

Figure 3: Overview of VecInfer. During inference, the dual transformation is applied before vector quantization.

These optimizations let the kernel read only the compact codes from global memory and decode them on the fly, rather than first materializing a full-precision KV cache.
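The fusion idea can be illustrated with a NumPy stand-in for the CUDA kernel. This sketch is only a functional model under assumed names and shapes: it shows tiles of codes being dequantized and consumed immediately (as a real kernel would do in shared memory or registers), so the full-precision key/value cache is never materialized as a whole.

```python
import numpy as np

def fused_dequant_attention(q, key_codes, val_codes, k_book, v_book,
                            sub_dim=2, tile=64):
    """Functional model of a fused dequantize-and-attend pass.

    q: (head_dim,) query; key_codes/val_codes: (seq_len, head_dim // sub_dim)
    codebook indices; k_book/v_book: (n_codes, sub_dim) centroids.
    The tile size is a tunable choice, not a value from the paper.
    """
    seq_len = key_codes.shape[0]
    d = key_codes.shape[1] * sub_dim
    scores = np.empty(seq_len)
    for t0 in range(0, seq_len, tile):
        t1 = min(t0 + tile, seq_len)
        # Dequantize one tile of keys and consume it immediately.
        k_tile = k_book[key_codes[t0:t1]].reshape(t1 - t0, d)
        scores[t0:t1] = k_tile @ q
    p = np.exp(scores - scores.max())         # numerically stable softmax
    p /= p.sum()
    out = np.zeros(d)
    for t0 in range(0, seq_len, tile):
        t1 = min(t0 + tile, tile + t0) and min(t0 + tile, seq_len)
        v_tile = v_book[val_codes[t0:t1]].reshape(t1 - t0, d)
        out += p[t0:t1] @ v_tile              # accumulate weighted values
    return out

rng = np.random.default_rng(2)
book = rng.normal(size=(16, 2))
kc = rng.integers(0, 16, size=(128, 4))
vc = rng.integers(0, 16, size=(128, 4))
q = rng.normal(size=8)
out = fused_dequant_attention(q, kc, vc, book, book)
```

In the actual CUDA kernel, the per-tile dequantization would overlap with the asynchronous copy of the next tile of codes, which is what the memcpy_async pipeline described above provides.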

Performance Evaluation

The evaluations demonstrate VecInfer's superiority over existing quantization methods, especially on long-context understanding and mathematical reasoning tasks. With a 2-bit KV cache, VecInfer achieves up to a 2.7× speedup in large-batch self-attention computation and an 8.3× reduction in single-batch end-to-end latency, while maintaining performance comparable to full precision.

Additionally, the method's ability to maintain high accuracy at reduced bit-widths makes it particularly effective in long-sequence settings, such as Llama-3.1-8B evaluated at a 196k sequence length.

Conclusion

VecInfer represents a significant advancement in LLM inference by providing a robust framework for KV cache compression at ultra-low bit-widths without sacrificing performance. Its dual transformation approach is effective in suppressing outliers, and its optimized CUDA kernel ensures efficient deployment on GPU hardware. These innovations collectively offer a compelling solution for deploying LLMs in environments constrained by memory and compute resources, broadening the potential applications of such models.

Figure 4: Kernel performance on H100 (80GB). Additional results are provided in Figure 5.

In summary, VecInfer's approach sets a new standard for efficient LLM inference, and its methodologies could potentially be adapted for other types of model compression and optimization tasks within the field of AI.
