HACK: Homomorphic KVCache Quantization
- Homomorphic KVCache Quantization (HACK) is a class of methods that directly process compressed key-value caches in transformer models to optimize memory and speed.
- It employs techniques such as direct homomorphic computation, commutative encoding with RoPE, and adaptive quantization to eliminate dequantization overhead.
- Experimental results demonstrate up to 70.9% faster inference and 8-16× memory savings with minimal accuracy degradation in large language models.
Homomorphic KVCache Quantization (HACK) encompasses a class of methodologies that enable direct computation on compressed, quantized key–value (KV) caches in transformer-based LLMs, with the goal of achieving substantial memory and throughput gains while maintaining accuracy. Distinct from traditional quantization approaches that require explicit dequantization before each matrix operation, HACK aims for “homomorphic” properties: the compressed format supports efficient, loss-minimized computation within the model—eliminating costly decompression overhead and enabling more scalable and resource-efficient inference.
1. Motivation and Conceptual Foundations
The memory demands of KV cache storage in LLMs scale linearly with context length and batch size, rapidly exceeding hardware constraints, especially in disaggregated inference architectures or on resource-constrained devices. Early approaches to KV cache quantization focused on uniform per-token or per-channel bit reduction, but these often required dequantization before each attention operation, causing computation and latency bottlenecks (Yue et al., 19 Feb 2024, Yang et al., 28 Feb 2024, Zhang et al., 5 Feb 2025). The HACK paradigm is defined by:
- Performing attention and other transformer computations directly in the quantized domain, avoiding repeated memory expansion and computation associated with dequantization steps.
- Designing quantized representations and quantization operators that closely preserve the algebraic and information-theoretic relationships required by self-attention and other downstream operations.
- Achieving homomorphism in practice: ensuring that quantized operations (e.g., matrix multiplication, attention) approximate their full-precision counterparts within acceptable error bounds and with controllable error propagation.
These goals reflect both theoretical ambitions (maintain operation commutativity and linearity where possible) and practical imperatives (speed, throughput, memory savings) (Zhang et al., 5 Feb 2025, Zhang et al., 7 May 2024, Li et al., 23 Jun 2025).
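One way to state the homomorphic requirement schematically (a formulation offered here for clarity, not taken verbatim from the cited papers): for a quantizer $\mathcal{Q}$ and a quantized-domain attention kernel $\widetilde{\mathrm{Attn}}$ that never materializes dequantized tensors,

$$\left\| \operatorname{softmax}\!\left(\tfrac{QK^{\top}}{\sqrt{d}}\right)V \;-\; \widetilde{\mathrm{Attn}}\big(\mathcal{Q}(Q),\, \mathcal{Q}(K),\, \mathcal{Q}(V)\big) \right\| \;\le\; \epsilon,$$

with $\epsilon$ small, bounded, and controllable across layers.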
2. Methodological Principles and Quantization Schemes
State-of-the-art HACK methods employ several families of techniques, often in combination:
a. Direct Homomorphic Computation
HACK (Zhang et al., 5 Feb 2025) partitions quantized queries (Q), keys (K), and values (V) into subblocks, applies an asymmetric 2-bit stochastic quantization, and approximates the matrix multiplication required by attention with an expansion that operates directly on the quantized integers. Writing the per-partition asymmetric quantization as $X \approx s_X \hat{X} + m_X$, the attention scores become

$$(QK^{\top})_{ij} \;\approx\; s_Q s_K \sum_{l} \hat{Q}_{il}\hat{K}_{jl} \;+\; s_Q m_K \sum_{l} \hat{Q}_{il} \;+\; m_Q s_K \sum_{l} \hat{K}_{jl} \;+\; d\, m_Q m_K,$$

where the scaling factors ($s_Q, s_K$), minimums ($m_Q, m_K$), and quantized integer tensors ($\hat{Q}, \hat{K}$) are determined per partition, and $d$ is the partition width. The attention kernel is implemented to work natively in the quantized domain, removing dequantization as an inference bottleneck.
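A minimal NumPy sketch of this computation pattern follows; the partition handling, the stochastic rounding rule, and all function names are illustrative assumptions rather than the published kernel, which runs as a fused GPU kernel.

```python
import numpy as np

def quantize_block(x, bits=2, rng=None):
    """Asymmetric stochastic quantization of one partition (subblock) of Q or K."""
    if rng is None:
        rng = np.random.default_rng(0)
    levels = 2 ** bits - 1
    m = x.min()                          # per-partition minimum (offset)
    s = (x.max() - m) / levels           # per-partition scale
    s = s if s > 0 else 1.0              # guard against constant blocks
    u = (x - m) / s                      # map values into [0, levels]
    low = np.floor(u)
    q = low + (rng.random(x.shape) < (u - low))   # stochastic rounding
    return q.astype(np.int8), s, m

def quantized_scores(q_hat, s_q, m_q, k_hat, s_k, m_k):
    """Approximate Q @ K.T using only integer tensors plus per-block scalars,
    i.e. the expansion of (s_q*q_hat + m_q) @ (s_k*k_hat + m_k).T."""
    d = q_hat.shape[1]
    int_prod = q_hat.astype(np.int32) @ k_hat.astype(np.int32).T  # integer matmul
    row_sum_q = q_hat.sum(axis=1, keepdims=True)   # cached intermediate summations
    row_sum_k = k_hat.sum(axis=1, keepdims=True)
    return (s_q * s_k * int_prod
            + s_q * m_k * row_sum_q
            + m_q * s_k * row_sum_k.T
            + m_q * m_k * d)

# Tiny usage example on a single partition.
Q = np.random.randn(4, 8).astype(np.float32)
K = np.random.randn(6, 8).astype(np.float32)
q_hat, s_q, m_q = quantize_block(Q)
k_hat, s_k, m_k = quantize_block(K)
approx = quantized_scores(q_hat, s_q, m_q, k_hat, s_k, m_k)
print(np.abs(approx - Q @ K.T).max())  # residual error from 2-bit quantization
```

The row sums of $\hat{Q}$ and $\hat{K}$ are the intermediate summations that a real kernel would cache rather than recompute for every score.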
b. Commutativity and Structure-Preserving Encodings
Techniques such as CommVQ (Li et al., 23 Jun 2025) employ additive vector quantization with codebooks designed to be commutative with rotary position embeddings (RoPE). For 2D RoPE with block-diagonal rotation matrices $R_\theta = \mathrm{diag}(R_{\theta_1}, \dots, R_{\theta_{d/2}})$ and a codebook $C$ assembled from matching $2\times 2$ blocks of the form $\left(\begin{smallmatrix} a & -b \\ b & a \end{smallmatrix}\right)$, the design ensures $R_\theta C = C R_\theta$, allowing integrated quantized attention computation. This commutation property is fundamental for enabling quantized representations to support core transformer operations without fidelity losses induced by inconsistent transformations between the quantized space and position encodings.
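A small numerical check of this commutation property, under the block form assumed above; the helper names and codebook values are illustrative, not CommVQ's implementation.

```python
import numpy as np

def rope_block(theta):
    """2x2 RoPE rotation block."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def codebook_block(a, b):
    """2x2 codebook block constrained to the rotation-scaling form [[a, -b], [b, a]]."""
    return np.array([[a, -b], [b, a]])

R = rope_block(0.3)              # position-dependent rotation
C = codebook_block(1.7, -0.4)    # learned codebook parameters (illustrative values)

# Commutativity: rotating a decoded vector equals decoding a pre-rotated code.
print(np.allclose(R @ C, C @ R))   # True

x = np.array([0.5, -1.2])          # a code vector
print(np.allclose(R @ (C @ x), C @ (R @ x)))  # True: RoPE folds into the quantized side
```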
c. Statistical and Structural Adaptation
Several methods enhance homomorphic quantization by aligning transformation and quantization steps with the distributional or structural properties of the activations:
- 2D-Quantization (e.g., WKVQuant (Yue et al., 19 Feb 2024)): Applies channel-wise smoothing and dynamic token-wise scaling, such that the transformation $\tilde{X}_{:,c} = X_{:,c} / \delta_c$ (static per-channel smoothing by a vector $\delta$) plus $\hat{X}_{t,:} = \mathrm{clamp}\!\left(\left\lfloor \tilde{X}_{t,:}/s_t \right\rceil + z_t,\; 0,\; 2^{b}-1\right)$ (dynamic per-token scale $s_t$ and zero-point $z_t$) ensures quantized values remain tightly coupled to full-precision outputs under further transformer block propagation, aided by loss functions such as the cross-block reconstruction loss $\mathcal{L} = \left\| \mathcal{F}(\hat{X}) - \mathcal{F}(X) \right\|_2^2$, with $\mathcal{F}$ denoting propagation through subsequent transformer blocks, for optimized global alignment.
- Mixed-Precision Quantization and Outlier Reservation (Yang et al., 28 Feb 2024, Su et al., 16 May 2025): "Important" KV pairs are stored at higher precision, while others are quantized aggressively, with dynamic detection and full-precision retention of outlier tokens that would otherwise disproportionately inflate quantization error (a minimal sketch follows this list).
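As a sketch of the outlier-reservation idea, the snippet below keeps the highest-magnitude tokens in full precision and quantizes the rest per token; the magnitude criterion, the 5% retention ratio, and the 4-bit setting are illustrative assumptions, not the exact heuristics of the cited papers.

```python
import numpy as np

def mixed_precision_kv(K, keep_ratio=0.05, bits=4):
    """Keep the highest-magnitude 'outlier' tokens in full precision and
    quantize the remaining tokens per-token to low-bit integers."""
    n_tokens = K.shape[0]
    n_keep = max(1, int(keep_ratio * n_tokens))
    # Rank tokens by their largest absolute activation (illustrative criterion).
    outlier_idx = np.argsort(-np.abs(K).max(axis=1))[:n_keep]
    is_outlier = np.zeros(n_tokens, dtype=bool)
    is_outlier[outlier_idx] = True

    levels = 2 ** bits - 1
    mins = K.min(axis=1, keepdims=True)                       # per-token minimum
    scales = (K.max(axis=1, keepdims=True) - mins) / levels   # per-token scale
    scales[scales == 0] = 1.0
    K_q = np.clip(np.round((K - mins) / scales), 0, levels).astype(np.uint8)

    return {"fp_tokens": K[is_outlier],        # retained at full precision
            "q_tokens": K_q[~is_outlier],      # aggressively quantized
            "scales": scales[~is_outlier],
            "mins": mins[~is_outlier],
            "is_outlier": is_outlier}

K = np.random.randn(128, 64).astype(np.float32)
packed = mixed_precision_kv(K)
print(packed["fp_tokens"].shape, packed["q_tokens"].shape)
```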
d. SVD and Latent Decomposition
SVDq (Yankun et al., 21 Feb 2025) projects the key cache into a latent SVD basis and applies importance-aware mixed-precision quantization to the resulting channels. Because the singular values decay rapidly, higher precision is reserved for high-energy latent channels, with theoretical guarantees that quantization error is exponentially reduced compared to uniform per-channel quantization in the original space, enabling high compression ratios with minimal performance loss.
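A rough sketch of the latent-basis idea under simplifying assumptions: the SVD basis is computed from the keys themselves rather than a calibration set, and the bit schedule is an arbitrary illustration rather than SVDq's allocation rule.

```python
import numpy as np

def svd_latent_quantize(K, bit_schedule=((8, 8), (16, 4), (40, 2))):
    """Project keys into an SVD latent basis and quantize each latent channel
    with a precision chosen by its energy rank.
    bit_schedule: sequence of (num_channels, bits), highest-energy channels first."""
    _, _, Vt = np.linalg.svd(K, full_matrices=False)  # rows of Vt ordered by energy
    Z = K @ Vt.T                                      # latent representation

    Z_q = np.empty_like(Z)
    start = 0
    for n_ch, bits in bit_schedule:
        end = min(start + n_ch, Z.shape[1])
        block = Z[:, start:end]
        levels = 2 ** bits - 1
        lo = block.min(axis=0, keepdims=True)
        scale = (block.max(axis=0, keepdims=True) - lo) / levels
        scale[scale == 0] = 1.0
        q = np.clip(np.round((block - lo) / scale), 0, levels)
        Z_q[:, start:end] = q * scale + lo   # dequantized here only to measure error
        start = end
        if start >= Z.shape[1]:
            break
    Z_q[:, start:] = 0.0   # channels beyond the schedule are dropped (sparsified)

    K_rec = Z_q @ Vt        # map back to the original key space
    return np.linalg.norm(K - K_rec) / np.linalg.norm(K)

K = np.random.randn(256, 64)
print("relative reconstruction error:", svd_latent_quantize(K))
```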
3. Experimental Benchmarks and Performance Impact
HACK and related techniques have been evaluated across a variety of models (Llama-3.1 70B, Mistral-v0.3 7B, Falcon 180B, etc.) and datasets (IMDb, arXiv summaries, Cocktail, HumanEval):
- Direct homomorphic computation (Zhang et al., 5 Feb 2025) reduces job completion time (JCT) in disaggregated inference settings relative to both uncompressed baselines and prior state-of-the-art KV quantization (the up to 70.9% speedup noted above), largely by eliminating KV dequantization time and reducing communication overhead.
- CommVQ (Li et al., 23 Jun 2025) achieves large KV cache memory reduction with 2-bit quantization (roughly 8× versus FP16) and enables 1-bit quantization with minimal accuracy degradation on challenging reasoning and long-context benchmarks. LLaMA-3.1 8B is demonstrated with a 128K context length on a single consumer GPU (RTX 4090).
- WKVQuant (Yue et al., 19 Feb 2024) attains memory savings nearly matching those of full weight-activation quantization while maintaining accuracy similar to weight-only quantized models, validating the selective quantization strategy.
- SVDq (Yankun et al., 21 Feb 2025) yields an effective average precision of about 1.25 bits for the key cache (more than 12× compression relative to FP16) with negligible accuracy loss on RULER and LongBench, due to channel prioritization.
- Outlier-aware and mixed-precision schemes (Su et al., 16 May 2025, Yang et al., 28 Feb 2024) substantially reduce memory while increasing throughput, mainly by maintaining fidelity for the most critical or anomalous tokens.
A summary table highlights representative quantitative advances:
| Method | Compression Ratio | Throughput Gain | Accuracy Degradation |
|---|---|---|---|
| HACK | ~8× (2-bit) | up to 70.9% faster JCT | minimal |
| CommVQ | ~8× (2-bit), ~16× (1-bit) | model-dependent | minimal (1-bit) |
| SVDq | >12× effective (keys; higher with sparsity) | model-dependent | negligible |
| Outlier-aware / mixed precision | configuration-dependent | model-dependent | negligible; 12%+ gain on some tasks |
4. Implementation Considerations and System Integration
A robust HACK pipeline requires careful system-level consideration to maintain its theoretical and empirical advantages:
- HACK (Zhang et al., 5 Feb 2025) partitions both quantized queries and K/V matrices into subblocks for joint quantization, and caches intermediate summations during attention computation to compensate for nonlinearity introduced by compression.
- CommVQ (Li et al., 23 Jun 2025) ensures that the learned codebook is compatible with RoPE by constraining its blocks and optimizing with EM; this compatibility is a prerequisite for avoiding extra projection or runtime conversion steps during inference.
- These methods can be integrated into mainstream inference frameworks (e.g., vLLM, FlashAttention-2 kernels) and are compatible with modular deployment in both disaggregated and monolithic architectures.
- For practical deployment, block sizes, quantization bit-width, retention proportion for outliers or high-precision tokens, and codebook parameters must be calibrated to match hardware capabilities and tolerance for accuracy degradation; a sketch of such knobs follows this list.
- Current HACK implementations typically rely on 2-bit representations, which deliver the gains reported above; proposed advancements include more hardware-native low-bit support (e.g., true INT4 kernels) and adaptive precision scheduling.
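As a concrete illustration of these calibration knobs, a hypothetical configuration object is sketched below; the field names and default values are assumptions for exposition, not parameters of any specific framework.

```python
from dataclasses import dataclass

@dataclass
class QuantizedKVCacheConfig:
    """Illustrative deployment knobs for a homomorphic KV-cache quantization pipeline."""
    block_size: int = 64          # tokens per quantization partition/subblock
    key_bits: int = 2             # bit-width for quantized keys
    value_bits: int = 2           # bit-width for quantized values
    outlier_ratio: float = 0.05   # fraction of tokens retained in full precision
    codebook_size: int = 256      # entries per codebook (for VQ-style schemes)
    stochastic_rounding: bool = True

    def bytes_per_token(self, head_dim: int, num_heads: int) -> float:
        """Rough per-token KV footprint, ignoring scales, zero-points, and outliers."""
        bits = (self.key_bits + self.value_bits) * head_dim * num_heads
        return bits / 8

cfg = QuantizedKVCacheConfig()
print(cfg.bytes_per_token(head_dim=128, num_heads=8))  # 512 bytes vs. 4096 bytes at FP16
```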
5. Practical Applications and Limitations
Homomorphic KVCache Quantization is particularly effective in contexts where:
- Large context lengths and batch sizes are indispensable (retrieval-augmented generation, summarization, conversation memory, code completion).
- Memory bandwidth and compute latency are as limiting as raw compute power, especially in GPU-poor or edge deployments.
- Disaggregated inference pipelines, where the prefill and decode stages are separated across different hardware (possibly with unequal network bandwidth), benefit from HACK's ability to compress and transmit KV cache data efficiently, then perform attention in compressed form (Zhang et al., 5 Feb 2025).
- Multimodal and visual autoregressive models can apply head-aware asymmetric compression strategies, selecting pattern-specific methods for different head types to maximize both spatial and semantic fidelity (Qin et al., 12 Apr 2025).
However, the homomorphic approach is subject to certain limitations:
- Homomorphic quantization effectiveness depends on the accuracy of quantized attention relative to full-precision. Some information loss is inevitable, and for high-sensitivity or near-lossless needs, careful calibration of outlier and token retention strategy is critical.
- Methods relying on block or channel structure (e.g., SVDq, commutative codebooks) require additional preprocessing and/or codebook learning epochs, and encode/decode logic increases implementation complexity.
- Scaling to even lower precisions (e.g., sub-1-bit) may cause numerical instability, particularly if codebook or scaling choices are not robust to non-Gaussian or highly multimodal activation distributions.
6. Extensions, Research Directions, and Theoretical Implications
HACK methods and their variants stimulate ongoing research into:
- Alternative quantization schemes (beyond stochastic quantization, including vector and product quantization) and their hardware co-design, potentially leveraging future accelerator support.
- Error-adaptive, context- or content-aware allocation of bit-width and full-precision tokens—possibly in real-time, responding to evolving model activation statistics during inference.
- Integration with algorithmic techniques such as token pruning, compressed attention, head importance estimation, or SVD-based redundancy removal to increase effective compression at minimal accuracy loss.
- Extensions to privacy-preserving inference, including exploring the interplay between homomorphic encryption and homomorphic quantization, with the goal of enabling both efficient and secure computation on compressed, potentially encrypted KV caches (Yankun et al., 21 Feb 2025).
- Mathematical exploration of the information flow and error propagation induced by homomorphic quantization in deep transformer stacks, possibly inspiring new quantization-compatible model architectures and attention mechanisms.
7. Summary
Homomorphic KVCache Quantization (HACK) defines a class of methods enabling transformer inference to operate directly on low-bit compressed KV caches, thereby eliminating dequantization bottlenecks, accelerating attention computation, and substantially reducing the memory footprint. State-of-the-art HACK approaches achieve this by aligning quantization and codebook design with model structure (e.g., attention patterns, position embedding, channel dependencies), leveraging adaptive and mixed-precision quantization, and, where possible, exploiting algebraic properties such as commutativity with position encodings. These methods deliver roughly 8-16× memory savings at 1-2 bit precision with minimal or controllable loss of accuracy, significantly improving JCT and throughput in both standard and disaggregated inference pipelines. While challenges remain, particularly in ultra-low-precision regimes and in robust error management, HACK frameworks represent a central advance in making LLMs more efficient and scalable in real-world deployment scenarios.