TurboAngle: Near-Lossless KV Cache Compression via Uniform Angle Quantization
This presentation explores TurboAngle, a calibration-free method for compressing transformer KV caches that achieves near-lossless quality by exploiting angle distribution uniformity in the FWHT domain. We examine how uniform angle quantization combined with strategic per-layer boosting and asymmetric K/V treatment delivers competitive compression rates without the calibration overhead required by existing methods, while revealing surprising layer sensitivity patterns that challenge conventional assumptions about quantization behavior.

Script
Large language models store enormous key-value caches during inference, and compressing them without losing quality has remained an expensive calibration problem. Until now.
The authors discovered that a simple random rotation followed by the Fast Walsh-Hadamard Transform produces angles that distribute uniformly on the unit circle. This uniformity is the key: it means you can quantize angles with a fixed grid and achieve near-optimal compression without ever calibrating on real data.
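To make that concrete, here is a minimal sketch of fixed-grid angle quantization for consecutive coordinate pairs. The function names and the separate magnitude handling are assumptions for illustration, not the paper's exact codec:

```python
import numpy as np

def quantize_angles(pairs, bits=3):
    """Hypothetical sketch: encode each (even, odd) coordinate pair as a
    grid index for its angle, with the pair magnitude stored separately."""
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])       # angle in (-pi, pi]
    norms = np.linalg.norm(pairs, axis=1)              # magnitude per pair
    step = 2 * np.pi / (2 ** bits)                     # uniform grid spacing
    codes = np.round(theta / step).astype(np.int64) % (2 ** bits)
    return codes, norms

def dequantize_angles(codes, norms, bits=3):
    """Reconstruct pairs from grid indices and stored magnitudes."""
    theta = codes * (2 * np.pi / (2 ** bits))
    return norms[:, None] * np.stack([np.cos(theta), np.sin(theta)], axis=1)
```

Because the angles are uniform after the transform, this fixed grid wastes no levels on empty regions, which is why no calibration data is needed to place the grid.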
Where methods like TurboQuant and KVQuant need calibration runs to learn per-channel statistics, TurboAngle eliminates that entire pipeline. On Mistral 7B, it delivers a perplexity increase of just 0.001 at 3 angle bits per element—14 times better than TurboQuant at 4 bits.
Not all layers tolerate compression equally. Early transformer layers are bottlenecks, requiring 4 to 5 bits, while later layers handle 2 bits easily. Even more striking: the researchers found that key cache norms are 10 to 20 times more sensitive than value norms, and in Phi-1.5, boosting certain mid-layers actually degraded quality, a finding that challenges the assumption that more precision always helps.
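Purely as an illustration, a per-layer bit schedule consistent with that sensitivity pattern might look like the sketch below; the layer cutoffs and bit widths here are assumptions, not the paper's measured allocation:

```python
def angle_bits_for_layer(layer_idx):
    """Assumed schedule: boost the sensitive early layers, compress the rest.
    The thresholds are illustrative, not the paper's tuned values."""
    if layer_idx < 2:
        return 5   # earliest layers: most sensitive
    if layer_idx < 6:
        return 4   # remaining early layers: still boosted
    return 2       # later layers tolerate aggressive compression

# Average angle bits across a 32-layer model under this schedule.
avg_bits = sum(angle_bits_for_layer(i) for i in range(32)) / 32
print(f"average angle bits: {avg_bits:.2f}")
```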
Why does this work? The Fast Walsh-Hadamard Transform combined with a random plus-minus-1 diagonal rotation forces consecutive element pairs into a spherically symmetric distribution. As dimensionality grows, the angles become provably uniform on the unit circle, which is exactly the condition for optimal fixed-grid quantization. No learned codebooks, no channel statistics—just mathematical invariance.
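The property is easy to check empirically. A short sketch, assuming scipy is available; the dense Hadamard matrix stands in for the O(d log d) fast transform for clarity:

```python
import numpy as np
from scipy.linalg import hadamard
from scipy.stats import kstest

rng = np.random.default_rng(0)
d, n = 128, 4096                 # head dimension (power of two), #vectors
# Deliberately non-Gaussian inputs: heavy per-row scale variation.
X = rng.standard_normal((n, d)) * rng.exponential(1.0, size=(n, 1))

# Random +/-1 diagonal, then the normalized Walsh-Hadamard transform.
signs = rng.choice([-1.0, 1.0], size=d)
Y = (X * signs) @ (hadamard(d) / np.sqrt(d))

# Angles of consecutive (even, odd) coordinate pairs.
pairs = Y.reshape(-1, 2)
theta = np.arctan2(pairs[:, 1], pairs[:, 0])

# Map (-pi, pi] to (0, 1] and test against the uniform distribution.
stat, _ = kstest((theta + np.pi) / (2 * np.pi), "uniform")
print(f"KS statistic: {stat:.4f}  (near zero => angles ~ uniform)")
```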
The results are sharp. TurboAngle hits zero perplexity degradation on four architectures and stays under 0.002 on two others. When you add 8-bit linear quantization for key norms and 4-bit log quantization for value norms, the aggregate rate drops to 6.56 bits on Mistral, with a perplexity cost of just 0.0014. This is near-lossless compression at production scale, delivered without a single calibration sample.
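A hedged sketch of what those two norm codecs could look like; the per-tensor min/max scaling is an assumption here, and the paper may use different clipping or grouping:

```python
import numpy as np

def quantize_linear(x, bits=8):
    """Uniform quantization over the observed range (assumed per-tensor scale)."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2 ** bits - 1) or 1.0   # guard against constant input
    codes = np.round((x - lo) / scale).astype(np.int64)
    return codes, lo, scale

def dequantize_linear(codes, lo, scale):
    return lo + codes * scale

def quantize_log(x, bits=4, eps=1e-8):
    """Log-domain quantization for positive norms: linear grid over log(x)."""
    return quantize_linear(np.log(np.maximum(x, eps)), bits)

def dequantize_log(codes, lo, scale):
    return np.exp(dequantize_linear(codes, lo, scale))

# Illustrative round trip on synthetic norms (not real cache statistics).
rng = np.random.default_rng(1)
k_norms = np.abs(rng.standard_normal(1000)) + 0.1
k_hat = dequantize_linear(*quantize_linear(k_norms, bits=8))   # keys: 8-bit linear
v_norms = np.abs(rng.standard_normal(1000)) + 0.1
v_hat = dequantize_log(*quantize_log(v_norms, bits=4))         # values: 4-bit log
print(f"max key-norm error:   {np.abs(k_hat - k_norms).max():.5f}")
print(f"max value-norm error: {np.abs(v_hat - v_norms).max():.5f}")
```

The asymmetry mirrors the sensitivity finding above: the more sensitive key norms get the higher-precision linear code, while value norms get by with a coarser log grid.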
TurboAngle proves that the right mathematical transformation can eliminate calibration entirely while preserving quality that other methods can only approximate. Visit EmergentMind.com to explore the paper in depth and create your own research video.