GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM (2403.05527v4)

Published 8 Mar 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Key-value (KV) caching has become the de-facto to accelerate generation speed for LLMs inference. However, the growing cache demand with increasing sequence length has transformed LLM inference to be a memory bound problem, significantly constraining the system throughput. Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly. Such methods, however, often incur high approximation errors to represent the compressed matrices. The autoregressive decoding process further compounds the error of each step, resulting in critical deviation in model generation and deterioration of performance. To tackle this challenge, we propose GEAR, an efficient KV cache compression framework that achieves near-lossless high-ratio compression. GEAR first applies quantization to majority of entries of similar magnitudes to ultra-low precision. It then employs a low rank matrix to approximate the quantization error, and a sparse matrix to remedy individual errors from outlier entries. By adeptly integrating three techniques, GEAR is able to fully exploit their synergistic potentials. Our experiments demonstrate that compared to alternatives, GEAR achieves near-lossless 4-bit KV cache compression with up to 2.38x throughput improvement, while reducing peak-memory size up to 2.29x. Our code is publicly available at https://github.com/HaoKang-Timmy/GEAR.

Key–value (KV) caching is the main speed-up technique for autoregressive inference with LLMs, but its memory cost grows linearly with sequence length, quickly becoming the dominant bottleneck on GPUs and PCI-e bandwidth. “GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM” proposes a lightweight, on-the-fly compression method that shrinks KV memory up to ≈3× while preserving generation quality, even on reasoning-heavy tasks where prior token-dropping or uniform quantization methods fail.


1 Problem setting

  • For a 30B-parameter LLM with sequence length 1024 and batch size 128, the KV cache can exceed 180 GB (a back-of-envelope check appears at the end of this section).
  • Existing approaches
    • Token dropping: remove “unimportant” tokens using attention statistics. Works for short or extractive tasks, but loses essential context in long reasoning chains.
    • Uniform / groupwise quantization: compress every entry to 8-bit or 4-bit. With many outlier values the quantization error accumulates autoregressively, derailing generation.

Complex tasks (GSM8k, MMLU, BBH) expose this error accumulation; 4-bit uniform schemes collapse to near-zero accuracy.
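The 180 GB figure is easy to reproduce with a back-of-envelope calculation. The sketch below assumes OPT-30B-like dimensions (48 transformer layers, hidden size 7168), which the summary does not state, and FP16 storage:

```python
# KV-cache size = 2 (K and V) x layers x batch x seq_len x hidden x 2 bytes (FP16).
# Layer count and hidden size are assumed (roughly OPT-30B); adjust for other models.
layers, hidden, seq_len, batch, bytes_fp16 = 48, 7168, 1024, 128, 2
kv_bytes = 2 * layers * batch * seq_len * hidden * bytes_fp16
print(f"{kv_bytes / 1e9:.0f} GB")  # ~180 GB, matching the figure above
```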


2 GEAR compression recipe

GEAR approximates each KV matrix X (either K or V per layer) by the sum of three complementary components:

X ≈ Q + L + S, where Q is a 4-bit quantized backbone, L a low-rank residual, and S a sparse outlier correction.

  1. Uniform quantization (Q): the ≈98 % of entries with similar magnitudes are quantized to 4 bits per value, a hardware-friendly INT4/INT8 path.
  2. Low-rank residual (L): compute the residual R = X − Q − S and extract its top-r singular components with a one-step power iteration (SVDSolver_r); the rank r is typically 2–5 % of min(n, d). This captures coherent residual structure shared across tokens.
  3. Sparse correction (S): extract the largest s % of positive and negative outliers (default s = 2 %), stored as 16-bit values plus INT32 indices.

The three pieces remove different error modes: Q handles the bulk of similar-magnitude entries, L captures coherent token-wise residual patterns, and S patches extreme per-cell deviations. An ablation shows that dropping either L or S degrades GSM8k accuracy from 15.7 % to ≈2 %.
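A minimal PyTorch sketch of this decomposition for a single cached matrix is below. The helper names (quant4, gear_compress) and the per-tensor quantization granularity are illustrative assumptions, not the released implementation:

```python
import torch

def quant4(x):
    """Per-tensor uniform 4-bit (min-max) quantization; returns codes plus scale and zero-point."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / 15                                   # 4 bits -> 16 levels
    codes = torch.clamp(((x - lo) / scale).round(), 0, 15).to(torch.uint8)
    return codes, scale, lo

def dequant4(codes, scale, lo):
    return codes.float() * scale + lo

def gear_compress(X, s=0.02, r=4):
    """Approximate X by Q (4-bit backbone) + L (rank-r residual) + S (sparse outliers)."""
    # S: keep the s/2 largest positive and s/2 largest negative entries exactly.
    k = max(1, int(s * X.numel() / 2))
    flat = X.flatten()
    idx = torch.cat([torch.topk(flat, k).indices, torch.topk(-flat, k).indices])
    S = torch.zeros_like(flat)
    S[idx] = flat[idx]
    S = S.view_as(X)
    # Q: quantize the outlier-free remainder to 4 bits.
    codes, scale, lo = quant4(X - S)
    Q = dequant4(codes, scale, lo)
    # L: rank-r approximation of the residual via randomized SVD (one power-iteration step).
    U, sig, V = torch.svd_lowrank(X - Q - S, q=r, niter=1)
    return (codes, scale, lo), (U * sig, V), S.to_sparse()

X = torch.randn(2048, 128)
(codes, scale, lo), (A, B), S = gear_compress(X)
X_hat = dequant4(codes, scale, lo) + A @ B.T + S.to_dense()
print((X - X_hat).norm() / X.norm())   # relative error of the compressed approximation
```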

Streaming implementation

To hide the latency of SVD and sparse packing, GEAR buffers the latest b tokens (default b = 20) in FP16. Compression is performed only when the buffer is full, amortising extra kernels and keeping additional memory below 2 %.


3 Practical algorithm

For each generation step t:
    append the new k_t, v_t to a small FP16 buffer B
    if |B| == b:
        for X in {K, V}:
            S = top/bottom s % entries of (X ∥ B)
            Q = Quant4((X ∥ B) − S)
            L = PowerIter_SVD_r((X ∥ B) − Q − S)
            store compressed (Q, L, S) for X
        clear B

Compression ratio ≈ 1 / (b/16 + 3s + ρ(n + d)/max(n, d)), where b here denotes the quantization bit-width (not the buffer size) and ρ = r/min(n, d). With b = 4, s = 0.02, ρ = 0.02 this gives roughly a 3× size reduction.
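A quick numerical check of that estimate (the matrix shape is assumed for illustration; the (n + d)/max(n, d) factor lies between 1 and 2, so it barely moves the result):

```python
# Sanity check of the compression-ratio estimate quoted above.
n, d = 2048, 4096                      # assumed KV matrix shape
bit_width, s, rho = 4, 0.02, 0.02
compressed_fraction = bit_width / 16 + 3 * s + rho * (n + d) / max(n, d)
print(f"{1 / compressed_fraction:.2f}x")   # ~2.9x, i.e. the ~3x size reduction above
```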


4 Experiments

Models: LLaMA-2 7B, LLaMA-2 13B, and Mistral-7B (FP16 weights). Tasks: GSM8k, MMLU, BBH, and WikiText-2, evaluated with chain-of-thought (CoT) and zero-shot prompting.

| Method @ 4-bit KV | Compression ratio | GSM8k-CoT (7B) | MMLU-CoT (7B) | GSM8k-CoT (13B) |
|---|---|---|---|---|
| Uniform | — | 0 % | 0 % | 0 % |
| Group-wise | — | 1.4 % | 11.9 % | 2.1 % |
| Outlier-reduced (s = 10 %) | 1.8× | 11.2 % | 40.7 % | 21.3 % |
| GEAR (s = 2 %, ρ = 5 %) | 2.6× | 15.7 % | 44.5 % | 28.0 % |
  • GEAR stays within 0.5 % (absolute) of FP16 accuracy on most datasets.
  • Average accuracy gain over the best baseline: +5.8 % at 2.6× compression.
  • On zero-shot GSM8k (7B-chat), accuracy drops only 0.4 percentage points versus FP16.

System-level: evaluated with a Zig-Zag GPU/CPU offloading scheduler on an RTX Titan (24 GB). 4-bit GEAR raises throughput (tokens/s) by up to 2.38× and cuts peak GPU memory by up to 2.29×, enabling roughly 1.7× larger batch sizes or context windows.


5 Implementation notes

  • Quantization: standard per-tensor INT4 with de/scale on-chip; can reuse INT8 kernels.
  • Sparse outliers: store int32 row/col + fp16 value; CSR faster than COO.
  • Low-rank: a single power-iteration step (2-3 GEMMs) per compression cycle; negligible overhead for b = 20 (see the sketch after this list).
  • Works as drop-in wrapper around HuggingFace generate() loop; reference PyTorch code released.
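A minimal sketch of that one-step power (subspace) iteration, in the spirit of PowerSGD-style low-rank compression (function and variable names are illustrative):

```python
import torch

def low_rank_residual(R, r=4):
    """Rank-r factors of the residual R (n x d) from two GEMMs and a thin QR."""
    n, d = R.shape
    V = torch.randn(d, r, dtype=R.dtype, device=R.device)  # random probe
    U = torch.linalg.qr(R @ V).Q                            # (n, r) orthonormal basis
    B = R.T @ U                                             # (d, r) projected factor
    return U, B                                             # R ≈ U @ B.T

R = torch.randn(512, 256) @ torch.randn(256, 4) @ torch.randn(4, 256)  # rank-4 test matrix
U, B = low_rank_residual(R, r=4)
print((R - U @ B.T).norm() / R.norm())   # ~0 for an exactly rank-4 input
```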

6 Ablations and insights

  • A compression-ratio sweep shows GEAR remains near-lossless up to 3.4×, where the baselines collapse.
  • GEAR vs. token dropping (H₂O): on GSM8k, dropping 50 % of tokens yields 6.8 % accuracy, while GEAR at 3× compression reaches 14.2 %.
  • Fine-tuned GSM8k model: GEAR (ρ = 10 %) recovers 97 % of FP16 accuracy.
  • Rank and sparsity sensitivity: sweet spot around s = 1–2 %, ρ = 2–5 %.

7 Limitations & future work

  • Additional index storage means the effective compression ratio plateaus beyond ≈4×.
  • Assumes residual error is low-rank; may be less effective on highly entropic activations (e.g., early training checkpoints).
  • Custom fused kernels that compute attention scores directly from the factored (Q + L + S) cache, without materializing it, could further reduce compute overhead (see the sketch below).
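To show why such a fused path is attractive, here is a sketch (not from the paper's code) of scoring a query against a factored key cache. By distributivity, each component can be applied separately and the full matrix never needs to be rebuilt:

```python
import torch

n, d, r = 2048, 128, 4                        # tokens, head dim, residual rank (illustrative)
q = torch.randn(1, d)                          # current query
Q = torch.randn(n, d)                          # dequantized 4-bit backbone of the keys
U, B = torch.randn(n, r), torch.randn(d, r)    # low-rank residual factors, L = U @ B.T
S = torch.zeros(n, d)                          # sparse outlier correction (kept dense here)
idx = torch.randint(0, n * d, (int(0.02 * n * d),))
S.view(-1)[idx] = torch.randn(idx.numel())

scores_naive = q @ (Q + U @ B.T + S).T                # materializes the full key matrix
scores_factored = q @ Q.T + (q @ B) @ U.T + q @ S.T   # three cheap terms; S would be a sparse matmul in practice
print(torch.allclose(scores_naive, scores_factored, atol=1e-4))   # True
```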

8 Takeaways for practitioners

  1. Plug GEAR into inference loops to cut KV memory 2-3× with <1 % quality loss on reasoning tasks.
  2. Use s ≈ 2 %, ρ ≈ 2 % (0.02), and a buffer of b ≈ 20 tokens as safe defaults.
  3. Combine with 8-bit or 4-bit weight quantization and offloading frameworks (FlexGen/DeepSpeed-Z3) for maximal GPU memory savings.
  4. Minimal engineering: only INT4 quantization and one small SVD per buffer flush; no retraining, no custom CUDA required.

The paper demonstrates that carefully mixing quantization, low-rank, and sparse corrections provides a sweet spot—achieving “near-lossless” generation accuracy while significantly accelerating and scaling LLM inference workloads.

Authors: Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao