QAQ: Quality Adaptive Quantization for LLM KV Cache (2403.04643v2)

Published 7 Mar 2024 in cs.CL

Abstract: The emergence of LLMs has ignited a fresh surge of breakthroughs in NLP applications, particularly in domains such as question-answering systems and text generation. As the need for longer context grows, a significant bottleneck in model deployment emerges due to the linear expansion of the Key-Value (KV) cache with the context length. Existing methods primarily rely on various hypotheses, such as sorting the KV cache based on attention scores for replacement or eviction, to compress the KV cache and improve model throughput. However, the heuristics used by these strategies may wrongly evict essential KV cache entries, which can significantly degrade model performance. In this paper, we propose QAQ, a Quality Adaptive Quantization scheme for the KV cache. We theoretically demonstrate that the key cache and value cache exhibit distinct sensitivities to quantization, leading to the formulation of separate, non-uniform quantization strategies for each. Through the integration of dedicated outlier handling, as well as an improved attention-aware approach, QAQ achieves up to a 10x compression ratio of the KV cache size with a negligible impact on model performance. QAQ significantly reduces the practical hurdles of deploying LLMs, opening up new possibilities for longer-context applications. The code is available at github.com/ClubieDong/KVCacheQuantization.
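To make the abstract's idea concrete, below is a minimal, hypothetical PyTorch sketch of the general ingredients it names: quantizing the key and value caches at different bit-widths and keeping per-channel outliers in full precision. The function names, bit-widths, and outlier fraction are illustrative assumptions, and the sketch does not implement QAQ's attention-aware, per-token precision allocation; see the released code for the actual method.

```python
# Illustrative sketch only (not the authors' implementation): simulate KV cache
# quantization with separate bit-widths for keys and values plus simple outlier handling.
import torch

def fake_quantize(x: torch.Tensor, n_bits: int, outlier_frac: float = 0.01) -> torch.Tensor:
    """Simulate uniform quantization along the last dim (quantize then dequantize),
    keeping the largest-magnitude entries per channel in full precision."""
    qmax = 2 ** n_bits - 1
    k = max(1, int(outlier_frac * x.shape[-1]))
    cutoff = x.abs().topk(k, dim=-1).values[..., -1:]      # per-channel magnitude cutoff
    outlier_mask = x.abs() >= cutoff
    # Quantization range is computed over inliers only, so outliers do not stretch it.
    x_min = x.masked_fill(outlier_mask, float("inf")).amin(dim=-1, keepdim=True)
    x_max = x.masked_fill(outlier_mask, float("-inf")).amax(dim=-1, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    q = ((x - x_min) / scale).round().clamp(0, qmax)
    dequant = q * scale + x_min
    return torch.where(outlier_mask, x, dequant)           # outliers stay in full precision

def quantize_kv_cache(key_cache: torch.Tensor, value_cache: torch.Tensor):
    # Keys and values get different precisions, reflecting their distinct quantization
    # sensitivities; 3 bits for keys and 2 bits for values are placeholder choices.
    return fake_quantize(key_cache, n_bits=3), fake_quantize(value_cache, n_bits=2)

# Example with (batch, heads, seq_len, head_dim) caches.
k, v = torch.randn(1, 8, 128, 64), torch.randn(1, 8, 128, 64)
k_q, v_q = quantize_kv_cache(k, v)
print((k - k_q).abs().mean().item(), (v - v_q).abs().mean().item())
```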

Authors (4)
  1. Shichen Dong (2 papers)
  2. Wen Cheng (12 papers)
  3. Jiayu Qin (4 papers)
  4. Wei Wang (1793 papers)
Citations (20)