IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact (2403.01241v2)
Abstract: LLMs excel in natural language processing but demand intensive computation. To mitigate this, various quantization methods have been explored, yet they compromise LLM performance. This paper unveils a previously overlooked type of outlier in LLMs. Such outliers concentrate most of the attention scores on the initial tokens of the input, termed pivot tokens, which are crucial to the performance of quantized LLMs. Motivated by this, we propose IntactKV, which generates the KV cache of pivot tokens losslessly from the full-precision model. The approach is simple, combines easily with existing quantization solutions, and incurs no extra inference overhead. Moreover, IntactKV can be calibrated as additional LLM parameters to further boost quantized LLMs at minimal training cost. Mathematical analysis also proves that IntactKV effectively reduces the upper bound of quantization error. Empirical results show that IntactKV brings consistent improvements over various quantization methods across different LLMs and downstream tasks, leading to a new state of the art for LLM quantization. The code is available at https://github.com/ruikangliu/IntactKV.
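The core idea can be sketched with a HuggingFace-style causal LM: run only the pivot tokens (e.g., the [BOS] token) through the full-precision model, then let the quantized model attend to that intact KV cache. This is a minimal sketch under assumptions, not the released implementation; the checkpoint name, `num_pivot`, and loading `quant_model` from the same checkpoint are illustrative stand-ins (in practice `quant_model` would come from a weight-quantization pipeline such as AWQ or GPTQ).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any causal LM with a KV cache works the same way.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Full-precision model, used only to produce the lossless KV cache of the pivot tokens.
fp_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# In practice this would be a weight-quantized copy (e.g., from AWQ/GPTQ/OmniQuant);
# the same checkpoint is loaded here just to keep the sketch self-contained.
quant_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

prompt = "Explain why the first tokens attract most of the attention scores."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids  # [BOS] is prepended by the tokenizer

num_pivot = 1  # e.g., just the [BOS] token; larger for models with a fixed system prompt

with torch.no_grad():
    # 1) Pivot tokens go through the full-precision model -> intact (lossless) KV cache.
    intact_kv = fp_model(input_ids[:, :num_pivot], use_cache=True).past_key_values

    # 2) Remaining tokens go through the quantized model, reusing the intact KV cache,
    #    so attention over the pivot tokens is free of quantization error.
    outputs = quant_model(
        input_ids[:, num_pivot:],
        past_key_values=intact_kv,
        use_cache=True,
    )
```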
Authors: Ruikang Liu, Haoli Bai, Haokun Lin, Yuening Li, Han Gao, Zhengzhuo Xu, Lu Hou, Jun Yao, Chun Yuan