APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models (2402.14866v2)
Abstract: Large language models (LLMs) have greatly advanced the natural language processing paradigm. However, their high computational load and huge model sizes pose a grand challenge for deployment on edge devices. To this end, we propose APTQ (Attention-aware Post-Training Mixed-Precision Quantization) for LLMs, which considers not only the second-order information of each layer's weights, but also, for the first time, the nonlinear effect of attention outputs on the entire model. We leverage the Hessian trace as a sensitivity metric for mixed-precision quantization, ensuring an informed precision reduction that retains model performance. Experiments show that APTQ surpasses previous quantization methods, achieving a perplexity of 5.22 on the C4 dataset at an average bitwidth of 4, nearly equivalent to full precision. In addition, APTQ attains state-of-the-art zero-shot accuracy of 68.24\% and 70.48\% at an average bitwidth of 3.8 on LLaMa-7B and LLaMa-13B, respectively, demonstrating its effectiveness in producing high-quality quantized LLMs.
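To make the Hessian-trace sensitivity idea concrete, below is a minimal, hedged sketch (not the authors' code) of how per-layer traces can be estimated with Hutchinson's estimator and then used for a simple mixed-precision bit assignment. The toy model, MSE loss, normalization by parameter count, and the greedy 4-bit/2-bit split are illustrative assumptions; APTQ's actual method additionally folds in the attention-output gradients, which is not shown here.

```python
# Illustrative sketch only: Hessian-trace sensitivity scores via Hutchinson's
# estimator, followed by a simple greedy mixed-precision bit allocation.
# The model, loss, and 4/2-bit split are placeholder assumptions.
import torch
import torch.nn as nn

def hessian_trace(loss, params, n_samples=8):
    """Estimate tr(H) of `loss` w.r.t. `params` with Hutchinson's estimator."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    trace_est = 0.0
    for _ in range(n_samples):
        # Rademacher probe vectors (+1/-1), one per parameter tensor.
        vs = [torch.randn_like(p).sign() for p in params]
        # Hessian-vector product: d/dw (grad . v)
        hv = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        trace_est += sum((h * v).sum().item() for h, v in zip(hv, vs))
    return trace_est / n_samples

# Toy model standing in for a stack of transformer sub-layers.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
x, y = torch.randn(32, 16), torch.randn(32, 16)
loss = nn.functional.mse_loss(model(x), y)

# Sensitivity score per layer: Hessian trace normalized by parameter count.
scores = {}
for name, module in model.named_children():
    params = [p for p in module.parameters() if p.requires_grad]
    if params:
        scores[name] = hessian_trace(loss, params, n_samples=4) / sum(p.numel() for p in params)

# Greedy allocation: the more sensitive half of layers keeps 4 bits, the rest
# drop to 2, yielding a ~3-bit average (a finer split would target 3.8 bits).
ranked = sorted(scores, key=scores.get, reverse=True)
bits = {name: (4 if i < len(ranked) // 2 else 2) for i, name in enumerate(ranked)}
print(bits)
```

In practice, the sensitivity ranking would be computed on calibration data from the target corpus, and the bit budget would be chosen to hit the desired average bitwidth before running the per-layer quantization itself.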
- Ziyi Guan
- Hantao Huang
- Yupeng Su
- Hong Huang
- Ngai Wong
- Hao Yu