CBQ: Cross-Block Quantization for Large Language Models (2312.07950v4)
Abstract: Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) at ultra-low cost. However, existing PTQ methods focus only on handling outliers within a single layer or block, ignoring the dependency between blocks and leading to severe performance degradation in low-bit settings. In this paper, we propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ captures cross-block dependency through a homologous reconstruction scheme, establishing long-range dependencies across multiple blocks to minimize error accumulation. Furthermore, CBQ incorporates a coarse-to-fine preprocessing (CFP) strategy to suppress weight and activation outliers, coupled with an adaptive LoRA-Rounding technique for precise weight quantization. Together, these components allow CBQ not only to handle extreme outliers effectively but also to improve overall quantization accuracy. Extensive experiments show that CBQ achieves superior low-bit quantization (W4A4, W4A8, W2A16) and outperforms existing state-of-the-art methods across various LLMs and datasets. Notably, CBQ quantizes the 4-bit LLAMA1-65B model in only 4.3 hours on a single GPU, achieving a commendable tradeoff between performance and quantization efficiency.
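The cross-block reconstruction idea described in the abstract can be pictured as tuning the quantization parameters of several consecutive transformer blocks jointly against the full-precision output of that same window, rather than calibrating one block at a time. The sketch below illustrates this under assumed names (`fp_blocks`, `q_blocks`, `calib_inputs`) and a plain MSE objective; it is a minimal illustration, not the authors' implementation, which additionally involves the CFP outlier preprocessing and LoRA-Rounding components.

```python
import torch
import torch.nn.functional as F

def cross_block_reconstruction(fp_blocks, q_blocks, calib_inputs,
                               num_steps=100, lr=1e-3):
    """Hypothetical sketch: jointly optimize the learnable quantization
    parameters of a window of consecutive blocks so that the window's
    quantized output matches the full-precision output."""
    # Collect all trainable quantization parameters in the window.
    params = [p for blk in q_blocks for p in blk.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)

    for _ in range(num_steps):
        for x in calib_inputs:              # calibration activations feeding the window
            with torch.no_grad():           # full-precision reference through the whole window
                ref = x
                for blk in fp_blocks:
                    ref = blk(ref)
            out = x                         # quantized forward pass through the same window
            for blk in q_blocks:
                out = blk(out)
            loss = F.mse_loss(out, ref)     # reconstruction error over the entire window
            opt.zero_grad()
            loss.backward()
            opt.step()
    return q_blocks
```

Optimizing over a multi-block window in this way lets the calibration objective account for how quantization error propagates across blocks, which per-block reconstruction cannot capture.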