QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks (2402.04396v2)
Abstract: Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing their weights to low precision. In this work, we introduce QuIP#, a weight-only PTQ method that achieves state-of-the-art results in extreme compression regimes ($\le$ 4 bits per weight) using three novel techniques. First, QuIP# improves QuIP's (Chee et al., 2023) incoherence processing by using the randomized Hadamard transform, which is faster and has better theoretical properties. Second, QuIP# uses vector quantization to take advantage of the ball-shaped sub-Gaussian distribution that incoherent weights possess: specifically, we introduce a set of hardware-efficient codebooks based on the highly symmetric $E_8$ lattice, which achieves the optimal 8-dimensional unit-ball packing. Third, QuIP# uses fine-tuning to improve fidelity to the original model. Our experiments show that QuIP# outperforms existing PTQ methods, enables new behaviors in PTQ scaling, and supports fast inference. Our code can be found at https://github.com/Cornell-RelaxML/quip-sharp.
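To make the first two techniques concrete, here is a minimal NumPy sketch of (a) two-sided randomized Hadamard incoherence processing and (b) nearest-point rounding in the $E_8$ lattice via its $D_8 \cup (D_8 + \tfrac{1}{2})$ coset structure. Everything below is illustrative: the function names (`randomized_hadamard_transform`, `nearest_d8`, `nearest_e8`), the power-of-two shape assumption, and the dense Hadamard matmuls are our own simplifications, not QuIP#'s implementation, which uses fast Walsh-Hadamard kernels and the packed $E_8$-based codebook described in the paper.

```python
import numpy as np
from scipy.linalg import hadamard  # Sylvester-construction Hadamard matrix


def randomized_hadamard_transform(W, seed=0):
    """Two-sided RHT: W_hat = (H_m S_u) W (H_n S_v)^T.

    Both factors are orthogonal, so the transform is exactly invertible
    and preserves the Frobenius norm, while the entries of W_hat become
    approximately i.i.d. Gaussian ("incoherent"). Dense matmuls are used
    here for clarity; a fast Walsh-Hadamard transform applies H in
    O(n log n) per vector.
    """
    m, n = W.shape  # assumed powers of two for scipy's hadamard()
    rng = np.random.default_rng(seed)
    s_u = rng.choice([-1.0, 1.0], size=m)  # random sign vectors
    s_v = rng.choice([-1.0, 1.0], size=n)
    U = hadamard(m) / np.sqrt(m) * s_u  # = H_m @ diag(s_u), orthogonal
    V = hadamard(n) / np.sqrt(n) * s_v  # = H_n @ diag(s_v), orthogonal
    return U @ W @ V.T


def nearest_d8(x):
    """Nearest point of D8 = {z in Z^8 : sum(z) even} (Conway-Sloane)."""
    z = np.rint(x)
    if z.sum() % 2 != 0:
        # Parity is odd: re-round the coordinate with the largest rounding
        # error in the opposite direction; this is the cheapest parity fix.
        i = np.argmax(np.abs(x - z))
        z[i] += 1.0 if x[i] >= z[i] else -1.0
    return z


def nearest_e8(x):
    """Nearest point of E8 = D8 ∪ (D8 + 1/2): try both cosets, keep closer."""
    a = nearest_d8(x)
    b = nearest_d8(x - 0.5) + 0.5
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b


# Usage: incoherence-process a weight matrix, then round 8-dim blocks to E8
# (a real codebook would also scale and restrict to a finite subset of E8).
W = np.random.default_rng(1).normal(size=(64, 128))
W_hat = randomized_hadamard_transform(W)
assert np.isclose(np.linalg.norm(W_hat), np.linalg.norm(W))  # norm preserved
q = np.apply_along_axis(nearest_e8, 1, W_hat[:, :8])  # one 8-dim block per row
```

Because the sign vectors are generated from a seed, the orthogonal transform can be regenerated and undone exactly at inference time, and the closer-coset check in `nearest_e8` is the standard Conway-Sloane decoder for $E_8$.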
- Almazrouei, E. et al. The Falcon series of open language models, 2023.
- Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J. D., Chen, D., and Dao, T. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024.
- Chee, J., Renz, M., Damle, A., and De Sa, C. Model preserving compression for neural networks. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=gt-l9Hu2ndd.
- Chee, J., Cai, Y., Kuleshov, V., and De Sa, C. QuIP: 2-bit quantization of large language models with guarantees. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=xrk9g5vcXR.
- Cochran, W. T., Cooley, J. W., et al. What is the fast Fourier transform? Proceedings of the IEEE, 55(10):1664–1674, 1967. doi: 10.1109/PROC.1967.5957.
- Computer, T. RedPajama: An open source recipe to reproduce LLaMA training dataset, 2023. URL https://github.com/togethercomputer/RedPajama-Data.
- Dao, T. FlashAttention-2: Faster attention with better parallelism and work partitioning, 2023.
- Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.
- Dettmers, T. and Zettlemoyer, L. The case for 4-bit precision: k-bit inference scaling laws. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 7750–7774. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/dettmers23a.html.
- Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., and Alistarh, D. Extreme compression of large language models via additive quantization, 2024.
- Fino, B. J. and Algazi, V. R. Unified matrix treatment of the fast Walsh-Hadamard transform. IEEE Transactions on Computers, C-25(11):1142–1146, 1976. doi: 10.1109/TC.1976.1674569.
- Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. OPTQ: Accurate quantization for generative pre-trained transformers. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=tcbBPnfwxS.
- Gao, L., Tow, J., et al. A framework for few-shot language model evaluation, December 2023. URL https://zenodo.org/records/10256836.
- Gray, R. Vector quantization. IEEE ASSP Magazine, 1(2):4–29, 1984. doi: 10.1109/MASSP.1984.1162229.
- Halko, N., Martinsson, P.-G., and Tropp, J. A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
- Hedayat, A. and Wallis, W. D. Hadamard matrices and their applications. The Annals of Statistics, 6(6):1184–1238, 1978. doi: 10.1214/aos/1176344370. URL https://doi.org/10.1214/aos/1176344370.
- Jiang, A. Q. et al. Mixtral of experts, 2024.
- Juang, B.-H. and Gray, A. Multiple stage vector quantization for speech coding. In ICASSP ’82. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 7, pp. 597–600, 1982. doi: 10.1109/ICASSP.1982.1171604.
- Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2017.
- Lin, J. et al. AWQ: Activation-aware weight quantization for LLM compression and acceleration, 2023.
- Lloyd, S. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982. doi: 10.1109/TIT.1982.1056489.
- Nagel, M., Amjad, R. A., van Baalen, M., Louizos, C., and Blankevoort, T. Up or down? Adaptive rounding for post-training quantization. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 7197–7206. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/nagel20a.html.
- Nagel, M., Fournarakis, M., Bondarenko, Y., and Blankevoort, T. Overcoming oscillations in quantization-aware training. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 16318–16330. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/nagel22a.html.
- Nguyen, E., Poli, M., et al. HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution, 2023.
- Rozière, B. et al. Code Llama: Open foundation models for code, 2024.
- Shao, W. et al. OmniQuant: Omnidirectionally calibrated quantization for large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=8Wuvhh0LYW.
- Sloane, N. Hadamard matrices. http://neilsloane.com/hadamard/. [Accessed 02-02-2024].
- Sun, M., Liu, Z., Bair, A., and Kolter, J. Z. A simple and effective pruning approach for large language models. In Workshop on Efficient Systems for Foundation Models @ ICML2023, 2023. URL https://openreview.net/forum?id=tz9JV2PRSv.
- Touvron, H. et al. LLaMA: Open and efficient foundation language models, 2023a.
- Touvron, H. et al. Llama 2: Open foundation and fine-tuned chat models, 2023b.
- Viazovska, M. The sphere packing problem in dimension 8. Annals of Mathematics, 185(3), May 2017. ISSN 0003-486X. doi: 10.4007/annals.2017.185.3.7. URL http://dx.doi.org/10.4007/annals.2017.185.3.7.