Norm Tweaking: High-performance Low-bit Quantization of Large Language Models (2309.02784v2)
Abstract: As the size of LLMs continues to grow, model compression without sacrificing accuracy has become a crucial challenge for deployment. While some quantization methods, such as GPTQ, have made progress in achieving acceptable 4-bit weight-only quantization, attempts at lower-bit quantization often result in severe performance degradation. In this paper, we introduce a technique called norm tweaking, which can be used as a plugin for current post-training quantization (PTQ) methods to achieve high precision while remaining cost-efficient. Our approach is inspired by the observation that rectifying the quantized activation distribution to match its float counterpart can readily restore accuracy for LLMs. To achieve this, we carefully design a tweaking strategy that includes calibration data generation and a channel-wise distance constraint to update the weights of normalization layers for better generalization. We conduct extensive experiments on various datasets using several open-source LLMs. Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations, surpassing existing PTQ methods. On GLM-130B and OPT-66B, it even achieves accuracy at 2-bit quantization on par with their float counterparts. The simplicity and effectiveness of our approach make it practical for real-world applications.
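To make the idea concrete, here is a minimal sketch of a norm-tweaking-style procedure on a single toy block: the linear weights are quantized and frozen, and only the LayerNorm affine parameters are updated so the quantized block's activation distribution matches the float block's, channel by channel, on calibration data. The toy block, the round-to-nearest quantizer, the mean/std channel-wise distance, and all hyperparameters below are illustrative assumptions, not the paper's actual implementation, which operates on full LLM layers with generated calibration text.

```python
# Minimal sketch of the norm-tweaking idea: quantize weights, then tune only
# the normalization layer's gamma/beta to match the float activation
# distribution. Block structure, quantizer, and distance are assumptions.
import torch
import torch.nn as nn


def quantize_weight_rtn(w: torch.Tensor, n_bits: int = 3) -> torch.Tensor:
    """Per-output-channel symmetric round-to-nearest quantization (assumed baseline PTQ)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale


class ToyBlock(nn.Module):
    """LayerNorm followed by a linear projection, standing in for one transformer sub-block."""
    def __init__(self, d: int):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.fc = nn.Linear(d, d, bias=False)

    def forward(self, x):
        return self.fc(self.norm(x))


def channelwise_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Match per-channel mean and std of two activation distributions (assumed distance)."""
    return ((a.mean(dim=0) - b.mean(dim=0)) ** 2
            + (a.std(dim=0) - b.std(dim=0)) ** 2).sum()


def norm_tweak(float_block: ToyBlock, calib: torch.Tensor,
               n_bits: int = 3, steps: int = 100, lr: float = 1e-3) -> ToyBlock:
    """Quantize the block's weights, then tune only the LayerNorm parameters on calibration data."""
    quant_block = ToyBlock(float_block.norm.normalized_shape[0])
    quant_block.load_state_dict(float_block.state_dict())
    with torch.no_grad():
        quant_block.fc.weight.copy_(quantize_weight_rtn(quant_block.fc.weight, n_bits))

    quant_block.fc.weight.requires_grad_(False)       # weights stay quantized and frozen
    opt = torch.optim.Adam(quant_block.norm.parameters(), lr=lr)  # only gamma/beta are tweaked

    with torch.no_grad():
        target = float_block(calib)                   # float activations as the reference

    for _ in range(steps):
        opt.zero_grad()
        loss = channelwise_distance(quant_block(calib), target)
        loss.backward()
        opt.step()
    return quant_block


if __name__ == "__main__":
    torch.manual_seed(0)
    d = 64
    fp_block = ToyBlock(d)
    calib_data = torch.randn(256, d)                  # stand-in for generated calibration features
    tweaked = norm_tweak(fp_block, calib_data)
```

Because only the normalization parameters receive gradients, a loop like this touches a tiny fraction of the model's weights, which is what keeps such a plugin far cheaper than full quantization-aware training.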
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. arXiv:2207.00032.
- PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 7432–7439.
- Language models are few-shot learners. In Conference on Neural Information Processing Systems (NeurIPS).
- PaLM: Scaling language modeling with pathways. arXiv:2204.02311.
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135.
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv:2208.07339.
- GLM: General language model pretraining with autoregressive blank infilling. arXiv:2103.10360.
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. arXiv:2301.00774.
- GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv:2210.17323.
- A framework for few-shot language model evaluation.
- Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks.
- Distilling the Knowledge in a Neural Network. In NIPS Deep Learning and Representation Learning Workshop.
- Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
- The BigScience ROOTS Corpus: A 1.6 TB Composite Multilingual Dataset.
- LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, 20336–20350. PMLR.
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978.
- LLM-QAT: Data-Free Quantization Aware Training for Large Language Models. arXiv:2305.17888.
- LLM-Pruner: On the Structural Pruning of Large Language Models. arXiv:2305.11627.
- The Penn Treebank: Annotating predicate argument structure. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8–11, 1994.
- Pointer sentinel mixture models. arXiv:1609.07843.
- Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv:1809.02789.
- NVIDIA. 2023. FasterTransformer.
- OpenAI. 2023a. GPT-4 Technical Report. arXiv:2303.08774.
- OpenAI. 2023b. Introducing ChatGPT.
- The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv:1606.06031.
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140): 1–67.
- WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9): 99–106.
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. arXiv:2303.06865.
- LLaMA: Open and efficient foundation language models. arXiv:2302.13971.
- Attention is all you need. In Conference on Neural Information Processing Systems (NeurIPS).
- SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, 38087–38099. PMLR.
- Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt. arXiv:2305.11186.
- ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. arXiv:2206.01861.
- ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation. arXiv:2303.08302.
- RPTQ: Reorder-based Post-training Quantization for Large Language Models. arXiv:2304.01089.
- HellaSwag: Can a machine really finish your sentence? arXiv:1905.07830.
- Root Mean Square Layer Normalization. arXiv:1910.07467.
- Lifting the Curse of Capacity Gap in Distilling Language Models. arXiv:2305.12129.
- OPT: Open pre-trained transformer language models. arXiv:2205.01068.
- ERNIE: Enhanced language representation with informative entities. arXiv:1905.07129.
Authors: Liang Li, Qingyuan Li, Bo Zhang, Xiangxiang Chu