Norm Tweaking: High-performance Low-bit Quantization of Large Language Models (2309.02784v2)

Published 6 Sep 2023 in cs.LG, cs.AI, and cs.CL

Abstract: As the size of LLMs continues to grow, model compression without sacrificing accuracy has become a crucial challenge for deployment. While some quantization methods, such as GPTQ, have made progress in achieving acceptable 4-bit weight-only quantization, attempts at lower-bit quantization often result in severe performance degradation. In this paper, we introduce a technique called norm tweaking, which can be used as a plugin for current PTQ (post-training quantization) methods to achieve high precision while being cost-efficient. Our approach is inspired by the observation that rectifying the quantized activation distribution to match its float counterpart can readily restore accuracy for LLMs. To achieve this, we carefully design a tweaking strategy that includes calibration data generation and a channel-wise distance constraint to update the weights of normalization layers for better generalization. We conduct extensive experiments on various datasets using several open-source LLMs. Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations, surpassing existing PTQ methods. On GLM-130B and OPT-66B, our method even achieves the same level of accuracy at 2-bit quantization as their float counterparts. Our simple and effective approach makes low-bit quantization more practical for real-world applications.
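
The core idea described above can be illustrated with a minimal sketch. The snippet below assumes a PyTorch setting in which `float_layer` and `quant_layer` are matching Transformer blocks (the latter holding fake-quantized, i.e. dequantized, weights so gradients can flow) and `calib_inputs` is a list of generated calibration hidden states; the function and argument names are illustrative and do not reflect the authors' implementation. Only the normalization-layer parameters are updated, driven by a channel-wise distance between the quantized and float activation statistics.

```python
import torch
import torch.nn as nn

def norm_tweak(float_layer, quant_layer, calib_inputs, iters=100, lr=1e-5):
    """Sketch of norm tweaking for a single Transformer block.

    Freezes all weights of the quantized block except its normalization
    layers, then nudges those parameters so the block's output activation
    statistics match the float block's, channel by channel.
    """
    # Freeze everything, then re-enable gradients only for norm layers.
    for p in quant_layer.parameters():
        p.requires_grad_(False)
    norm_params = []
    for m in quant_layer.modules():
        # Models such as LLaMA use RMSNorm; that module type would be
        # matched here as well in a fuller implementation.
        if isinstance(m, nn.LayerNorm):
            for p in m.parameters():
                p.requires_grad_(True)
                norm_params.append(p)

    opt = torch.optim.Adam(norm_params, lr=lr)

    for _ in range(iters):
        for x in calib_inputs:           # x: (batch, seq_len, hidden_dim)
            with torch.no_grad():
                target = float_layer(x)  # float activation reference
            out = quant_layer(x)
            # Channel-wise distance constraint: match the per-channel
            # mean and standard deviation of the output activations.
            loss = ((out.mean(dim=(0, 1)) - target.mean(dim=(0, 1))) ** 2
                    + (out.std(dim=(0, 1)) - target.std(dim=(0, 1))) ** 2).sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return quant_layer
```

In practice such a routine would be applied block by block after quantizing each layer, keeping the per-block cost far below that of full quantization-aware training, which is the cost-efficiency argument made in the abstract.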

References (40)
  1. DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. arXiv:2207.00032.
  2. PIQA: Reasoning about Physical Commonsense in Natural Language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 7432–7439.
  3. Language Models are Few-Shot Learners. In Conference on Neural Information Processing Systems (NeurIPS).
  4. PaLM: Scaling Language Modeling with Pathways. arXiv:2204.02311.
  5. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135.
  6. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv:2208.07339.
  7. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. arXiv:2103.10360.
  8. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. arXiv:2301.00774.
  9. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323.
  10. A Framework for Few-Shot Language Model Evaluation.
  11. Optimal Brain Surgeon and General Network Pruning. In IEEE International Conference on Neural Networks.
  12. Distilling the Knowledge in a Neural Network. In NIPS Deep Learning and Representation Learning Workshop.
  13. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR).
  14. The BigScience ROOTS Corpus: A 1.6 TB Composite Multilingual Dataset.
  15. LoSparse: Structured Compression of Large Language Models Based on Low-Rank and Sparse Approximation. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, 20336–20350. PMLR.
  16. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978.
  17. LLM-QAT: Data-Free Quantization Aware Training for Large Language Models. arXiv:2305.17888.
  18. LLM-Pruner: On the Structural Pruning of Large Language Models. arXiv:2305.11627.
  19. The Penn Treebank: Annotating Predicate Argument Structure. In Human Language Technology: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 8–11, 1994.
  20. Pointer Sentinel Mixture Models. arXiv:1609.07843.
  21. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. arXiv:1809.02789.
  22. NVIDIA. 2023. FasterTransformer.
  23. OpenAI. 2023a. GPT-4 Technical Report. arXiv:2303.08774.
  24. OpenAI. 2023b. Introducing ChatGPT.
  25. The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context. arXiv:1606.06031.
  26. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140): 1–67.
  27. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. Communications of the ACM, 64(9): 99–106.
  28. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. arXiv:2303.06865.
  29. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.
  30. Attention Is All You Need. In Conference on Neural Information Processing Systems (NeurIPS).
  31. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. In International Conference on Machine Learning, 38087–38099. PMLR.
  32. Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt. arXiv:2305.11186.
  33. ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. arXiv:2206.01861.
  34. ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation. arXiv:2303.08302.
  35. RPTQ: Reorder-based Post-training Quantization for Large Language Models. arXiv:2304.01089.
  36. HellaSwag: Can a Machine Really Finish Your Sentence? arXiv:1905.07830.
  37. Root Mean Square Layer Normalization. arXiv:1910.07467.
  38. Lifting the Curse of Capacity Gap in Distilling Language Models. arXiv:2305.12129.
  39. OPT: Open Pre-trained Transformer Language Models. arXiv:2205.01068.
  40. ERNIE: Enhanced Language Representation with Informative Entities. arXiv:1905.07129.
Authors (4)
  1. Liang Li (297 papers)
  2. Qingyuan Li (11 papers)
  3. Bo Zhang (633 papers)
  4. Xiangxiang Chu (62 papers)
Citations (24)