Norm Tweaking: High-performance Low-bit Quantization of Large Language Models (2309.02784v2)
Abstract: As the size of LLMs continues to grow, model compression without sacrificing accuracy has become a crucial challenge for deployment. While some quantization methods, such as GPTQ, have made progress in achieving acceptable 4-bit weight-only quantization, attempts at lower-bit quantization often result in severe performance degradation. In this paper, we introduce a technique called norm tweaking, which can be used as a plugin for current post-training quantization (PTQ) methods to achieve high precision while remaining cost-efficient. Our approach is inspired by the observation that rectifying the quantized activation distribution to match its float counterpart can readily restore accuracy for LLMs. To achieve this, we carefully design a tweaking strategy that includes calibration data generation and a channel-wise distance constraint to update the weights of normalization layers for better generalization. We conduct extensive experiments on various datasets using several open-source LLMs. Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations, surpassing existing PTQ methods. On GLM-130B and OPT-66B, it even achieves accuracy at 2-bit quantization on par with their float counterparts. The simplicity and effectiveness of our approach make it practical for real-world applications.
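To make the idea concrete, here is a minimal sketch of a norm-tweaking-style procedure on a single toy block: the linear weights are quantized and frozen, and only the LayerNorm affine parameters are updated so the quantized block's activation distribution matches the float block's, channel by channel, on calibration data. The toy block, the round-to-nearest quantizer, the mean/std channel-wise distance, and all hyperparameters below are illustrative assumptions, not the paper's actual implementation, which operates on full LLM layers with generated calibration text.

```python
# Minimal sketch of the norm-tweaking idea: quantize weights, then tune only
# the normalization layer's gamma/beta to match the float activation
# distribution. Block structure, quantizer, and distance are assumptions.
import torch
import torch.nn as nn


def quantize_weight_rtn(w: torch.Tensor, n_bits: int = 3) -> torch.Tensor:
    """Per-output-channel symmetric round-to-nearest quantization (assumed baseline PTQ)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale


class ToyBlock(nn.Module):
    """LayerNorm followed by a linear projection, standing in for one transformer sub-block."""
    def __init__(self, d: int):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.fc = nn.Linear(d, d, bias=False)

    def forward(self, x):
        return self.fc(self.norm(x))


def channelwise_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Match per-channel mean and std of two activation distributions (assumed distance)."""
    return ((a.mean(dim=0) - b.mean(dim=0)) ** 2
            + (a.std(dim=0) - b.std(dim=0)) ** 2).sum()


def norm_tweak(float_block: ToyBlock, calib: torch.Tensor,
               n_bits: int = 3, steps: int = 100, lr: float = 1e-3) -> ToyBlock:
    """Quantize the block's weights, then tune only the LayerNorm parameters on calibration data."""
    quant_block = ToyBlock(float_block.norm.normalized_shape[0])
    quant_block.load_state_dict(float_block.state_dict())
    with torch.no_grad():
        quant_block.fc.weight.copy_(quantize_weight_rtn(quant_block.fc.weight, n_bits))

    quant_block.fc.weight.requires_grad_(False)       # weights stay quantized and frozen
    opt = torch.optim.Adam(quant_block.norm.parameters(), lr=lr)  # only gamma/beta are tweaked

    with torch.no_grad():
        target = float_block(calib)                   # float activations as the reference

    for _ in range(steps):
        opt.zero_grad()
        loss = channelwise_distance(quant_block(calib), target)
        loss.backward()
        opt.step()
    return quant_block


if __name__ == "__main__":
    torch.manual_seed(0)
    d = 64
    fp_block = ToyBlock(d)
    calib_data = torch.randn(256, d)                  # stand-in for generated calibration features
    tweaked = norm_tweak(fp_block, calib_data)
```

Because only the normalization parameters receive gradients, a loop like this touches a tiny fraction of the model's weights, which is what keeps such a plugin far cheaper than full quantization-aware training.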
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. arXiv:2207.00032.
- PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 7432–7439.
- Language models are few-shot learners. In Conference on Neural Information Processing Systems (NeurIPS).
- PaLM: Scaling language modeling with pathways. arXiv:2204.02311.
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135.
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv:2208.07339.
- GLM: General language model pretraining with autoregressive blank infilling. arXiv:2103.10360.
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. arXiv:2301.00774.
- GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv:2210.17323.
- A framework for few-shot language model evaluation.
- Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks.
- Distilling the Knowledge in a Neural Network. In NIPS Deep Learning and Representation Learning Workshop.
- Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
- The BigScience ROOTS Corpus: A 1.6 TB Composite Multilingual Dataset.
- LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, 20336–20350. PMLR.
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978.
- LLM-QAT: Data-Free Quantization Aware Training for Large Language Models. arXiv:2305.17888.
- LLM-Pruner: On the Structural Pruning of Large Language Models. arXiv:2305.11627.
- The Penn Treebank: Annotating predicate argument structure. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8–11, 1994.
- Pointer sentinel mixture models. arXiv:1609.07843.
- Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv:1809.02789.
- NVIDIA. 2023. FasterTransformer.
- OpenAI. 2023a. GPT-4 Technical Report. arXiv:2303.08774.
- OpenAI. 2023b. Introducing ChatGPT.
- The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv:1606.06031.
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140): 1–67.
- WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9): 99–106.
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. arXiv:2303.06865.
- LLaMA: Open and efficient foundation language models. arXiv:2302.13971.
- Attention is all you need. In Conference on Neural Information Processing Systems (NeurIPS).
- SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, 38087–38099. PMLR.
- Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt. arXiv:2305.11186.
- ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. arXiv:2206.01861.
- ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation. arXiv:2303.08302.
- RPTQ: Reorder-based Post-training Quantization for Large Language Models. arXiv:2304.01089.
- HellaSwag: Can a machine really finish your sentence? arXiv:1905.07830.
- Root Mean Square Layer Normalization. arXiv:1910.07467.
- Lifting the Curse of Capacity Gap in Distilling Language Models. arXiv:2305.12129.
- OPT: Open pre-trained transformer language models. arXiv:2205.01068.
- ERNIE: Enhanced language representation with informative entities. arXiv:1905.07129.
Authors: Liang Li, Qingyuan Li, Bo Zhang, Xiangxiang Chu