EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
Abstract: LLMs have proven far superior to conventional methods on a wide range of tasks. However, their expensive computation and high memory requirements are prohibitive for deployment. Model quantization is an effective way to reduce this overhead. The problem is that in most previous works the quantized model is calibrated on a few samples from the training data, which may hurt the generalization of the quantized LLM to unseen cases and tasks. Hence, in this work we explore an important question: can we design a data-independent quantization method for LLMs that guarantees their generalization performance? We propose EasyQuant, a training-free and data-independent weight-only quantization algorithm for LLMs. Our observation is that two factors, outliers in the weights and the quantization range, are essential for reducing the quantization error. Therefore, EasyQuant leaves the outliers (less than 1% of the weights) unchanged and optimizes the quantization range to reduce the reconstruction error. With these techniques, we surprisingly find that EasyQuant achieves performance comparable to the original model. Since EasyQuant does not depend on any training data, the generalization performance of the quantized LLM is safely guaranteed. Moreover, EasyQuant can be implemented in parallel, so a quantized model can be obtained in a few minutes even for LLMs with over 100B parameters. To the best of our knowledge, this is the first work to achieve almost lossless quantization for LLMs in a data-independent setting, and our algorithm runs more than 10 times faster than data-dependent methods.
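To make the two ingredients in the abstract concrete, the sketch below shows one plausible way to combine them: the largest-magnitude weights (about 1% per channel) are kept in full precision, and the per-channel quantization range is chosen purely by minimizing the weight reconstruction error, so no calibration data is involved. This is a minimal NumPy sketch under assumed choices (4-bit symmetric per-channel quantization, a 1% outlier ratio, and a simple grid search over the clipping range); the function names, hyperparameters, and the grid search are illustrative stand-ins, not the authors' implementation.

```python
# Minimal sketch (not the authors' implementation) of data-free weight-only
# quantization with outlier isolation and quantization-range search.
# Assumptions: per-output-channel symmetric quantization, 4-bit weights,
# a 1% outlier ratio, and a 100-point grid over the clipping range.
import numpy as np

def quantize_channel(w, n_bits=4, outlier_ratio=0.01, n_grid=100):
    """Quantize one weight row, keeping ~outlier_ratio of entries in FP."""
    qmax = 2 ** (n_bits - 1) - 1                      # e.g. 7 for INT4

    # 1) Isolate outliers: the largest-magnitude weights stay unquantized.
    k = max(1, int(round(outlier_ratio * w.size)))
    outlier_idx = np.argsort(np.abs(w))[-k:]
    inlier_mask = np.ones_like(w, dtype=bool)
    inlier_mask[outlier_idx] = False
    w_in = w[inlier_mask]

    # 2) Choose the quantization range (scale) that minimizes the weight
    #    reconstruction error -- no calibration data is needed.
    best_scale, best_err = None, np.inf
    max_abs = np.abs(w_in).max()
    for frac in np.linspace(0.5, 1.0, n_grid):        # candidate clip ranges
        scale = frac * max_abs / qmax
        q = np.clip(np.round(w_in / scale), -qmax - 1, qmax)
        err = np.sum((q * scale - w_in) ** 2)
        if err < best_err:
            best_scale, best_err = scale, err

    # 3) Reconstruct: dequantized inliers plus full-precision outliers.
    w_hat = w.copy()
    q = np.clip(np.round(w_in / best_scale), -qmax - 1, qmax)
    w_hat[inlier_mask] = q * best_scale
    return w_hat, best_scale, outlier_idx

# Usage: quantize each output channel of a weight matrix independently.
W = np.random.randn(256, 1024).astype(np.float32)
W_hat = np.stack([quantize_channel(row)[0] for row in W])
print("relative reconstruction error:",
      np.linalg.norm(W_hat - W) / np.linalg.norm(W))
```

Because each channel (and each layer) is processed independently, this per-channel loop is embarrassingly parallel, which is consistent with the abstract's claim that the whole procedure can finish in minutes even for very large models.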