ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks
Abstract: This study examines 4-bit quantization methods such as GPTQ in LLMs, highlighting GPTQ's tendency to overfit and its limited gains on zero-shot tasks. While prior work focuses mainly on zero-shot evaluation, we extend the task scope to more generative categories such as code generation and abstractive summarization, where we find that INT4 quantization can significantly underperform. However, simply shifting to a higher-precision format like FP6 has been particularly challenging, and thus overlooked, because current AI hardware lacks the sophisticated integration and system acceleration strategies needed for it to perform well. Our results show that FP6, even with a coarse-grain quantization scheme, performs robustly across various algorithms and tasks, demonstrating its superiority in accuracy and versatility. Notably, with FP6 quantization, the StarCoder-15B model performs comparably to its FP16 counterpart on code generation, and smaller models such as a 406M model closely match their baselines on summarization; neither result can be achieved with INT4. To better accommodate various AI hardware and achieve the best system performance, we propose a novel 4+2 bit-layout design for FP6 that achieves latency similar to state-of-the-art INT4 fine-grain quantization. With this design, FP6 can become a promising alternative to the current 4-bit quantization methods used in LLMs.
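The coarse-grain FP6 weight quantization referred to above can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the paper's actual kernel: it assumes a sign/exponent/mantissa FP6 layout (here E3M2 with IEEE-style bias and subnormals) and a simple per-tensor scale, and it says nothing about the 4+2 memory-layout design used for system acceleration.

```python
import numpy as np

def fp6_grid(exp_bits=3, man_bits=2):
    """Enumerate all non-negative values representable in a toy FP6-like
    sign/exponent/mantissa format (no inf/nan encodings), incl. subnormals."""
    bias = 2 ** (exp_bits - 1) - 1
    vals = []
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            if e == 0:  # subnormal: no implicit leading 1
                vals.append((m / 2 ** man_bits) * 2.0 ** (1 - bias))
            else:       # normal: implicit leading 1
                vals.append((1 + m / 2 ** man_bits) * 2.0 ** (e - bias))
    return np.unique(np.array(vals))

def quantize_fp6(w, exp_bits=3, man_bits=2):
    """Simulated coarse-grain (per-tensor) FP6 quantization: scale the tensor
    so its max magnitude hits the top of the FP6 range, then round each
    magnitude to the nearest representable grid point."""
    grid = fp6_grid(exp_bits, man_bits)
    scale = np.max(np.abs(w)) / grid[-1]
    scaled = np.abs(w) / scale
    idx = np.argmin(np.abs(scaled[..., None] - grid), axis=-1)
    return np.sign(w) * grid[idx] * scale
```

With E3M2 the largest representable magnitude is (1 + 3/4) * 2^4 = 28, so values that land exactly on the scaled grid round-trip unchanged, while everything else incurs a relative error bounded by the local grid spacing; fine-grain (per-group) scaling would shrink `scale` per block at extra metadata cost.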