ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks (2312.08583v2)

Published 14 Dec 2023 in cs.CL and stat.ML

Abstract: This study examines 4-bit quantization methods like GPTQ in LLMs, highlighting GPTQ's overfitting and limited enhancement in Zero-Shot tasks. While prior works merely focus on zero-shot measurement, we extend the task scope to more generative categories such as code generation and abstractive summarization, in which we found that INT4 quantization can significantly underperform. However, simply shifting to higher-precision formats like FP6 has been particularly challenging, and thus overlooked, due to poor performance caused by the lack of sophisticated integration and system acceleration strategies on current AI hardware. Our results show that FP6, even with a coarse-grain quantization scheme, performs robustly across various algorithms and tasks, demonstrating its superiority in accuracy and versatility. Notably, with FP6 quantization, the StarCoder-15B model performs comparably to its FP16 counterpart in code generation, and smaller models such as the 406M closely match their baselines in summarization. Neither can be achieved by INT4. To better accommodate various AI hardware and achieve the best system performance, we propose a novel 4+2 design for FP6 that achieves latency similar to the state-of-the-art INT4 fine-grain quantization. With our design, FP6 can become a promising alternative to the current 4-bit quantization methods used in LLMs.

Review of "ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks"

The paper "ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks" presents an in-depth analysis of quantization techniques for LLMs, focusing on 4-bit quantization and the shortcomings of current methods such as GPTQ. The paper introduces FP6 quantization as an alternative and demonstrates its utility on generative tasks such as code generation and abstractive summarization, where INT4 quantization tends to underperform.

The primary contribution of the paper lies in extending the evaluation of quantization methods beyond typical zero-shot tasks to generative tasks, which are critical in real-world applications. The authors identify significant overfitting issues with the GPTQ algorithm, showing empirically across several models that its performance is often tuned too closely to the calibration data. The investigation reveals that while existing methods reduce quantization losses, they do not fully address performance concerns in production settings, especially for smaller models, as reflected in perplexity and accuracy metrics.
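
To make the granularity terms concrete, here is a minimal sketch contrasting coarse-grain (per output channel) and fine-grain (group-wise) symmetric INT4 round-to-nearest quantization. It is an illustrative baseline only, not GPTQ and not the paper's implementation; the function name, group sizes, and random weights are assumptions chosen for the example.

```python
import numpy as np

def int4_rtn(w, group_size=None):
    """Symmetric round-to-nearest INT4 quantize/dequantize.

    group_size=None -> coarse grain: one scale per output channel.
    group_size=g    -> fine grain: one scale per g consecutive weights.
    """
    rows, cols = w.shape
    g = cols if group_size is None else group_size
    wg = w.reshape(rows, cols // g, g)
    scale = np.abs(wg).max(axis=-1, keepdims=True) / 7.0  # map max |w| to INT4 value 7
    q = np.clip(np.round(wg / scale), -8, 7)              # signed 4-bit range [-8, 7]
    return (q * scale).reshape(rows, cols)                # dequantized weights

# Finer granularity (smaller groups) reduces the reconstruction error.
rng = np.random.default_rng(0)
w = rng.standard_normal((8, 512)).astype(np.float32)
for g in (None, 128, 32):
    err = np.abs(int4_rtn(w, g) - w).mean()
    print(f"group_size={g}: mean abs error = {err:.4f}")
```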

A pivotal innovation of this paper is the introduction of an FP6 format, utilizing a novel 4+2 design. This approach exhibits superior accuracy across a spectrum of complex tasks. The results demonstrate that FP6 quantization can reach a performance level on par with FP16 models, eliminating the accuracy gap seen with INT4. For instance, the StarCoder-15B model with FP6 quantization closely mirrors FP16 results in code generation tasks, outperforming INT4 methodologies. Furthermore, the paper discusses the advantages of FP6 over potential alternatives such as FP5, noting its stability and effectiveness.
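
For intuition about what an FP6 weight format looks like, the following sketch enumerates the value grid of a 1-sign/3-exponent/2-mantissa (e3m2) mini-float and rounds per-channel-scaled weights to the nearest representable value. The e3m2 bit split, the coarse-grain scaling rule, and the treatment of the top exponent are assumptions made for this illustration rather than details taken from the paper's kernel.

```python
import numpy as np

def fp6_grid(exp_bits=3, man_bits=2):
    """All values representable by a sign/exponent/mantissa mini-float
    (default e3m2), including subnormals, mirrored for the sign bit."""
    bias = 2 ** (exp_bits - 1) - 1                          # 3 for e3m2 (assumed)
    vals = set()
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            if e == 0:                                      # subnormals: no implicit 1
                vals.add((m / 2 ** man_bits) * 2.0 ** (1 - bias))
            else:                                           # normals: implicit leading 1
                vals.add((1 + m / 2 ** man_bits) * 2.0 ** (e - bias))
    pos = np.array(sorted(vals))
    return np.concatenate([-pos[::-1], pos])

def quantize_fp6(w, grid):
    """Coarse-grain (per output channel) scale, then round to the nearest grid value."""
    scale = np.abs(w).max(axis=1, keepdims=True) / grid.max()
    scaled = w / scale
    idx = np.argmin(np.abs(scaled[..., None] - grid), axis=-1)
    return grid[idx] * scale

grid = fp6_grid()
w = np.random.default_rng(1).standard_normal((4, 64)).astype(np.float32)
w_fp6 = quantize_fp6(w, grid)
print("mean abs reconstruction error:", np.abs(w_fp6 - w).mean())
```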

The paper also explores system-level optimizations to support the FP6 format, proposing a bias shift mechanism that simplifies dequantization processes on GPU hardware. This involves a detailed implementation of the “4+2” bit splitting method to efficiently manage runtime dequantization and reduce latency—a critical consideration for low-precision formats.
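
A minimal host-side sketch of one way such a "4+2" split could work is shown below: the 6-bit weight codes are separated into a 4-bit stream and a 2-bit stream so that each stream stays power-of-two aligned in memory, and the two streams are merged back at dequantization time. The packing order, the helper names, and the NumPy reference code are assumptions for illustration and do not reproduce the paper's CUDA kernel or its bias-shift dequantization.

```python
import numpy as np

def split_4_plus_2(codes):
    """Split 6-bit weight codes into a 4-bit stream (high bits) and a
    2-bit stream (low bits), so each stream is power-of-two aligned."""
    assert codes.max() < 64
    return (codes >> 2) & 0xF, codes & 0x3

def pack_streams(hi4, lo2):
    """Pack two 4-bit values per byte and four 2-bit values per byte."""
    hi4 = hi4.reshape(-1, 2)
    lo2 = lo2.reshape(-1, 4)
    hi_bytes = ((hi4[:, 0] << 4) | hi4[:, 1]).astype(np.uint8)
    lo_bytes = ((lo2[:, 0] << 6) | (lo2[:, 1] << 4)
                | (lo2[:, 2] << 2) | lo2[:, 3]).astype(np.uint8)
    return hi_bytes, lo_bytes

def merge_streams(hi_bytes, lo_bytes):
    """Recover the original 6-bit codes from the two packed streams."""
    hi4 = np.stack([(hi_bytes >> 4) & 0xF, hi_bytes & 0xF], axis=1).reshape(-1)
    lo2 = np.stack([(lo_bytes >> 6) & 0x3, (lo_bytes >> 4) & 0x3,
                    (lo_bytes >> 2) & 0x3, lo_bytes & 0x3], axis=1).reshape(-1)
    return (hi4 << 2) | lo2

# Round-trip check on a multiple-of-4 number of codes.
codes = np.random.default_rng(2).integers(0, 64, size=32, dtype=np.uint8)
hi4, lo2 = split_4_plus_2(codes)
hi_b, lo_b = pack_streams(hi4, lo2)
assert np.array_equal(merge_streams(hi_b, lo_b), codes)
```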

From a practical perspective, this research propels critical advancements in reducing the resource footprint of LLMs while maintaining, or even enhancing, performance. This holds substantial promise for deploying large-scale models in environments constrained by hardware capabilities, potentially broadening the applicability of advanced AI models in diverse scenarios. The emphasis on integrating system optimizations to accommodate the FP6 format underscores the necessity of tailored hardware-software co-design to fully leverage algorithmic enhancements.

In future directions, the paper advocates for a comprehensive evaluation scope that includes a wider range of tasks beyond traditional benchmarks. This aligns with the evolving landscape of LLM applications where generative and sequence tasks bear greater prominence. Additionally, further investigation into other floating-point precisions, such as FP5, and their potential role in efficient model deployment, suggests a fertile area for continued research.

In summary, the findings and methodologies presented in this paper underscore important advancements in quantization techniques, emphasizing the significance of tailored precision formats like FP6 in enhancing the efficiency and applicability of LLMs. The research highlights ongoing challenges and opportunities in the deployment of LLMs, pointing towards a future where these models can operate optimally within the constraints of modern computational environments.

Authors (12)
  1. Xiaoxia Wu
  2. Haojun Xia
  3. Stephen Youn
  4. Zhen Zheng
  5. Shiyang Chen
  6. Arash Bakhtiari
  7. Michael Wyatt
  8. Yuxiong He
  9. Olatunji Ruwase
  10. Leon Song
  11. Zhewei Yao
  12. Reza Yazdani Aminabadi