Minimize Quantization Output Error with Bias Compensation (2404.01892v1)

Published 2 Apr 2024 in cs.CV

Abstract: Quantization is a promising method for reducing the memory usage and computational intensity of Deep Neural Networks (DNNs), but it often leads to significant output error that hinders model deployment. In this paper, we propose Bias Compensation (BC) to minimize the output error, thus realizing ultra-low-precision quantization without model fine-tuning. Instead of optimizing the non-convex quantization process as in most previous methods, the proposed BC bypasses this step and directly minimizes the quantization output error by identifying a bias vector for compensation. We establish that minimizing the output error through BC is a convex problem and provide an efficient strategy for obtaining optimal solutions with minimal output error, without the need for training or fine-tuning. We conduct extensive experiments on Vision Transformer models and LLMs, and the results show that our method notably reduces quantization output error, thereby permitting ultra-low-precision post-training quantization and enhancing the task performance of the models. In particular, BC improves the accuracy of ViT-B with 4-bit PTQ4ViT by 36.89% on the ImageNet-1k task, and decreases the perplexity of OPT-350M with 3-bit GPTQ by 5.97 on WikiText2. The code is available at https://github.com/GongCheng1919/bias-compensation.
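The core idea lends itself to a compact illustration. Below is a minimal sketch, assuming the compensation reduces to fitting a single per-output-channel bias on a small calibration batch: for a fixed quantized layer, minimizing the expected squared difference between the full-precision and quantized outputs over a constant bias vector is a convex least-squares problem, and its closed-form minimizer is the mean residual. The function names and the PyTorch layer interface here are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch

@torch.no_grad()
def compute_bias_compensation(fp_layer, q_layer, calib_inputs):
    """Closed-form bias compensation for one layer (illustrative sketch).

    For a fixed quantized layer, min_b E||y_fp - (y_q + b)||^2 is convex in b,
    and the minimizer is the mean residual between full-precision and
    quantized outputs over the calibration data.
    """
    y_fp = fp_layer(calib_inputs)      # full-precision outputs
    y_q = q_layer(calib_inputs)        # outputs of the quantized layer
    residual = y_fp - y_q              # output error to absorb into the bias
    # Average over batch/token dimensions, keep the output-channel dimension.
    return residual.reshape(-1, residual.shape[-1]).mean(dim=0)

@torch.no_grad()
def apply_bias_compensation(q_layer, bias):
    """Fold the compensation vector into the layer's existing bias term."""
    if q_layer.bias is None:           # assumes an nn.Linear-like layer
        q_layer.bias = torch.nn.Parameter(bias.clone())
    else:
        q_layer.bias.add_(bias)
    return q_layer
```

Applied layer by layer after post-training quantization, this costs one forward pass per layer over the calibration inputs and adds no inference overhead, since the correction merges into the bias term the layer already carries.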

References (43)
  1. Post training 4-bit quantization of convolutional networks for rapid-deployment. In NeurIPS, 2019.
  2. ImageNet: A large-scale hierarchical image database. In IEEE CVPR, 2009.
  3. The case for 4-bit precision: k-bit inference scaling laws. In ICML, 2023.
  4. 8-bit optimizers via block-wise quantization. In ICLR, 2021.
  5. LLM.int8(): 8-bit matrix multiplication for transformers at scale. In NeurIPS, 2022.
  6. QLoRA: Efficient finetuning of quantized LLMs. In NeurIPS, 2023.
  7. Towards accurate post-training quantization for vision transformer. In ACM Multimedia, 2022.
  8. HAWQ: Hessian aware quantization of neural networks with mixed-precision. In IEEE ICCV, 2019.
  9. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
  10. Optimal brain compression: A framework for accurate post-training quantization and pruning. In NeurIPS, 2022.
  11. M-FAC: Efficient matrix-free approximations of second-order information. In NeurIPS, 2021.
  12. OPTQ: Accurate quantization for generative pre-trained transformers. In ICLR, 2023.
  13. VecQ: Minimal loss DNN model compression with vectorized weight quantization. IEEE TOC, 2020.
  14. Elastic significant bit quantization and acceleration for deep neural networks. IEEE TPDS, 2021.
  15. Optimal brain surgeon and general network pruning. In IEEE ICNN, 1993.
  16. LoRA: Low-rank adaptation of large language models. In ICLR, 2021.
  17. Accurate post training quantization with small calibration sets. In ICML, 2021.
  18. Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization. In NeurIPS, 2023.
  19. Low-bit quantization of neural networks for efficient inference. In ICCVW, 2019.
  20. FlexRound: Learnable rounding based on element-wise division for post-training quantization. In ICML, 2023.
  21. BRECQ: Pushing the limit of post-training quantization by block reconstruction. In ICLR, 2020.
  22. RepQ-ViT: Scale reparameterization for post-training quantization of vision transformers. In IEEE ICCV, 2023.
  23. AWQ: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.
  24. NoisyQuant: Noisy bias-enhanced post-training activation quantization for vision transformers. In IEEE CVPR, 2023.
  25. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE ICCV, 2021.
  26. LLM-QAT: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888, 2023.
  27. The Penn Treebank: Annotating predicate argument structure. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, 1994.
  28. Pointer sentinel mixture models. In ICLR, 2016.
  29. Up or down? adaptive rounding for post-training quantization. In ICML, 2020.
  30. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.
  31. Not all bits have equal value: Heterogeneous precisions via trainable noise. In NeurIPS, 2022.
  32. NIPQ: Noise proxy-based integrated pseudo-quantization. In IEEE CVPR, 2023.
  33. Training data-efficient image transformers & distillation through attention. In ICML, 2021.
  34. Towards accurate post-training network quantization via bit-split and stitching. In ICML, 2020.
  35. Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling. arXiv preprint arXiv:2304.09145, 2023.
  36. Wightman, R. PyTorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
  37. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
  38. EasyQuant: Post-training quantization via scale optimization. arXiv preprint arXiv:2006.16669, 2020.
  39. SmoothQuant: Accurate and efficient post-training quantization for large language models. In ICML, 2023.
  40. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. In NeurIPS, 2022.
  41. PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization. In ECCV, 2022.
  42. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
  43. Improving neural network quantization without retraining using outlier channel splitting. In ICML, 2019.
