How to Parameterize Asymmetric Quantization Ranges for Quantization-Aware Training (2404.16898v1)

Published 25 Apr 2024 in cs.LG and cs.AI

Abstract: This paper investigates three different parameterizations of asymmetric uniform quantization for quantization-aware training: (1) scale and offset, (2) minimum and maximum, and (3) beta and gamma. We perform a comprehensive comparative analysis of these parameterizations' influence on quantization-aware training, using both controlled experiments and real-world LLMs. Our particular focus is on their changing behavior in response to critical training hyperparameters, namely bit width and learning rate. Based on our investigation, we propose best practices to stabilize and accelerate quantization-aware training with learnable asymmetric quantization ranges.
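For readers unfamiliar with the parameterizations being compared, the sketch below illustrates asymmetric uniform fake quantization under the first two parameterizations: scale/offset and minimum/maximum (the beta/gamma form is defined in the paper and is not reproduced here). This is a minimal PyTorch illustration with assumed parameter names and an 8-bit default, not the authors' implementation.

```python
import torch

# Illustrative asymmetric uniform fake quantization (8-bit by default).
# Parameter names (scale, offset, x_min, x_max) are assumptions for this
# sketch and do not reproduce the paper's exact formulation.

def fake_quant_scale_offset(x, scale, offset, bits=8):
    # Parameterization (1): learnable scale s and offset (zero-point) z.
    qmin, qmax = 0, 2 ** bits - 1
    q = torch.clamp(torch.round(x / scale) + offset, qmin, qmax)
    return scale * (q - offset)  # dequantize back to the real domain

def fake_quant_min_max(x, x_min, x_max, bits=8):
    # Parameterization (2): learnable range endpoints; scale and offset
    # are derived from (min, max) rather than learned directly.
    qmin, qmax = 0, 2 ** bits - 1
    scale = (x_max - x_min) / (qmax - qmin)
    offset = torch.round(-x_min / scale)
    q = torch.clamp(torch.round(x / scale) + offset, qmin, qmax)
    return scale * (q - offset)

# During QAT, the non-differentiable round() is typically bypassed with a
# straight-through estimator, e.g. x + (torch.round(x) - x).detach().
if __name__ == "__main__":
    x = torch.randn(4)
    print(fake_quant_scale_offset(x, torch.tensor(0.02), torch.tensor(128.0)))
    print(fake_quant_min_max(x, torch.tensor(-1.0), torch.tensor(1.0)))
```

Because the parameterizations describe the same quantization grid through different learnable variables, gradients flow to the range parameters differently, which is the behavior the paper analyzes across bit widths and learning rates.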

