How to Parameterize Asymmetric Quantization Ranges for Quantization-Aware Training (2404.16898v1)
Abstract: This paper investigates three different parameterizations of asymmetric uniform quantization for quantization-aware training: (1) scale and offset, (2) minimum and maximum, and (3) beta and gamma. We perform a comprehensive comparative analysis of these parameterizations' influence on quantization-aware training, using both controlled experiments and real-world large language models (LLMs). Our particular focus is on their changing behavior in response to critical training hyperparameters, namely bit width and learning rate. Based on our investigation, we propose best practices to stabilize and accelerate quantization-aware training with learnable asymmetric quantization ranges.
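To make the three parameterizations concrete, below is a minimal PyTorch sketch of asymmetric uniform fake quantization, assuming a standard unsigned integer grid. The function names (`fake_quant`, `minmax_to_scale_offset`, `beta_gamma_to_minmax`) and the multiplicative beta/gamma mapping are illustrative assumptions, not the paper's exact definitions.

```python
import torch


def fake_quant(x: torch.Tensor, scale: torch.Tensor, offset: torch.Tensor,
               n_bits: int = 4) -> torch.Tensor:
    """Simulated (fake) asymmetric uniform quantization: quantize then dequantize.

    Assumed convention: unsigned grid [0, 2**n_bits - 1], with
    q = clamp(round(x / scale) + offset, 0, 2**n_bits - 1) and
    x_hat = scale * (q - offset). In QAT, round/clamp would be paired with a
    straight-through estimator; that detail is omitted here for brevity.
    """
    qmax = 2 ** n_bits - 1
    q = torch.clamp(torch.round(x / scale) + offset, 0, qmax)
    return scale * (q - offset)


# (1) Scale/offset: `scale` and `offset` are themselves the learnable parameters.

# (2) Min/max: the range endpoints are learnable, and scale/offset are derived.
def minmax_to_scale_offset(x_min: torch.Tensor, x_max: torch.Tensor,
                           n_bits: int = 4):
    qmax = 2 ** n_bits - 1
    scale = (x_max - x_min) / qmax
    offset = -x_min / scale
    return scale, offset


# (3) Beta/gamma: a hypothetical re-parameterization in which the endpoints are
#     learnable multiples of the tensor's observed range (the paper's exact
#     definition may differ).
def beta_gamma_to_minmax(x: torch.Tensor, beta: torch.Tensor, gamma: torch.Tensor):
    return beta * x.detach().min(), gamma * x.detach().max()
```

Under this sketch, switching parameterizations changes only which tensors receive gradients during training (scale/offset directly, the min/max endpoints, or the beta/gamma factors), which is the axis along which the paper compares their behavior.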