QBitOpt: Fast and Accurate Bitwidth Reallocation during Training (2307.04535v1)
Abstract: Quantizing neural networks is one of the most effective methods for achieving efficient inference on mobile and embedded devices. In particular, mixed-precision quantized (MPQ) networks, whose layers can be quantized to different bitwidths, achieve better task performance for the same resource constraint than networks with homogeneous bitwidths. However, finding the optimal bitwidth allocation is a challenging problem, as the search space grows exponentially with the number of layers in the network. In this paper, we propose QBitOpt, a novel algorithm for updating bitwidths during quantization-aware training (QAT). We formulate bitwidth allocation as a constrained optimization problem. By combining fast-to-compute sensitivities with efficient solvers during QAT, QBitOpt produces mixed-precision networks with high task performance that are guaranteed to satisfy strict resource constraints. This contrasts with existing mixed-precision methods, which learn bitwidths using gradients and cannot provide such guarantees. We evaluate QBitOpt on ImageNet and confirm that it outperforms existing fixed- and mixed-precision methods under the average bitwidth constraints commonly found in the literature.
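The abstract outlines the core mechanism: per-layer sensitivities feed a constrained optimization problem whose solution is the bitwidth allocation, re-solved during QAT. The sketch below illustrates what one such allocation step could look like under an average-bitwidth constraint. It is a minimal illustration, not the paper's implementation: the sensitivity-weighted 4^{-b} quantization-noise proxy, the CVXPY-based continuous relaxation, the floor-then-greedy rounding, and all function names are assumptions made for the example.

```python
# Hypothetical sketch of sensitivity-driven bitwidth allocation under an
# average-bitwidth constraint. Objective, solver choice, and rounding scheme
# are illustrative assumptions, not the paper's actual formulation.
import numpy as np
import cvxpy as cp


def allocate_bitwidths(sensitivities, avg_bits=4.0, min_bits=2, max_bits=8):
    """Relax integer bitwidths to reals, solve the convex problem, then
    round down so the average-bitwidth budget is never exceeded."""
    s = np.asarray(sensitivities, dtype=float)
    n = s.size

    b = cp.Variable(n)  # relaxed (continuous) bitwidths
    # Uniform-quantization noise power scales roughly as 4^{-b} = exp(-b ln 4),
    # which is convex in b, so the relaxed problem is convex.
    noise = cp.exp(-np.log(4.0) * b)
    objective = cp.Minimize(cp.sum(cp.multiply(s, noise)))
    constraints = [cp.sum(b) <= avg_bits * n, b >= min_bits, b <= max_bits]
    cp.Problem(objective, constraints).solve()

    # Flooring keeps the resource constraint satisfied by construction.
    bits = np.clip(np.floor(b.value + 1e-6), min_bits, max_bits).astype(int)

    # Greedily hand any leftover budget back to the layers whose noise proxy
    # benefits most from one extra bit.
    budget = int(avg_bits * n) - bits.sum()
    for i in np.argsort(-s * 4.0 ** (-bits.astype(float))):
        if budget <= 0:
            break
        if bits[i] < max_bits:
            bits[i] += 1
            budget -= 1
    return bits


if __name__ == "__main__":
    # Per-layer sensitivities (e.g. Hessian-trace or gradient-based estimates).
    sens = [12.0, 0.5, 3.2, 0.1, 7.4]
    print(allocate_bitwidths(sens, avg_bits=4.0))
```

Because the relaxed problem is convex and cheap to solve, an allocation of this kind can be recomputed periodically during training as the sensitivities evolve, which is why the constraint can be enforced exactly rather than merely encouraged through gradients.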