Adaptive quantization with mixed-precision based on low-cost proxy (2402.17706v1)
Abstract: Deploying complex neural network models on hardware with limited resources is a critical challenge. This paper proposes a novel model quantization method, named the Low-Cost Proxy-Based Adaptive Mixed-Precision Model Quantization (LCPAQ), which contains three key modules. The hardware-aware module is designed by considering the hardware limitations, while an adaptive mixed-precision quantization module evaluates quantization sensitivity using the Hessian matrix and Pareto frontier techniques. Integer linear programming is then used to fine-tune the bitwidth allocation across layers, and a low-cost proxy neural architecture search module efficiently explores the ideal quantization hyperparameters. Experiments on ImageNet demonstrate that the proposed LCPAQ achieves accuracy comparable or superior to existing mixed-precision models. Notably, LCPAQ requires only 1/200 of the search time of existing methods, offering a practical shortcut to quantization for resource-limited devices.
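To make the integer-linear-programming step of the abstract concrete, the following is a minimal sketch, not the authors' implementation: it assumes per-layer Hessian-trace sensitivities and parameter counts are given, uses a hypothetical quantization-error model proportional to 2^(-2b), and solves the per-layer bit assignment under a model-size budget with the PuLP library. All names, candidate bitwidths, and the cost model are illustrative.

```python
# Minimal sketch of Hessian-weighted mixed-precision bit allocation via ILP,
# in the spirit of the ILP step described in the abstract (not the authors' code).
import pulp

bit_choices = [2, 4, 8]                 # candidate bitwidths per layer (assumed)
sensitivity = [5.1, 2.3, 0.7, 0.4]      # hypothetical per-layer Hessian-trace proxies
params = [0.5e6, 1.2e6, 2.3e6, 4.1e6]   # parameters per layer (assumed)
size_budget_bits = 4 * sum(params)      # e.g. an average of 4 bits per weight

prob = pulp.LpProblem("mixed_precision_bit_allocation", pulp.LpMinimize)

# x[l, b] = 1 if layer l is quantized to b bits.
x = {(l, b): pulp.LpVariable(f"x_{l}_{b}", cat="Binary")
     for l in range(len(params)) for b in bit_choices}

# Objective: Hessian-weighted perturbation; lower bitwidths incur a larger
# penalty under the illustrative 2^(-2b) quantization-error model.
prob += pulp.lpSum(sensitivity[l] * (2.0 ** (-2 * b)) * x[l, b]
                   for l in range(len(params)) for b in bit_choices)

# Each layer is assigned exactly one bitwidth.
for l in range(len(params)):
    prob += pulp.lpSum(x[l, b] for b in bit_choices) == 1

# The total model size must respect the hardware budget.
prob += pulp.lpSum(params[l] * b * x[l, b]
                   for l in range(len(params)) for b in bit_choices) <= size_budget_bits

prob.solve(pulp.PULP_CBC_CMD(msg=False))
assignment = {l: b for (l, b) in x if x[l, b].value() > 0.5}
print("bitwidth per layer:", assignment)
```

In this toy setup, layers with higher sensitivity tend to receive wider bitwidths while less sensitive layers absorb the aggressive quantization, which mirrors the Pareto-style accuracy/size trade-off the paper targets.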
Authors: Junzhe Chen, Qiao Yang, Senmao Tian, Shunli Zhang