Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference
Abstract: For efficient neural network inference, it is desirable to achieve state-of-the-art accuracy with the simplest networks, requiring the least computation, memory, and power. Quantizing networks to lower precision is a powerful technique for simplifying them. Because each layer of a network may have a different sensitivity to quantization, mixed precision quantization methods selectively tune the precision of individual layers to minimize the drop in task performance (e.g., accuracy). To estimate the impact of layer precision choice on task performance, two methods are introduced: i) Entropy Approximation Guided Layer selection (EAGL), which is fast and uses the entropy of the weight distribution, and ii) Accuracy-aware Layer Precision Selection (ALPS), which is straightforward and relies on single-epoch fine-tuning after layer precision reduction. Using EAGL and ALPS for layer precision selection, full-precision accuracy is recovered with a mix of 4-bit and 2-bit layers for ResNet-50, ResNet-101, and BERT-base transformer networks, demonstrating improved performance across the entire accuracy-throughput frontier. The techniques outperform existing techniques in several commensurate comparisons. Notably, this is accomplished with significantly less computation time required to reach a solution.
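To make the idea behind EAGL concrete, the following is a minimal sketch of how the entropy of a layer's quantized weight distribution could be estimated. The abstract states only that EAGL uses the entropy of the weight distribution; the uniform symmetric quantizer, the histogram-based entropy estimate, the function names `quantize` and `entropy_bits`, and the two synthetic layers are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random
from collections import Counter


def quantize(weights, bits):
    """Map float weights to signed integer levels.

    Assumption: a uniform symmetric quantizer whose scale is set by the
    largest absolute weight. The paper's quantizer may differ.
    """
    qmax = 2 ** (bits - 1) - 1      # e.g. 7 for 4 bits
    qmin = -(2 ** (bits - 1))       # e.g. -8 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [min(qmax, max(qmin, round(w / scale))) for w in weights]


def entropy_bits(levels):
    """Empirical Shannon entropy (in bits) of a list of discrete levels."""
    counts = Counter(levels)
    n = len(levels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    random.seed(0)
    # Two hypothetical layers: one with evenly spread weights, one with
    # weights concentrated near zero plus a few large outliers.
    spread = [random.uniform(-1.0, 1.0) for _ in range(10_000)]
    peaked = [random.gauss(0.0, 0.05) for _ in range(9_990)] + [1.0] * 10

    for name, layer in (("spread", spread), ("peaked", peaked)):
        h = entropy_bits(quantize(layer, bits=4))
        print(f"{name}: entropy of 4-bit weights = {h:.2f} bits")
```

In this sketch, a layer whose 4-bit weights occupy only a few levels has low entropy, which one might read as a sign that it carries less information per weight and is a candidate for lower precision. How such a score is turned into per-layer precision choices is not specified in the abstract and is not modeled here.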