Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference

Published 30 Jan 2023 in cs.LG and cs.CV | (2301.13330v2)

Abstract: For efficient neural network inference, it is desirable to achieve state-of-the-art accuracy with the simplest networks requiring the least computation, memory, and power. Quantizing networks to lower precision is a powerful technique for simplifying networks. Because each layer of a network may have a different sensitivity to quantization, mixed precision quantization methods selectively tune the precision of individual layers to minimize the drop in task performance (e.g., accuracy). To estimate the impact of layer precision choice on task performance, two methods are introduced: i) Entropy Approximation Guided Layer selection (EAGL), which is fast and uses the entropy of the weight distribution, and ii) Accuracy-aware Layer Precision Selection (ALPS), which is straightforward and relies on single-epoch fine-tuning after layer precision reduction. Using EAGL and ALPS for layer precision selection, full-precision accuracy is recovered with a mix of 4-bit and 2-bit layers for ResNet-50, ResNet-101 and BERT-base transformer networks, demonstrating enhanced performance across the entire accuracy-throughput frontier. Both methods outperform existing techniques in several commensurate comparisons. Notably, this is accomplished with significantly less computation time to reach a solution.
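
The idea behind an entropy-guided selection such as EAGL can be illustrated with a small sketch. The abstract states only that EAGL uses the entropy of the weight distribution; everything else here is an assumption made for illustration, not the authors' implementation: the symmetric uniform quantizer, the rule that the lowest-entropy layers are the ones moved to lower precision, and the hypothetical helper names `quantize`, `entropy_bits`, and `select_low_precision`.

```python
import math
import random
from collections import Counter


def quantize(weights, bits):
    """Symmetric uniform quantization to signed integer levels (illustrative assumption)."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 7 for 4-bit
    qmin = -(2 ** (bits - 1))            # e.g. -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [min(qmax, max(qmin, round(w / scale))) for w in weights]


def entropy_bits(levels):
    """Shannon entropy, in bits, of the empirical distribution of quantized levels."""
    counts = Counter(levels)
    n = len(levels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def select_low_precision(layers, num_to_reduce, bits=4):
    """Rank layers by weight entropy and return the lowest-entropy ones.

    Assumption: a layer whose quantized weights carry little information
    is a better candidate for a lower bit width.
    """
    scores = {name: entropy_bits(quantize(w, bits)) for name, w in layers.items()}
    ranked = sorted(scores, key=scores.get)
    return ranked[:num_to_reduce], scores


# Synthetic "layers" standing in for real weight tensors.
random.seed(0)
layers = {
    "conv1": [random.gauss(0.0, 1.0) for _ in range(2000)],
    "conv2": [random.choice([-1.0, 0.0, 0.0, 0.0, 1.0]) for _ in range(2000)],
    "fc": [random.uniform(-1.0, 1.0) for _ in range(2000)],
}
chosen, scores = select_low_precision(layers, num_to_reduce=1)
print(chosen, {name: round(s, 2) for name, s in scores.items()})
```

In this toy setup the layer whose weights take only three distinct values has the lowest entropy and is the one selected for reduction. A real pipeline would score trained weights and choose layers under a throughput or memory budget rather than a fixed count.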
