A2Q+: Improving Accumulator-Aware Weight Quantization (2401.10432v1)

Published 19 Jan 2024 in cs.LG, cs.AR, and cs.PF

Abstract: Quantization techniques commonly reduce the inference costs of neural networks by restricting the precision of weights and activations. Recent studies show that reducing the precision of the accumulator as well can further improve hardware efficiency, at the risk of numerical overflow, which introduces arithmetic errors that can degrade model accuracy. To avoid numerical overflow while maintaining accuracy, recent work proposed accumulator-aware quantization (A2Q), a quantization-aware training method that constrains model weights during training so that inference can safely use a target accumulator bit width. Although this shows promise, we demonstrate that A2Q relies on an overly restrictive constraint and a sub-optimal weight initialization strategy that each introduce superfluous quantization error. To address these shortcomings, we introduce: (1) an improved bound that alleviates accumulator constraints without compromising overflow avoidance; and (2) a new strategy for initializing quantized weights from pre-trained floating-point checkpoints. We combine these contributions with weight normalization to introduce A2Q+. We support our analysis with experiments showing that A2Q+ significantly improves the trade-off between accumulator bit width and model accuracy, and we characterize new trade-offs that arise as a consequence of accumulator constraints.
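To make the overflow-avoidance idea concrete, below is a minimal NumPy sketch of the kind of l1-norm constraint an accumulator-aware method enforces. The helper names (`l1_limit`, `project_weights`) and the unsigned-activation setup are illustrative assumptions, not the paper's API, and the bound shown is the straightforward first-principles A2Q-style bound rather than the tighter bound A2Q+ derives.

```python
# Minimal sketch: an l1-norm constraint that provably avoids accumulator
# overflow. Assumes unsigned N-bit activations and a signed P-bit accumulator.
# This is the first-principles A2Q-style bound, NOT the improved A2Q+ bound.
import numpy as np


def l1_limit(acc_bits: int, act_bits: int) -> float:
    """Largest l1 norm of integer weights that guarantees no overflow.

    With activations x_i in [0, 2**act_bits - 1], every dot product satisfies
    |sum_i q_i * x_i| <= ||q||_1 * (2**act_bits - 1). Keeping this below the
    signed accumulator ceiling 2**(acc_bits - 1) - 1 rules out overflow for
    any possible input.
    """
    return (2 ** (acc_bits - 1) - 1) / (2 ** act_bits - 1)


def project_weights(q: np.ndarray, acc_bits: int, act_bits: int) -> np.ndarray:
    """Shrink integer weights until their l1 norm satisfies the bound.

    Truncation toward zero (np.trunc) never increases a weight's magnitude,
    so the projected norm cannot exceed the limit.
    """
    limit = l1_limit(acc_bits, act_bits)
    norm = np.abs(q).sum()
    if norm <= limit:
        return q
    return np.trunc(q * (limit / norm)).astype(q.dtype)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.integers(-8, 8, size=256)            # 4-bit signed weights
    q = project_weights(q, acc_bits=16, act_bits=8)
    x = np.full(256, 255, dtype=np.int64)        # worst-case 8-bit activations
    worst = int(np.abs(q) @ x)                   # largest possible |accumulator|
    assert worst <= 2 ** 15 - 1, "overflow would be possible"
    print(f"l1 norm = {np.abs(q).sum()}, worst |acc| = {worst} <= {2 ** 15 - 1}")
```

Note how aggressively the projection drives weights toward zero for narrow accumulators; this induced sparsity is one example of the trade-offs the abstract says arise as a consequence of accumulator constraints.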
