Gradient-based Automatic Mixed Precision Quantization for Neural Networks On-Chip (2405.00645v2)

Published 1 May 2024 in cs.LG and physics.ins-det

Abstract: Model size and inference speed at deployment time are major challenges in many deep learning applications. A promising strategy to overcome these challenges is quantization. However, a straightforward uniform quantization to very low precision can result in significant accuracy loss. Mixed-precision quantization, based on the idea that certain parts of the network can accommodate lower precision than others without compromising performance, offers a potential solution. In this work, we present High Granularity Quantization (HGQ), an innovative quantization-aware training method that fine-tunes the per-weight and per-activation precision by making them optimizable through gradient descent. This approach enables ultra-low latency and low power neural networks on hardware capable of performing arithmetic operations with an arbitrary number of bits, such as FPGAs and ASICs. We demonstrate that HGQ can outperform existing methods by a substantial margin, achieving resource reduction by up to a factor of 20 and latency improvement by a factor of 5 while preserving accuracy.
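To make the core idea concrete, below is a minimal sketch in TensorFlow/Keras of how per-weight precision can be made optimizable through gradient descent: each weight gets its own trainable fractional bit width, a straight-through rounding estimator lets gradients reach both the weights and the bit widths, and an added loss term penalizes the total number of bits so training trades accuracy against hardware cost. This is not the authors' HGQ implementation; the layer name HGQDense, the softplus parameterisation, the beta penalty, and the LSQ-style surrogate gradient are illustrative assumptions.

```python
# Minimal, illustrative sketch of gradient-trainable per-weight bit widths.
# NOT the authors' HGQ implementation: HGQDense, the softplus parameterisation,
# the `beta` resource penalty, and the straight-through surrogate are assumptions.
import tensorflow as tf


def round_ste(x):
    # Round in the forward pass; behave as identity in the backward pass.
    return x + tf.stop_gradient(tf.round(x) - x)


class HGQDense(tf.keras.layers.Layer):
    """Dense layer whose per-weight fractional bit widths are optimised by gradient descent."""

    def __init__(self, units, beta=1e-4, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.beta = beta  # weight of the bit-width (resource) penalty in the loss

    def build(self, input_shape):
        in_dim = int(input_shape[-1])
        self.w = self.add_weight(name="w", shape=(in_dim, self.units),
                                 initializer="glorot_uniform", trainable=True)
        # One trainable fractional bit width per weight, initialised near 8 bits.
        self.raw_bits = self.add_weight(name="raw_bits", shape=(in_dim, self.units),
                                        initializer=tf.keras.initializers.Constant(8.0),
                                        trainable=True)

    def call(self, x):
        bits = tf.nn.softplus(self.raw_bits)   # keep the effective bit width positive
        step = tf.pow(2.0, -bits)              # per-weight quantization step 2^(-bits)
        # Quantize weights; the straight-through round lets gradients reach both
        # the weights and (via `step`) the bit widths.
        w_q = round_ste(self.w / step) * step
        # Penalise total bits so training trades accuracy against hardware cost.
        self.add_loss(self.beta * tf.reduce_sum(bits))
        return tf.matmul(x, w_q)


if __name__ == "__main__":
    # Toy regression run showing that both weights and bit widths receive gradients.
    model = tf.keras.Sequential([tf.keras.Input(shape=(16,)), HGQDense(4)])
    model.compile(optimizer="adam", loss="mse")
    model.fit(tf.random.normal((256, 16)), tf.random.normal((256, 4)),
              epochs=2, verbose=0)
```

The full method also learns per-activation precision and targets deployment on hardware with arbitrary-width arithmetic such as FPGAs and ASICs; the sketch above covers only the weight side for brevity.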

Authors (5)
  1. Chang Sun
  2. Thea K. Årrestad
  3. Vladimir Loncar
  4. Jennifer Ngadiuba
  5. Maria Spiropulu
