Rotation Invariant Quantization for Model Compression (2303.03106v3)

Published 3 Mar 2023 in cs.LG, cs.AI, cs.IT, and math.IT

Abstract: Post-training Neural Network (NN) model compression is an attractive approach for deploying large, memory-consuming models on devices with limited memory resources. In this study, we investigate the rate-distortion tradeoff for NN model compression. First, we suggest a Rotation-Invariant Quantization (RIQ) technique that utilizes a single parameter to quantize the entire NN model, yielding a different rate at each layer, i.e., mixed-precision quantization. Then, we prove that our rotation-invariant approach is optimal in terms of compression. We rigorously evaluate RIQ and demonstrate its capabilities on various models and tasks. For example, RIQ achieves 19.4× and 52.9× compression ratios on pre-trained dense and pruned VGG models, respectively, with less than 0.4% accuracy degradation. Code is available at github.com/ehaleva/RIQ.
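
The abstract's key mechanism, a single global parameter driving a per-layer quantizer whose step size depends only on a rotation-invariant statistic of each layer, can be illustrated with a short sketch. The Python snippet below is a minimal illustration under stated assumptions: the norm-based step-size rule, the function names (riq_like_quantize, empirical_rate), and the entropy-based rate estimate are hypothetical choices for illustration, not the paper's exact algorithm (the authors' implementation is at github.com/ehaleva/RIQ).

```python
# Minimal sketch of rotation-invariant, single-parameter quantization.
# ASSUMPTIONS: the norm-based step-size rule, function names, and the
# entropy-based rate estimate below are illustrative, not the exact
# algorithm from the paper (see github.com/ehaleva/RIQ for the real code).
import numpy as np

def riq_like_quantize(weights, global_param):
    """Uniformly quantize one layer with a step size tied to its norm.

    The Euclidean norm is invariant to rotations of the weight vector,
    so the step size (and hence the layer's rate) depends only on this
    rotation-invariant quantity and the single global parameter.
    """
    w = weights.ravel()
    # Step size grows with the layer's norm and the global parameter,
    # shrinks with dimension -> a different precision per layer.
    delta = global_param * np.linalg.norm(w) / np.sqrt(w.size)
    q = np.round(w / delta)  # integer code words
    return q.astype(np.int64), delta

def dequantize(q, delta, shape):
    return (q * delta).reshape(shape)

def empirical_rate(q):
    """Rough bits-per-weight estimate via the empirical entropy of the codes."""
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two toy "layers" with very different scales and sizes.
    layers = {"conv1": rng.normal(0, 1.0, (64, 3, 3, 3)),
              "fc": rng.normal(0, 0.05, (10, 512))}
    global_param = 0.05  # the single tuning knob shared by all layers
    for name, w in layers.items():
        q, delta = riq_like_quantize(w, global_param)
        w_hat = dequantize(q, delta, w.shape)
        err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
        print(f"{name}: ~{empirical_rate(q):.2f} bits/weight, "
              f"relative error {err:.4f}")
```

Because the step size scales with each layer's norm relative to its dimension, layers end up with different code-word distributions and hence different bit rates from one shared parameter, which matches the mixed-precision behavior the abstract describes; the precise step-size rule RIQ uses and its optimality argument are given in the paper.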

Authors (7)
  1. Joseph Kampeas (5 papers)
  2. Yury Nahshan (6 papers)
  3. Hanoch Kremer (1 paper)
  4. Gil Lederman (4 papers)
  5. Shira Zaloshinski (1 paper)
  6. Zheng Li (326 papers)
  7. Emir Haleva (4 papers)