Towards Optimal Compression: Joint Pruning and Quantization (2302.07612v2)

Published 15 Feb 2023 in cs.LG

Abstract: Model compression is instrumental in optimizing deep neural network inference on resource-constrained hardware. The prevailing methods for network compression, namely quantization and pruning, have been shown to enhance efficiency at the cost of performance. Determining the most effective quantization and pruning strategies for individual layers and parameters remains a challenging problem, often requiring computationally expensive and ad hoc numerical optimization techniques. This paper introduces FITCompress, a novel method integrating layer-wise mixed-precision quantization and unstructured pruning using a unified heuristic approach. By leveraging the Fisher Information Metric and path planning through compression space, FITCompress optimally selects a combination of pruning mask and mixed-precision quantization configuration for a given pre-trained model and compression constraint. Experiments on computer vision and natural language processing benchmarks demonstrate that our proposed approach achieves a superior compression-performance trade-off compared to existing state-of-the-art methods. FITCompress stands out for its principled derivation, making it versatile across tasks and network architectures, and represents a step towards achieving optimal compression for neural networks.
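
A minimal sketch of the Fisher-information idea the abstract describes, assuming a diagonal empirical Fisher approximation: parameters whose Fisher-weighted perturbation is small are safer to prune or to quantize aggressively. This is an illustration of the general sensitivity heuristic only, not the authors' FITCompress path-planning procedure; the toy model, random calibration batch, candidate bit-widths, and 50% keep ratio are hypothetical choices.

```python
# Illustrative sketch (not the paper's algorithm): score pruning and
# mixed-precision quantization candidates with a diagonal empirical Fisher.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical toy model and stand-in calibration batch (MNIST-shaped).
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
x = torch.randn(32, 1, 28, 28)
y = torch.randint(0, 10, (32,))

# Diagonal empirical Fisher: average squared per-sample gradient of the loss.
fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
for xi, yi in zip(x, y):
    model.zero_grad()
    F.cross_entropy(model(xi.unsqueeze(0)), yi.unsqueeze(0)).backward()
    for n, p in model.named_parameters():
        fisher[n] += p.grad.detach() ** 2 / len(x)

def quantize(w, bits):
    """Symmetric uniform quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

for n, p in model.named_parameters():
    if p.dim() < 2:  # skip biases for brevity
        continue
    w = p.detach()
    # Fisher-weighted squared error of each candidate bit-width:
    # a smaller score suggests the layer tolerates that precision better.
    for bits in (8, 4, 2):
        score = (fisher[n] * (w - quantize(w, bits)) ** 2).sum().item()
        print(f"{n}: {bits}-bit sensitivity = {score:.3e}")
    # Unstructured pruning: rank weights by Fisher-weighted magnitude
    # and keep the top half (the 50% ratio is an arbitrary example).
    prune_score = fisher[n] * w ** 2
    mask = prune_score > torch.quantile(prune_score.flatten(), 0.5)
    print(f"{n}: kept {mask.float().mean().item():.0%} of weights")
```

In practice such scores would be computed on a real calibration set and the per-layer pruning mask and bit-width would be chosen jointly under a global compression constraint, which is the role the paper assigns to its path-planning search through compression space.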

Authors (6)
  1. Ben Zandonati (3 papers)
  2. Glenn Bucagu (1 paper)
  3. Adrian Alan Pol (13 papers)
  4. Maurizio Pierini (85 papers)
  5. Olya Sirkin (5 papers)
  6. Tal Kopetz (6 papers)
Citations (2)