
Pruning vs Quantization: Which is Better? (2307.02973v2)

Published 6 Jul 2023 in cs.LG

Abstract: Neural network pruning and quantization techniques are almost as old as neural networks themselves. However, to date only ad-hoc comparisons between the two have been published. In this paper, we set out to answer the question on which is better: neural network quantization or pruning? By answering this question, we hope to inform design decisions made on neural network hardware going forward. We provide an extensive comparison between the two techniques for compressing deep neural networks. First, we give an analytical comparison of expected quantization and pruning error for general data distributions. Then, we provide lower bounds for the per-layer pruning and quantization error in trained networks, and compare these to empirical error after optimization. Finally, we provide an extensive experimental comparison for training 8 large-scale models on 3 tasks. Our results show that in most cases quantization outperforms pruning. Only in some scenarios with very high compression ratio, pruning might be beneficial from an accuracy standpoint.

A Comparative Analysis of Pruning and Quantization in Neural Network Compression

The paper "Pruning vs Quantization: Which is Better?" provides a detailed inquiry into the efficiencies of pruning and quantization techniques in compressing deep neural networks (DNNs). The authors aim to delineate which of these methods proves more effective, focusing on their potential impact on neural network accuracy.

Analytical Comparisons

The paper begins with an analytical exploration of both techniques. Quantization reduces the bit-width used for weights and computation, which yields predictable savings in memory and arithmetic cost. Pruning, by contrast, removes individual weights entirely, reducing both the memory footprint and the computational load during inference.

Using the signal-to-noise ratio (SNR) as the key metric, the authors analyze the mean-squared error (MSE) introduced by each method for general weight distributions. This analytical framework provides a theoretical basis for understanding the underlying trade-offs. The analysis suggests that quantization achieves a higher SNR at moderate compression ratios, particularly when the weights are Gaussian-like.
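
For intuition, the sketch below reproduces this style of comparison on synthetic Gaussian weights: it applies a naive min/max uniform quantizer and magnitude pruning at roughly matched compression ratios (assuming a 16-bit baseline, so b-bit quantization is paired with keeping b/16 of the weights) and reports the resulting SNR. The quantizer, the pruning criterion, and the ratio matching are simplifying assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100_000)          # Gaussian-like "weights"

def snr_db(w, w_hat):
    """Signal-to-noise ratio (dB) between original and compressed weights."""
    return 10 * np.log10(np.sum(w**2) / np.sum((w - w_hat)**2))

def quantize(w, bits):
    """Naive min/max symmetric uniform quantizer (illustrative only)."""
    n_levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / n_levels
    return np.clip(np.round(w / scale), -n_levels - 1, n_levels) * scale

def prune(w, keep_fraction):
    """Magnitude pruning: zero out all but the largest-magnitude weights."""
    k = int(keep_fraction * w.size)
    thresh = np.sort(np.abs(w))[-k]
    return np.where(np.abs(w) >= thresh, w, 0.0)

# Compare at (roughly) equal compression: b-bit quantization vs keeping
# b/16 of 16-bit weights -- an assumption about how ratios are matched.
for bits in (8, 4, 2):
    q_snr = snr_db(w, quantize(w, bits))
    p_snr = snr_db(w, prune(w, bits / 16))
    print(f"{bits}-bit quant SNR: {q_snr:5.1f} dB | "
          f"pruning at {1 - bits/16:.0%} sparsity SNR: {p_snr:5.1f} dB")
```

Even in this toy setup, quantization comes out well ahead at 8 and 4 bits, while the gap tightens or flips at 2 bits, mirroring the qualitative conclusion of the analytical comparison.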

Experimental Evaluations

The paper then turns to empirical evaluations on weight tensors from pre-trained models of various scales. A consistent finding is that quantization outperforms pruning in all but the most extreme compression regimes. At very high compression ratios pruning can be preferable, owing to how it treats the tails of the weight distribution (large-magnitude outliers are kept exactly), but the accompanying loss in model performance is often prohibitive.

Post-Training and Fine-Tuning Scenarios

For the post-training setting, the paper derives theoretical per-layer error bounds: semidefinite programming (SDP) is used to bound the quantization error, while the pruning error can be solved exactly in manageable cases. Comparing these bounds with the empirical error reached after optimization avoids biases tied to any particular algorithm and gives a clearer picture of what each technique can achieve at best.
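
As a toy stand-in for the "manageable scenarios" mentioned above, the snippet below brute-forces the exact pruning solution for one tiny layer under a per-layer reconstruction objective (minimizing ||Xw − Xŵ||² on calibration data, with the kept weights re-fit by least squares). Both the objective and the re-fitting step are illustrative assumptions; the paper's SDP machinery for quantization lower bounds is not reproduced here.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 8                      # calibration samples, layer width (tiny on purpose)
X = rng.normal(size=(n, d))        # calibration activations
w = rng.normal(size=d)             # original weights of one output unit
k = 4                              # number of weights to keep

y = X @ w
best_err, best_w = np.inf, None
# Enumerate all supports of size k; only feasible for tiny layers,
# which is exactly the "manageable" regime referred to above.
for support in itertools.combinations(range(d), k):
    idx = list(support)
    # Re-fit the kept weights by least squares on the calibration data.
    w_fit, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
    w_hat = np.zeros(d)
    w_hat[idx] = w_fit
    err = np.sum((y - X @ w_hat) ** 2)
    if err < best_err:
        best_err, best_w = err, w_hat

print("optimal support:", np.flatnonzero(best_w))
print("optimal per-layer pruning error:", best_err)
```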

Under fine-tuning, quantization-aware training (QAT) with LSQ consistently preserved more accuracy than pruning across tasks when both were evaluated at equal compression ratios. Pruning becomes competitive only at compression levels equivalent to very low bit-widths, a regime rarely used in practice because of the severe accuracy drop.
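
For readers unfamiliar with LSQ, the sketch below shows the core idea of a learnable-step-size fake quantizer: the step size is a trainable parameter, rounding is passed through with a straight-through estimator, and the step-size gradient is rescaled for stability. The bit-width, initialization, and wiring follow the public LSQ recipe and are assumptions here, not the paper's exact training configuration.

```python
import torch
import torch.nn as nn

class LSQFakeQuant(nn.Module):
    """Minimal LSQ-style weight fake-quantizer sketch (not the authors' code)."""
    def __init__(self, init_scale: float, bits: int = 4):
        super().__init__()
        self.qn = -(2 ** (bits - 1))          # e.g. -8 for 4-bit signed
        self.qp = 2 ** (bits - 1) - 1         # e.g. +7
        self.scale = nn.Parameter(torch.tensor(init_scale))

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        # Gradient scaling from the LSQ paper stabilizes the learned step size.
        g = 1.0 / (w.numel() * self.qp) ** 0.5
        s = self.scale * g + (self.scale - self.scale * g).detach()   # grad scaled by g
        q = torch.clamp(w / s, self.qn, self.qp)
        q = q + (torch.round(q) - q).detach()   # straight-through estimator for round()
        return q * s

# Usage: wrap a layer's weights in the forward pass during QAT.
layer = nn.Linear(128, 64)
quant = LSQFakeQuant(init_scale=2 * layer.weight.abs().mean().item() / (7 ** 0.5))
w_q = quant(layer.weight)   # differentiable w.r.t. both the weights and the step size
```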

Implications and Future Directions

In practical terms, the findings argue for prioritizing quantization when deploying neural networks where computational efficiency and accuracy are paramount. The sparsity that arises naturally in quantized tensors, i.e. the weights that round to exactly zero, points to further optimization opportunities without added hardware complexity.
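
The "intrinsic sparsity" referred to here is simply the fraction of weights that round to exactly zero once quantized. A quick synthetic check (Gaussian weights and a naive min/max quantizer, both assumptions) illustrates how this fraction grows as the bit-width shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=1_000_000)   # hypothetical layer weights

for bits in (8, 4):
    n_levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / n_levels
    zero_fraction = np.mean(np.round(w / scale) == 0)
    print(f"{bits}-bit quantization: {zero_fraction:.1%} of weights round to zero")
```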

The paper hints at future areas of research, including exploring combinations of pruning and quantization. Despite the potential theoretical advantages of these combinations, further practical investigations are required to assess their feasibility and impact across diverse models and architectures.

Conclusion

This research presents a comprehensive comparison of pruning and quantization, illustrating the consistent edge that quantization holds in most practical compression scenarios. The emphasis on careful SNR measurements, together with the combination of theoretical and empirical analysis, makes the work a useful reference for hardware-aware model compression strategies. While it does not dwell on hardware specifics, the paper offers essential guidance for researchers and practitioners seeking to compress neural networks efficiently.

Authors (5)
  1. Andrey Kuzmin (8 papers)
  2. Markus Nagel (33 papers)
  3. Mart van Baalen (18 papers)
  4. Arash Behboodi (44 papers)
  5. Tijmen Blankevoort (37 papers)
Citations (28)