End-to-End Neural Network Compression via $\frac{\ell_1}{\ell_2}$ Regularized Latency Surrogates (2306.05785v2)

Published 9 Jun 2023 in cs.LG

Abstract: Neural network (NN) compression via techniques such as pruning and quantization requires setting compression hyperparameters (e.g., the number of channels to prune, bitwidths for quantization) for each layer, either manually or via neural architecture search (NAS), which can be computationally expensive. We address this problem with an end-to-end technique that optimizes a model's Floating Point Operations (FLOPs) or its on-device latency via a novel $\frac{\ell_1}{\ell_2}$ latency surrogate. Our algorithm is versatile and can be used with many popular compression methods, including pruning, low-rank factorization, and quantization. Crucially, it is fast and runs in almost the same amount of time as single-model training, which is a significant training speed-up over standard NAS methods. For BERT compression on GLUE fine-tuning tasks, we achieve a $50\%$ reduction in FLOPs with only a $1\%$ drop in performance. For compressing MobileNetV3 on ImageNet-1K, we achieve a $15\%$ reduction in FLOPs and an $11\%$ reduction in on-device latency without any drop in accuracy, while still requiring $3\times$ less training compute than SOTA compression techniques. Finally, for transfer learning on smaller datasets, our technique identifies architectures that are $1.2\times$-$1.4\times$ cheaper than the standard MobileNetV3 and EfficientNet suites at almost the same training cost and accuracy.
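To make the abstract's central idea concrete, the sketch below shows one way an $\frac{\ell_1}{\ell_2}$-regularized FLOPs surrogate can be folded into ordinary training: trainable per-channel gates are attached to each layer, and a scale-invariant $\frac{\|\alpha\|_1}{\|\alpha\|_2}$ penalty, weighted by each layer's per-channel FLOPs cost, is added to the task loss. This is a minimal, illustrative sketch under assumed names (`GatedConv`, `flops_per_channel`, `lam`); it is not the authors' implementation and omits details such as on-device latency models, low-rank factorization, and bitwidth search.

```python
# Illustrative sketch (assumptions, not the paper's code): per-channel gates
# plus an l1/l2 sparsity penalty used as a differentiable FLOPs surrogate.
import torch
import torch.nn as nn

class GatedConv(nn.Module):
    """Conv layer with learnable per-output-channel gates alpha."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.alpha = nn.Parameter(torch.ones(out_ch))  # channel gates

    def forward(self, x):
        # Scale each output channel by its gate; channels with small gates
        # contribute little and can later be pruned.
        return self.conv(x) * self.alpha.view(1, -1, 1, 1)

def l1_over_l2(v, eps=1e-8):
    """Scale-invariant l1/l2 ratio; smaller values mean fewer active entries."""
    return v.abs().sum() / (v.norm(p=2) + eps)

def flops_surrogate(gated_layers, flops_per_channel):
    """Differentiable proxy for total FLOPs: each layer's l1/l2 gate ratio
    scaled by that layer's (assumed) per-channel FLOPs cost."""
    return sum(cost * l1_over_l2(layer.alpha)
               for layer, cost in zip(gated_layers, flops_per_channel))

# One training step: task loss + lam * surrogate, so the compression
# hyperparameters (effective channel counts) are learned end to end.
model = nn.Sequential(GatedConv(3, 16), nn.ReLU(), GatedConv(16, 32))
flops_per_channel = [1.0, 2.0]  # assumed per-channel costs for the two layers
lam = 1e-2                      # assumed regularization strength
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(8, 3, 32, 32), torch.randn(8, 32, 32, 32)
opt.zero_grad()
loss = nn.functional.mse_loss(model(x), y) + lam * flops_surrogate(
    [m for m in model if isinstance(m, GatedConv)], flops_per_channel)
loss.backward()
opt.step()
```

After training, channels whose gates shrink toward zero can be removed; the penalty weight `lam` (an assumption here) trades task accuracy against the FLOPs/latency proxy, which is what allows the whole procedure to run in roughly the time of a single model training rather than a full NAS loop.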

Authors (4)
  1. Anshul Nasery (12 papers)
  2. Hardik Shah (12 papers)
  3. Arun Sai Suggala (18 papers)
  4. Prateek Jain (131 papers)
Citations (1)