SequentialAttention++ for Block Sparsification: Differentiable Pruning Meets Combinatorial Optimization (2402.17902v2)
Abstract: Neural network pruning is a key technique towards engineering large yet scalable, interpretable, and generalizable models. Prior work on the subject has developed largely along two orthogonal directions: (1) differentiable pruning for efficiently and accurately scoring the importance of parameters, and (2) combinatorial optimization for efficiently searching over the space of sparse models. We unite the two approaches, both theoretically and empirically, to produce a coherent framework for structured neural network pruning in which differentiable pruning guides combinatorial optimization algorithms to select the most important sparse set of parameters. Theoretically, we show how many existing differentiable pruning techniques can be understood as nonconvex regularization for group sparse optimization, and prove that for a wide class of nonconvex regularizers, the global optimum is unique, group-sparse, and provably yields an approximate solution to a sparse convex optimization problem. The resulting algorithm that we propose, SequentialAttention++, advances the state of the art in large-scale neural network block-wise pruning tasks on the ImageNet and Criteo datasets.
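The interplay described in the abstract, differentiable importance scores feeding a combinatorial selection step, can be illustrated with a minimal sketch. The PyTorch code below is not the paper's SequentialAttention++ algorithm; it is a toy example on a synthetic block-sparse regression task in which a softmax over learnable block logits acts as a differentiable mask, and a greedy top-k step then commits to a hard block-sparse support. All names and sizes here (`num_blocks`, `block_size`, `keep_blocks`, the 500-step loop) are hypothetical choices for this illustration.

```python
# Illustrative sketch only: a softmax-based differentiable mask scores blocks of
# a weight matrix, and the learned scores guide a greedy top-k (combinatorial)
# selection of which blocks to keep.
import torch

torch.manual_seed(0)

d_in, d_out, block_size = 32, 16, 4      # hypothetical sizes: 8 input blocks of 4 features
num_blocks = d_in // block_size
keep_blocks = 2                          # target: keep 2 of the 8 blocks

# Synthetic regression data from a planted block-sparse linear model
# (only the first two input blocks matter).
X = torch.randn(512, d_in)
true_W = torch.zeros(d_in, d_out)
true_W[: 2 * block_size] = torch.randn(2 * block_size, d_out)
y = X @ true_W + 0.01 * torch.randn(512, d_out)

# Trainable weights plus one logit per block; softmax(logits) is the
# differentiable importance mask over blocks.
W = (0.1 * torch.randn(d_in, d_out)).requires_grad_(True)
logits = torch.zeros(num_blocks, requires_grad=True)

opt = torch.optim.Adam([W, logits], lr=1e-2)
for step in range(500):
    mask = torch.softmax(logits, dim=0)                # soft block scores, sum to 1
    feat_mask = mask.repeat_interleave(block_size)     # broadcast block scores to features
    pred = X @ (feat_mask.unsqueeze(1) * W)            # mask scales whole input blocks
    loss = torch.nn.functional.mse_loss(pred, y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Combinatorial step: use the learned scores to greedily pick the top-k blocks
# and zero out everything else, yielding a hard block-sparse model.
scores = torch.softmax(logits, dim=0).detach()
keep = torch.topk(scores, keep_blocks).indices
hard_mask = torch.zeros(num_blocks)
hard_mask[keep] = 1.0
W_pruned = (hard_mask.repeat_interleave(block_size).unsqueeze(1) * W).detach()

print("kept blocks:", sorted(keep.tolist()))
print("block scores:", [round(s, 3) for s in scores.tolist()])
```

In this toy setting, the softmax mask plays the role that the abstract assigns to nonconvex, group-sparse regularization: it differentiably concentrates importance on a few blocks, and the final hard top-k selection is the combinatorial step that the learned scores guide.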
- Reparameterizing mirror descent as gradient descent. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020a. URL https://proceedings.neurips.cc/paper/2020/hash/604b37ea63ea51fa5fb3d8a89ec056e6-Abstract.html.
- Winnowing with gradient descent. In Abernethy, J. D. and Agarwal, S. (eds.), Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria], volume 125 of Proceedings of Machine Learning Research, pp. 163–182. PMLR, 2020b. URL http://proceedings.mlr.press/v125/amid20a.html.
- Structured pruning of deep convolutional neural networks. ACM J. Emerg. Technol. Comput. Syst., 13(3):32:1–32:18, 2017. doi: 10.1145/3005348. URL https://doi.org/10.1145/3005348.
- Performance of ℓ1 regularization for sparse convex optimization. CoRR, abs/2307.07405, 2023. doi: 10.48550/ARXIV.2307.07405. URL https://doi.org/10.48550/arXiv.2307.07405.
- Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis, 27(3):265–274, 2009.
- PDP: parameter-free differentiable pruning is all you need. CoRR, abs/2305.11203, 2023. doi: 10.48550/ARXIV.2305.11203. URL https://doi.org/10.48550/arXiv.2305.11203.
- ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.
- Attribution modeling increases efficiency of bidding in display advertising. In Proceedings of the ADKDD’17, pp. 1–6. 2017.
- Rigging the lottery: Making all tickets winners. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 2943–2952. PMLR, 2020. URL http://proceedings.mlr.press/v119/evci20a.html.
- Variable selection is hard. In Grünwald, P., Hazan, E., and Kale, S. (eds.), Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, volume 40 of JMLR Workshop and Conference Proceedings, pp. 696–709. JMLR.org, 2015. URL http://proceedings.mlr.press/v40/Foster15.html.
- The lottery ticket hypothesis: Finding sparse, trainable neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.
- SparseGPT: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pp. 10323–10337. PMLR, 2023.
- The fine-grained hardness of sparse linear regression. CoRR, abs/2106.03131, 2021. URL https://arxiv.org/abs/2106.03131.
- Data-efficient structured pruning via submodular optimization. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/ed5854c456e136afa3faa5e41b1f3509-Abstract-Conference.html.
- Learning both weights and connections for efficient neural network. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 1135–1143, 2015. URL https://proceedings.neurips.cc/paper/2015/hash/ae0eb3eed39d2bcef4622b2499a05fe6-Abstract.html.
- Optimal brain surgeon and general network pruning. In Proceedings of International Conference on Neural Networks (ICNN’88), San Francisco, CA, USA, March 28 - April 1, 1993, pp. 293–299. IEEE, 1993. doi: 10.1109/ICNN.1993.298572. URL https://doi.org/10.1109/ICNN.1993.298572.
- Hoff, P. D. Lasso, fractional norm and structured sparse estimation using a Hadamard product parametrization. Computational Statistics & Data Analysis, 115:186–198, 2017.
- Operation-aware soft channel pruning using differentiable masks. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 5122–5131. PMLR, 2020. URL http://proceedings.mlr.press/v119/kang20a.html.
- Karnin, E. D. A simple procedure for pruning back-propagation trained neural networks. IEEE Trans. Neural Networks, 1(2):239–242, 1990. doi: 10.1109/72.80236. URL https://doi.org/10.1109/72.80236.
- Cap: Correlation-aware pruning for highly-accurate sparse vision models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a.
- Accurate neural network pruning requires rethinking sparse optimization. arXiv preprint arXiv:2308.02060, 2023b.
- Optimal brain damage. In Touretzky, D. S. (ed.), Advances in Neural Information Processing Systems 2, [NIPS Conference, Denver, Colorado, USA, November 27-30, 1989], pp. 598–605. Morgan Kaufmann, 1989. URL http://papers.nips.cc/paper/250-optimal-brain-damage.
- DARTS: differentiable architecture search. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=S1eYHoC5FX.
- Sparse training via boosting pruning plasticity with neuroregeneration. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pp. 9908–9922, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/5227b6aaf294f5f027273aebf16015f2-Abstract.html.
- S2TA: exploiting structured sparsity for energy-efficient mobile CNN acceleration. In IEEE International Symposium on High-Performance Computer Architecture, HPCA 2022, Seoul, South Korea, April 2-6, 2022, pp. 573–586. IEEE, 2022. doi: 10.1109/HPCA53966.2022.00049. URL https://doi.org/10.1109/HPCA53966.2022.00049.
- HRBP: Hardware-friendly regrouping towards block-based pruning for sparse CNN training. In Conference on Parsimony and Learning (Proceedings Track), 2023. URL https://openreview.net/forum?id=VP1Xrdz0Bp.
- Natarajan, B. K. Sparse approximate solutions to linear systems. SIAM J. Comput., 24(2):227–234, 1995. ISSN 0097-5397. doi: 10.1137/S0097539792240406. URL https://doi.org/10.1137/S0097539792240406.
- AC/DC: alternating compressed/decompressed training of deep neural networks. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pp. 8557–8570, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/48000647b315f6f00f913caa757a70b3-Abstract.html.
- Channel permutations for N: M sparsity. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pp. 13316–13327, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/6e8404c3b93a9527c8db241a1846599a-Abstract.html.
- Hardness and algorithms for robust and sparse optimization. In International Conference on Machine Learning, pp. 17926–17944. PMLR, 2022.
- Log-sum enhanced sparse deep neural network. Neurocomputing, 407:206–220, 2020. doi: 10.1016/J.NEUCOM.2020.04.118. URL https://doi.org/10.1016/j.neucom.2020.04.118.
- Differentiable mask for pruning convolutional and recurrent networks. In 17th Conference on Computer and Robot Vision, CRV 2020, Ottawa, ON, Canada, May 13-15, 2020, pp. 222–229. IEEE, 2020. doi: 10.1109/CRV50864.2020.00037. URL https://doi.org/10.1109/CRV50864.2020.00037.
- An affine scaling methodology for best basis selection. IEEE Trans. Signal Process., 47(1):187–200, 1999. doi: 10.1109/78.738251. URL https://doi.org/10.1109/78.738251.
- Movement pruning: Adaptive sparsity by fine-tuning. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/eae15aabaa768ae4a5993a8a4f4fa6e4-Abstract.html.
- Winning the lottery with continuous sparsification. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/83004190b1793d7aa15f8d0d49a13eba-Abstract.html.
- Powerpropagation: A sparsity inducing weight reparameterisation. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pp. 28889–28903, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/f1e709e6aef16ba2f0cd6c7e4f52b9b6-Abstract.html.
- Shalev-Shwartz, S. Online learning and online convex optimization. Found. Trends Mach. Learn., 4(2):107–194, 2012. doi: 10.1561/2200000018. URL https://doi.org/10.1561/2200000018.
- Woodfisher: Efficient second-order approximation for neural network compression. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/d1ff1ec86b62cd5f3903ff19c3a326b2-Abstract.html.
- Ström, N. Sparse connection and pruning in large dynamic artificial neural networks. In Kokkinakis, G., Fakotakis, N., and Dermatas, E. (eds.), Fifth European Conference on Speech Communication and Technology, EUROSPEECH 1997, Rhodes, Greece, September 22-25, 1997, pp. 2807–2810. ISCA, 1997. doi: 10.21437/EUROSPEECH.1997-708. URL https://doi.org/10.21437/Eurospeech.1997-708.
- Connecting optimization and regularization paths. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 10631–10641, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/6459257ddab7b85bf4b57845e875e4d4-Abstract.html.
- Evaluating pruning methods. In Proceedings of the International Symposium on Artificial Neural Networks, pp. 20–25, 1995.
- Tugnait, J. K. Sparse-group log-sum penalized graphical model learning for time series. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022, pp. 5822–5826. IEEE, 2022. doi: 10.1109/ICASSP43922.2022.9747446. URL https://doi.org/10.1109/ICASSP43922.2022.9747446.
- Are straight-through gradients and soft-thresholding all you need for sparse training? In IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2023, Waikoloa, HI, USA, January 2-7, 2023, pp. 3797–3806. IEEE, 2023. doi: 10.1109/WACV56688.2023.00380. URL https://doi.org/10.1109/WACV56688.2023.00380.
- Implicit regularization for optimal sparse recovery. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 2968–2979, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/5cf21ce30208cfffaa832c6e44bb567d-Abstract.html.
- Attention is all you need. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5998–6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
- Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Korhonen, A., Traum, D. R., and Màrquez, L. (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 5797–5808. Association for Computational Linguistics, 2019. doi: 10.18653/V1/P19-1580. URL https://doi.org/10.18653/v1/p19-1580.
- Learning structured sparsity in deep neural networks. In Lee, D. D., Sugiyama, M., von Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 2074–2082, 2016. URL https://proceedings.neurips.cc/paper/2016/hash/41bfd20a38bb1b0bec75acf0845530a7-Abstract.html.
- Solving sparse linear inverse problems: Analysis of reweighted ℓ1 and ℓ2 methods. In SPARS'09 - Signal Processing with Adaptive Sparse Structured Representations, 2009.
- Autoprune: Automatic network pruning by regularizing auxiliary parameters. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 13681–13691, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/4efc9e02abdab6b6166251918570a307-Abstract.html.
- Structured pruning of convolutional neural networks via L1 regularization. IEEE Access, 7:106385–106394, 2019. doi: 10.1109/ACCESS.2019.2933032. URL https://doi.org/10.1109/ACCESS.2019.2933032.
- Sequential attention for feature selection. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=TTLLGx3eet.
- OptG: Optimizing gradient-driven criteria in network sparsity. arXiv preprint arXiv:2201.12826, 2022.
- An iterative threshold algorithm of log-sum regularization for sparse problem. IEEE Trans. Circuits Syst. Video Technol., 33(9):4728–4740, 2023. doi: 10.1109/TCSVT.2023.3247944. URL https://doi.org/10.1109/TCSVT.2023.3247944.