
What to Do When Your Discrete Optimization Is the Size of a Neural Network? (2402.10339v1)

Published 15 Feb 2024 in cs.LG

Abstract: Machine learning applications using neural networks often involve solving discrete optimization problems, such as in pruning, parameter-isolation-based continual learning, and the training of binary networks. Yet these problems are combinatorial in nature and not amenable to gradient-based optimization. Additionally, classical approaches used in discrete settings do not scale well to large neural networks, forcing scientists and empiricists to rely on alternative methods. Among these, two distinct sources of top-down information can be used to lead the model to good solutions: (1) extrapolating gradient information from points outside of the solution set, and (2) comparing evaluations between members of a subset of the valid solutions. We take continuation path (CP) methods to represent pure use of the former and Monte Carlo (MC) methods to represent the latter, while noting that some hybrid methods combine the two. The main goal of this work is to compare both approaches. For that purpose, we first give an overview of the two classes while also discussing some of their drawbacks analytically. Then, in the experimental section, we compare their performance, starting with smaller microworld experiments, which allow more fine-grained control of problem variables, and gradually moving towards larger problems, including neural network regression and neural network pruning for image classification, where we additionally compare against magnitude-based pruning.
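
The contrast between the two families can be made concrete on a toy problem. Below is a minimal, illustrative sketch (not taken from the paper) that applies both ideas to a small pseudo-Boolean quadratic objective f(x) = x'Qx + c'x over x in {0,1}^n: a continuation-path-style solver that anneals a sigmoid relaxation and uses gradients taken at points outside the solution set, and a Monte Carlo (REINFORCE-style) solver that only uses evaluations of f at sampled binary points. The function names (cp_optimize, mc_optimize), step sizes, sample counts, and the linear temperature schedule are hypothetical choices for the demo, not the authors' experimental setup.

# Toy comparison of continuation-path (CP) vs. Monte Carlo (MC) optimization
# of a pseudo-Boolean objective. Illustrative sketch only.
import numpy as np

rng = np.random.default_rng(0)
n = 20
Q = rng.normal(size=(n, n))
Q = 0.5 * (Q + Q.T)          # symmetric quadratic coefficients
c = rng.normal(size=n)

def f(x):
    """Pseudo-Boolean objective, also defined at relaxed (fractional) x."""
    return x @ Q @ x + c @ x

def grad_f(x):
    """Gradient of the relaxed objective; only the CP solver uses this."""
    return (Q + Q.T) @ x + c

def cp_optimize(steps=2000, lr=0.05, t0=1.0, t1=0.05):
    """CP style: descend a temperature-annealed sigmoid relaxation,
    using gradients evaluated at points outside {0,1}^n."""
    theta = np.zeros(n)
    for s in range(steps):
        t = t0 + (t1 - t0) * s / (steps - 1)       # linear temperature schedule
        z = np.clip(theta / t, -30.0, 30.0)        # avoid overflow in exp
        x = 1.0 / (1.0 + np.exp(-z))               # relaxed point in (0,1)^n
        dx_dtheta = x * (1.0 - x) / t              # diagonal sigmoid Jacobian
        theta -= lr * grad_f(x) * dx_dtheta
    return (theta > 0).astype(float)               # round to a binary solution

def mc_optimize(steps=2000, lr=0.05, samples=8):
    """MC style: REINFORCE on Bernoulli(sigmoid(theta)); only evaluations
    of f at valid binary points are used."""
    theta = np.zeros(n)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-theta))
        xs = (rng.random((samples, n)) < p).astype(float)
        fs = np.array([f(x) for x in xs])
        baseline = fs.mean()                       # simple variance reduction
        # score-function gradient: E[(f(x) - b) * d log p(x) / d theta]
        g = ((fs - baseline)[:, None] * (xs - p)).mean(axis=0)
        theta -= lr * g
    return (theta > 0).astype(float)

if __name__ == "__main__":
    for name, solver in [("CP (relaxation)", cp_optimize),
                         ("MC (REINFORCE)", mc_optimize)]:
        x = solver()
        print(f"{name:18s} f(x) = {f(x):8.3f}")

On a toy objective like this both solvers usually reach solutions of comparable quality; the point of the sketch is only the difference in the information each consumes, namely gradients at relaxed points for CP versus function evaluations at binary points for MC.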

