Convex Relaxations of ReLU Neural Networks Approximate Global Optima in Polynomial Time (2402.03625v3)
Abstract: In this paper, we study the optimality gap between two-layer ReLU networks regularized with weight decay and their convex relaxations. We show that when the training data is random, the relative optimality gap between the original problem and its relaxation can be bounded by a factor of O(√(log n)), where n is the number of training samples. A simple application of this bound yields a tractable polynomial-time algorithm that is guaranteed to solve the original non-convex problem up to a logarithmic factor. Moreover, under mild assumptions, we show that local gradient methods converge to a point with low training loss with high probability. Our result is an exponential improvement over existing results and sheds new light on why local gradient methods work well.
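To make the object being relaxed concrete, the following is a minimal sketch, not the authors' code, of the kind of convex relaxation the abstract refers to: the two-layer ReLU training problem with weight decay rewritten as a group-lasso problem over ReLU activation patterns, in the style of the Pilanci-Ergen convex reformulation, with the sign-consistency cone constraints of the exact convex program dropped. It assumes numpy and cvxpy; the random-gate sampling of activation patterns, the problem sizes, and the regularization value are illustrative choices, not taken from the paper.

```python
# Minimal sketch (assumptions noted above): an unconstrained convex relaxation of
# two-layer ReLU training with weight decay, written as a group-lasso problem over
# sampled ReLU activation patterns D_i.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, P = 50, 10, 20      # samples, input dimension, number of sampled activation patterns
beta = 0.1                # weight-decay (group-lasso) regularization strength

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Sample diagonal activation patterns D_i = diag(1[X g_i >= 0]) from random Gaussian gates g_i.
G = rng.standard_normal((d, P))
D = (X @ G >= 0).astype(float)          # n x P matrix of 0/1 patterns

V = cp.Variable((d, P))                  # "positive" neuron weights v_i
W = cp.Variable((d, P))                  # "negative" neuron weights w_i

# Prediction of the relaxed model: sum_i D_i X (v_i - w_i).
pred = cp.sum(cp.multiply(D, X @ (V - W)), axis=1)

# Squared loss plus a sum of Euclidean norms, the convex counterpart of weight decay.
loss = cp.sum_squares(pred - y)
reg = beta * (cp.sum(cp.norm(V, 2, axis=0)) + cp.sum(cp.norm(W, 2, axis=0)))

# Relaxation: the cone constraints (2 D_i - I) X v_i >= 0, (2 D_i - I) X w_i >= 0 of the
# exact convex reformulation are dropped, leaving an unconstrained convex program.
prob = cp.Problem(cp.Minimize(loss + reg))
prob.solve()
print("relaxed objective value:", prob.value)
```

The relaxation is a finite-dimensional convex program solvable by standard conic solvers in polynomial time; the paper's guarantee is that, for random training data, its optimal value is within an O(√(log n)) relative factor of the non-convex optimum, which is what turns solving the relaxation into an approximation algorithm for the original ReLU training problem.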