The Convex Landscape of Neural Networks: Characterizing Global Optima and Stationary Points via Lasso Models (2312.12657v1)

Published 19 Dec 2023 in cs.LG, cs.AI, math.OC, and stat.ML

Abstract: Due to the non-convex nature of training Deep Neural Network (DNN) models, their effectiveness relies on the use of non-convex optimization heuristics. Traditional methods for training DNNs often require costly empirical methods to produce successful models and do not have a clear theoretical foundation. In this study, we examine the use of convex optimization theory and sparse recovery models to refine the training process of neural networks and provide a better interpretation of their optimal weights. We focus on training two-layer neural networks with piecewise linear activations and demonstrate that they can be formulated as a finite-dimensional convex program. These programs include a regularization term that promotes sparsity, which constitutes a variant of group Lasso. We first utilize semi-infinite programming theory to prove strong duality for finite width neural networks and then we express these architectures equivalently as high dimensional convex sparse recovery models. Remarkably, the worst-case complexity to solve the convex program is polynomial in the number of samples and number of neurons when the rank of the data matrix is bounded, which is the case in convolutional networks. To extend our method to training data of arbitrary rank, we develop a novel polynomial-time approximation scheme based on zonotope subsampling that comes with a guaranteed approximation ratio. We also show that all the stationary points of the nonconvex training objective can be characterized as the global optimum of a subsampled convex program. Our convex models can be trained using standard convex solvers without resorting to heuristics or extensive hyper-parameter tuning, unlike non-convex methods. Through extensive numerical experiments, we show that convex models can outperform traditional non-convex methods and are not sensitive to optimizer hyperparameters.

Authors (2)
  1. Tolga Ergen (23 papers)
  2. Mert Pilanci (102 papers)
Citations (2)

Summary

The Convex Landscape of Neural Networks: Characterizing Global Optima and Stationary Points via Lasso Models

The paper "The Convex Landscape of Neural Networks: Characterizing Global Optima and Stationary Points via Lasso Models" by Tolga Ergen and Mert Pilanci addresses a prominent challenge in machine learning: the non-convex nature of training deep neural networks (DNNs). Traditional methods for training DNNs rely on non-convex optimization heuristics, which often require empirical methods and lack a solid theoretical foundation. The authors present a novel approach utilizing convex optimization theory and sparse recovery models to refine neural network training, particularly focusing on two-layer neural networks with piecewise linear activations.

Major Contributions

  1. Convex Formulations for Neural Networks: The paper introduces a convex analytical framework that recasts the training of two-layer neural networks with piecewise linear activations (ReLU, leaky ReLU, and absolute value) as finite-dimensional convex programs. These programs incorporate a sparsity-promoting regularizer that is a variant of the group Lasso. The approach uses semi-infinite programming theory to establish strong duality for finite-width networks and expresses these architectures equivalently as high-dimensional convex sparse recovery models; a minimal code sketch of this formulation appears after this list.
  2. Computational Complexity: The worst-case complexity of solving this convex program is polynomial in the number of samples and the number of neurons when the rank of the data matrix is bounded, which is the case for convolutional networks. For training data of arbitrary rank, the authors develop a polynomial-time approximation scheme based on zonotope subsampling that comes with a guaranteed approximation ratio. They further show that every stationary point of the non-convex training objective can be characterized as the global optimum of a subsampled convex program.
  3. Convex Models and Hyperparameters: The convex models can be trained with standard convex solvers, without non-convex heuristics or extensive hyperparameter tuning. This robustness follows from the convexity of the problem, which makes the solutions insensitive to choices such as initialization, batch size, and step-size schedule.
  4. Empirical Validation: Through extensive numerical experiments, the authors demonstrate that the convex models can outperform traditional non-convex training while being far less sensitive to optimizer hyperparameters.
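
The sketch below (referenced in item 1) shows how such a convex program can be assembled with an off-the-shelf modeling tool: ReLU activation patterns D_i = diag(1[Xu >= 0]) are subsampled from random directions, and a group-Lasso penalized least-squares problem with cone constraints is solved over per-pattern weight vectors. The function names, the random-direction sampling heuristic, and the squared loss are illustrative assumptions, not the authors' code.

```python
import numpy as np
import cvxpy as cp


def sample_relu_patterns(X, num_directions=100, seed=0):
    """Sample distinct ReLU activation patterns D_i = diag(1[X u >= 0]) from
    random Gaussian directions u. A simple subsampling heuristic standing in
    for the paper's enumeration / zonotope-based sampling of arrangement patterns."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    U = rng.standard_normal((d, num_directions))
    signs = (X @ U >= 0).T                         # one candidate pattern per direction
    return np.unique(signs, axis=0).astype(float)  # rows are the diagonals of D_i


def convex_two_layer_relu(X, y, beta=1e-3, num_directions=100):
    """Group-Lasso penalized convex program for two-layer ReLU training
    (an illustrative sketch, not the authors' implementation)."""
    n, d = X.shape
    D = sample_relu_patterns(X, num_directions)
    P = D.shape[0]

    V = cp.Variable((P, d))                        # per-pattern weights, positive branch
    W = cp.Variable((P, d))                        # per-pattern weights, negative branch

    preds, constraints, penalty = 0, [], 0
    for i in range(P):
        Di = D[i][:, None]                         # diagonal of D_i as a column vector
        preds = preds + (Di * X) @ (V[i] - W[i])
        # cone constraints ensuring the sampled activation pattern is realizable
        constraints += [((2 * Di - 1) * X) @ V[i] >= 0,
                        ((2 * Di - 1) * X) @ W[i] >= 0]
        penalty = penalty + cp.norm(V[i]) + cp.norm(W[i])   # group-Lasso term

    objective = 0.5 * cp.sum_squares(preds - y) + beta * penalty
    problem = cp.Problem(cp.Minimize(objective), constraints)
    problem.solve()
    return V.value, W.value, D, problem.value


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((50, 5))
    y = np.maximum(X @ rng.standard_normal(5), 0.0) + 0.1 * rng.standard_normal(50)
    V, W, D, obj = convex_two_layer_relu(X, y, beta=1e-2)
    print("sampled patterns:", D.shape[0], "objective:", round(obj, 4))
```

Because the problem is a second-order cone program, any standard convex solver returns its global optimum, which is the practical point made in item 3 above.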

Theoretical Insights and Extensions

  • Extension to CNNs:

The paper also extends the proposed convex approach to convolutional neural networks (CNNs). For CNNs with global average pooling, the authors show that the training problem can be reformulated as a standard fully connected network training problem. For linear CNNs, they derive a semi-definite program (SDP) formulation in which the weights are optimized under nuclear-norm regularization.
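
As a rough illustration of the nuclear-norm regularization mentioned above, the snippet below fits a shared weight matrix across synthetic "patch" matrices under a nuclear-norm penalty. The patch construction, the prediction rule trace(X_i Z), and the loss are assumptions for illustration and do not reproduce the paper's exact SDP formulation.

```python
import numpy as np
import cvxpy as cp

# Illustrative nuclear-norm regularized fit standing in for the convex model of a
# linear CNN. The patch tensor and prediction rule are simplifying assumptions.
rng = np.random.default_rng(0)
n, k, d = 40, 3, 8                              # samples, patches per sample, patch dim
X = rng.standard_normal((n, k, d))              # hypothetical patch matrices X_i (k x d)
y = rng.standard_normal(n)

Z = cp.Variable((d, k))                         # shared (factored) weight matrix
preds = cp.hstack([cp.trace(X[i] @ Z) for i in range(n)])
beta = 1e-2
objective = 0.5 * cp.sum_squares(preds - y) + beta * cp.normNuc(Z)  # nuclear norm
cp.Problem(cp.Minimize(objective)).solve()
print("effective rank of learned weights:", np.linalg.matrix_rank(Z.value, tol=1e-3))
```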

  • Handling Bias Terms and Different Regularizations:

The theoretical framework is extended to accommodate bias terms in neurons and diverse regularization mechanisms, including ℓ_p-norm regularization. This adaptability illustrates the framework's broad applicability across various neural network architectures.
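
One common way bias terms can be handled in such convex reformulations is to absorb them into the weights by appending a constant column to the data matrix before forming the convex program. The helper below sketches this trick; it is offered as an illustrative assumption, not a restatement of the paper's exact construction.

```python
import numpy as np

def augment_with_bias(X):
    """Append a constant column so that w^T x + b becomes w_aug^T [x, 1]; a common
    way to fold neuron biases into the weight vectors (illustrative)."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

# Usage with the earlier sketch: convex_two_layer_relu(augment_with_bias(X), y)
```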

Practical Implications and Future Directions

The ability to train neural networks using convex optimization opens new avenues for ensuring global optimality and robustness in model training. The paper proposes that convex models inherently possess a bias towards simpler solutions, promoting both sparsity and interpretability through the equivalent Lasso formulations. Future research might explore deeper neural networks, recurrent architectures, and transformer models, leveraging the strong duality and convex representations highlighted by this paper.

Conclusion

The insights from "The Convex Landscape of Neural Networks" challenge prevailing methods for neural network training. By framing the problem within a convex optimization context, the paper not only guarantees global optimality, provided that the network adheres to the derived constraints, but it also provides valuable interpretative mechanisms. The implications for both theoretical advancements and practical applications in AI signal a critical step forward in making deep learning frameworks more robust, efficient, and explainable.
