Path Regularization: A Convexity and Sparsity Inducing Regularization for Parallel ReLU Networks (2110.09548v4)

Published 18 Oct 2021 in cs.LG, cs.AI, and stat.ML

Abstract: Understanding the fundamental principles behind the success of deep neural networks is one of the most important open questions in the current literature. To this end, we study the training problem of deep neural networks and introduce an analytic approach to unveil hidden convexity in the optimization landscape. We consider a deep parallel ReLU network architecture, which also includes standard deep networks and ResNets as its special cases. We then show that pathwise regularized training problems can be represented as exact convex optimization problems. We further prove that the equivalent convex problem is regularized via a group sparsity inducing norm. Thus, a path regularized parallel ReLU network can be viewed as a parsimonious convex model in high dimensions. More importantly, since the original training problem may not be trainable in polynomial time, we propose an approximate algorithm with a fully polynomial-time complexity in all data dimensions. Then, we prove strong global optimality guarantees for this algorithm. We also provide experiments corroborating our theory.
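To make the setup in the abstract concrete, the sketch below builds a small parallel ReLU network (several two-layer ReLU branches whose outputs are summed) and evaluates a path-style regularizer that multiplies weight magnitudes along each input-to-output path. The branch count, the specific ℓ1-path penalty, and the squared loss are illustrative assumptions for this sketch; the paper's exact architecture, regularizer, and convex reformulation are given in the full text and are not reproduced here.

```python
# Illustrative sketch only: a parallel ReLU network with a path-style
# regularizer, loosely following the setup described in the abstract.
# The branch count, penalty form, and loss below are assumptions for
# demonstration, not the paper's exact formulation.
import numpy as np

rng = np.random.default_rng(0)

n, d, m, K = 64, 10, 8, 4          # samples, input dim, branch width, branches
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Each parallel branch k is a two-layer ReLU subnetwork: x -> relu(x @ W1[k]) @ w2[k]
W1 = [rng.standard_normal((d, m)) / np.sqrt(d) for _ in range(K)]
w2 = [rng.standard_normal(m) / np.sqrt(m) for _ in range(K)]

def forward(X):
    """Sum the outputs of the K parallel two-layer ReLU branches."""
    return sum(np.maximum(X @ W1[k], 0.0) @ w2[k] for k in range(K))

def path_penalty():
    """l1-style path penalty: sum over all input->hidden->output paths of
    the product of absolute weights along the path."""
    return sum(np.sum(np.abs(W1[k]) * np.abs(w2[k])[None, :]) for k in range(K))

beta = 1e-3
objective = 0.5 * np.mean((forward(X) - y) ** 2) + beta * path_penalty()
print(f"regularized objective: {objective:.4f}")
```

The abstract's claim is that training with this kind of pathwise penalty admits an exact convex reformulation whose regularizer is a group sparsity inducing norm, so only a few branches or paths remain active at the optimum; see the paper for the precise statement and the polynomial-time approximate algorithm.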

