Robust Implicit Regularization via Weight Normalization (2305.05448v4)
Abstract: Overparameterized models may have many interpolating solutions; implicit regularization refers to the hidden preference of a particular optimization method towards a certain interpolating solution among the many. A now well-established line of work has shown that (stochastic) gradient descent tends to have an implicit bias towards low-rank and/or sparse solutions when used to train deep linear networks, explaining to some extent why overparameterized neural network models trained by gradient descent tend to generalize well in practice. However, existing theory for square-loss objectives often requires very small initialization of the trainable weights, which is at odds with the larger scales at which weights are initialized in practice for faster convergence and better generalization. In this paper, we aim to close this gap by incorporating and analyzing gradient flow (the continuous-time counterpart of gradient descent) with weight normalization, where the weight vector is reparameterized in terms of polar coordinates and gradient flow is applied to the polar coordinates. By analyzing key invariants of the gradient flow and using the Łojasiewicz theorem, we show that weight normalization also has an implicit bias towards sparse solutions in the diagonal linear model, but that, in contrast to plain gradient flow, weight normalization yields a robust bias that persists even when the weights are initialized at a practically large scale. Experiments suggest that both convergence speed and robustness of the implicit bias improve dramatically when weight normalization is used in overparameterized diagonal linear network models.
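The sketch below illustrates the general setup the abstract describes: gradient descent (with a small step size, as a stand-in for gradient flow) on an overparameterized diagonal linear model for sparse recovery, with the factor vectors reparameterized via weight normalization in the style of Salimans and Kingma (w = g · v/‖v‖). It is a minimal illustration, not the paper's exact polar-coordinate construction or analysis; the problem sizes, step size, initialization scale, and the particular factorization w = u ⊙ v are illustrative assumptions.

```python
# Minimal sketch (assumptions noted above, not the paper's exact construction):
# gradient descent on a diagonal linear model w = u * v for sparse recovery,
# where each factor is weight-normalized as g * a / ||a|| (Salimans & Kingma).
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
n, d, s = 30, 100, 3                              # measurements, dimension, sparsity (illustrative)
k1, k2, k3 = jax.random.split(key, 3)

X = jax.random.normal(k1, (n, d)) / jnp.sqrt(n)   # underdetermined sensing matrix
w_star = jnp.zeros(d).at[:s].set(1.0)             # sparse ground truth
y = X @ w_star                                    # noiseless measurements

def effective_weight(params):
    """Map weight-normalized factors to the effective linear predictor w."""
    a, b, g_u, g_v = params
    u = g_u * a / jnp.linalg.norm(a)
    v = g_v * b / jnp.linalg.norm(b)
    return u * v                                   # Hadamard (diagonal) product

def loss(params):
    return 0.5 * jnp.sum((X @ effective_weight(params) - y) ** 2)

# Initialization at a "practically large" scale alpha (an illustrative choice);
# the claim under study is that the sparse bias survives such initializations.
alpha = 1.0
params = [alpha * jax.random.normal(k2, (d,)),
          alpha * jax.random.normal(k3, (d,)),
          jnp.array(alpha), jnp.array(alpha)]

step = 1e-2                                        # small step size ~ gradient flow
grad_fn = jax.jit(jax.grad(loss))
for _ in range(20000):
    grads = grad_fn(params)
    params = [p - step * g for p, g in zip(params, grads)]

w_hat = effective_weight(params)
print("recovery error:", float(jnp.linalg.norm(w_hat - w_star)))
```

Comparing the recovered w_hat against a run with the plain (un-normalized) factorization at the same initialization scale is one way to probe the robustness claim empirically.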