A Unified Approach to Controlling Implicit Regularization via Mirror Descent (2306.13853v2)
Abstract: Inspired by the remarkable success of large neural networks, there has been significant interest in understanding the generalization performance of over-parameterized models. Substantial efforts have been invested in characterizing how optimization algorithms impact generalization through their "preferred" solutions, a phenomenon commonly referred to as implicit regularization. In particular, it has been argued that gradient descent (GD) induces an implicit $\ell_2$-norm regularization in regression and classification problems. However, existing characterizations of implicit regularization are confined to either a specific geometry or a particular class of learning problems, indicating the lack of a general approach for controlling implicit regularization. To address this, we present a unified approach using mirror descent (MD), a notable generalization of GD, to control implicit regularization in both regression and classification settings. More specifically, we show that MD with the general class of homogeneous potential functions converges in direction to a generalized maximum-margin solution for linear classification problems, thereby answering a long-standing question in the classification setting. Further, we show that MD can be implemented efficiently and enjoys fast convergence under suitable conditions. Through comprehensive experiments, we demonstrate that MD is a versatile method for producing learned models with different regularizers, which in turn exhibit different generalization performance.
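To make the mirror descent update concrete, the sketch below (our own illustration, not code from the paper) runs MD with the homogeneous potential $\psi(w) = \tfrac{1}{q}\|w\|_q^q$ on the exponential loss for a linear classifier, using the standard dual-space update $\nabla\psi(w_{t+1}) = \nabla\psi(w_t) - \eta\,\nabla L(w_t)$. The function name, loss choice, and hyperparameters are assumptions made for illustration.

```python
import numpy as np

def mirror_descent_qnorm(X, y, q=3.0, lr=0.1, steps=5000):
    """Illustrative sketch of mirror descent with the homogeneous potential
    psi(w) = (1/q) * ||w||_q^q on the exponential loss for a linear classifier.

    Assumptions (not from the paper): exponential loss, fixed step size,
    zero initialization. X: (n, d) features, y: (n,) labels in {-1, +1}.
    """
    n, d = X.shape
    z = np.zeros(d)  # dual (mirror) variable, z = grad psi(w)
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        # Gradient of L(w) = (1/n) * sum_i exp(-y_i <w, x_i>)
        grad = -(X * (y * np.exp(-margins))[:, None]).sum(axis=0) / n
        # Gradient step in the dual space
        z -= lr * grad
        # Map back through the inverse mirror map (grad psi)^{-1}:
        # w_i = sign(z_i) * |z_i|^{1/(q-1)}
        w = np.sign(z) * np.abs(z) ** (1.0 / (q - 1.0))
    return w
```

On linearly separable toy data, tracking the normalized iterate $w_t/\|w_t\|$ from such a sketch is one way to observe empirically the directional convergence to a generalized maximum-margin solution that the abstract describes; setting $q = 2$ recovers plain gradient descent and its $\ell_2$-type bias.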