Mirror Descent Maximizes Generalized Margin and Can Be Implemented Efficiently (2205.12808v2)

Published 25 May 2022 in cs.LG

Abstract: Driven by the empirical success and wide use of deep neural networks, understanding the generalization performance of overparameterized models has become an increasingly popular question. To this end, there has been substantial effort to characterize the implicit bias of the optimization algorithms used, such as gradient descent (GD), and the structural properties of their preferred solutions. This paper answers an open question in this literature: For the classification setting, what solution does mirror descent (MD) converge to? Specifically, motivated by its efficient implementation, we consider the family of mirror descent algorithms with potential function chosen as the $p$-th power of the $\ell_p$-norm, which is an important generalization of GD. We call this algorithm $p$-$\textsf{GD}$. For this family, we characterize the solutions it obtains and show that it converges in direction to a generalized maximum-margin solution with respect to the $\ell_p$-norm for linearly separable classification. While the MD update rule is in general expensive to compute and perhaps not suitable for deep learning, $p$-$\textsf{GD}$ is fully parallelizable in the same manner as SGD and can be used to train deep neural networks with virtually no additional computational overhead. Using comprehensive experiments with both linear and deep neural network models, we demonstrate that $p$-$\textsf{GD}$ can noticeably affect the structure and the generalization performance of the learned models.
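For concreteness, below is a minimal NumPy sketch of the $p$-$\textsf{GD}$ update as described in the abstract: a mirror descent step with potential $\tfrac{1}{p}\|w\|_p^p$, whose mirror map and its inverse act entrywise, so each step costs essentially the same as an SGD step. The training loop, data, and hyperparameters are illustrative assumptions rather than the paper's implementation; setting $p=2$ recovers plain gradient descent.

```python
# Sketch of the p-GD update: mirror descent with potential psi(w) = (1/p) * ||w||_p^p.
# Both the mirror map (grad of psi) and its inverse are entrywise, so the update is
# fully parallelizable. Assumes p > 1; p = 2 reduces to ordinary gradient descent.
import numpy as np

def mirror_map(w, p):
    # gradient of psi: sign(w) * |w|^(p-1), applied entrywise
    return np.sign(w) * np.abs(w) ** (p - 1)

def inverse_mirror_map(z, p):
    # gradient of the conjugate potential: sign(z) * |z|^(1/(p-1)), for p > 1
    return np.sign(z) * np.abs(z) ** (1.0 / (p - 1))

def p_gd_step(w, grad, lr, p):
    # take a gradient step in the dual (mirror) space, then map back to parameters
    z = mirror_map(w, p) - lr * grad
    return inverse_mirror_map(z, p)

# Illustrative usage: logistic loss on synthetic linearly separable data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.sign(X @ rng.normal(size=10))   # labels separable by construction
w = np.zeros(10)
p, lr = 3.0, 0.1
for _ in range(1000):
    margins = y * (X @ w)
    # gradient of mean logistic loss; clip avoids overflow in exp for large margins
    weights = y / (1.0 + np.exp(np.clip(margins, -30, 30)))
    grad = -(X * weights[:, None]).mean(axis=0)
    w = p_gd_step(w, grad, lr, p)
```

For linearly separable data as above, the abstract's result says the direction $w/\|w\|$ produced by this iteration converges to a generalized maximum-margin solution with respect to the $\ell_p$-norm, rather than the $\ell_2$ max-margin direction selected by ordinary gradient descent.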
