Making SGD Parameter-Free

Published 4 May 2022 in math.OC, cs.LG, and stat.ML (arXiv:2205.02160v3)

Abstract: We develop an algorithm for parameter-free stochastic convex optimization (SCO) whose rate of convergence is only a double-logarithmic factor larger than the optimal rate for the corresponding known-parameter setting. In contrast, the best previously known rates for parameter-free SCO are based on online parameter-free regret bounds, which contain unavoidable excess logarithmic terms compared to their known-parameter counterparts. Our algorithm is conceptually simple, has high-probability guarantees, and is also partially adaptive to unknown gradient norms, smoothness, and strong convexity. At the heart of our results is a novel parameter-free certificate for SGD step size choice, and a time-uniform concentration result that assumes no a priori bounds on the SGD iterates.
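The abstract's central idea — selecting an SGD step size from observed data rather than from known problem constants — can be illustrated with a toy sketch. The grid of candidate step sizes, the quadratic test problem, and the selection score below (average gradient norm over the tail of the trajectory) are all hypothetical stand-ins chosen for illustration; they are a deliberately crude proxy for the paper's actual certificate, not a reproduction of it.

```python
import numpy as np

def sgd_path(eta, x0, stoch_grad, T, rng):
    """Run fixed-step SGD for T steps; return the array of iterates."""
    x = x0.copy()
    path = [x.copy()]
    for _ in range(T):
        x = x - eta * stoch_grad(x, rng)
        path.append(x.copy())
    return np.array(path)

def tail_grad_score(path, stoch_grad, rng):
    """Toy selection score (NOT the paper's certificate): the average
    stochastic-gradient norm over the second half of the trajectory.
    A small score suggests the run made progress toward a minimizer."""
    tail = path[len(path) // 2:]
    return float(np.mean([np.linalg.norm(stoch_grad(x, rng)) for x in tail]))

def pick_step_size(x0, stoch_grad, T=200, rng=None):
    """Try geometrically spaced step sizes and keep the one whose
    trajectory scores best -- no problem constants are required."""
    rng = rng if rng is not None else np.random.default_rng(0)
    candidates = [2.0 ** k for k in range(-6, 4)]  # hypothetical grid
    scores = {}
    for eta in candidates:
        path = sgd_path(eta, x0, stoch_grad, T, rng)
        scores[eta] = tail_grad_score(path, stoch_grad, rng)
    return min(scores, key=scores.get)

# Toy problem: f(x) = 0.5 * ||x||^2 with additive gradient noise.
def noisy_grad(x, rng, sigma=0.1):
    return x + sigma * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x0 = 3.0 * np.ones(5)
eta_star = pick_step_size(x0, noisy_grad, rng=rng)
```

On this toy quadratic, step sizes at or above 2 cause the iterates to oscillate or diverge, which inflates their tail gradient norms, so the data-driven selection settles on a stable step size without ever being told the curvature or noise level.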


Authors (2)
