Making SGD Parameter-Free (2205.02160v3)

Published 4 May 2022 in math.OC, cs.LG, and stat.ML

Abstract: We develop an algorithm for parameter-free stochastic convex optimization (SCO) whose rate of convergence is only a double-logarithmic factor larger than the optimal rate for the corresponding known-parameter setting. In contrast, the best previously known rates for parameter-free SCO are based on online parameter-free regret bounds, which contain unavoidable excess logarithmic terms compared to their known-parameter counterparts. Our algorithm is conceptually simple, has high-probability guarantees, and is also partially adaptive to unknown gradient norms, smoothness, and strong convexity. At the heart of our results is a novel parameter-free certificate for SGD step size choice, and a time-uniform concentration result that assumes no a-priori bounds on SGD iterates.
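
The abstract describes a parameter-free step-size search built around a certificate computed from the SGD trajectory. Below is a minimal, hypothetical sketch of that general idea in Python: candidate step sizes are tried on a logarithmic grid and accepted or rejected by a simple movement-based test. The grid, the acceptance rule, the constant 10.0, and the toy objective are all assumptions made for illustration; they are not the certificate or the algorithm from the paper.

    import numpy as np

    def sgd(grad_oracle, x0, eta, T, rng):
        # Plain SGD with a fixed step size; returns the last iterate and the
        # average of the iterates.
        x = x0.copy()
        avg = np.zeros_like(x0)
        for _ in range(T):
            g = grad_oracle(x, rng)
            x = x - eta * g
            avg += x / T
        return x, avg

    def select_step_size(grad_oracle, x0, T, etas, rng):
        # Try step sizes on a logarithmic grid and keep the largest one whose
        # iterates stay within an assumed movement budget -- a stand-in for a
        # step-size certificate (illustrative heuristic, not the paper's rule).
        best = None
        for eta in etas:  # etas assumed sorted in increasing order
            x_last, x_avg = sgd(grad_oracle, x0, eta, T, rng)
            movement = np.linalg.norm(x_last - x0)
            if movement <= 10.0 * eta * np.sqrt(T):
                best = (eta, x_avg)
        return best

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Toy problem: f(x) = 0.5 * ||x||^2 with additive Gaussian gradient noise.
        grad_oracle = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
        x0 = np.ones(5)
        etas = [2.0 ** k for k in range(-10, 1)]
        result = select_step_size(grad_oracle, x0, T=1000, etas=etas, rng=rng)
        if result is not None:
            eta, x_hat = result
            print("selected step size:", eta)
            print("objective at averaged iterate:", 0.5 * float(x_hat @ x_hat))

The key point the abstract makes is that such a search can be done without knowing problem parameters (distance to the optimum, gradient norms) in advance, paying only a double-logarithmic overhead over the known-parameter rate; the heuristic test above merely illustrates the shape of a trajectory-based acceptance criterion.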

