Making SGD Parameter-Free (2205.02160v3)
Abstract: We develop an algorithm for parameter-free stochastic convex optimization (SCO) whose rate of convergence is only a double-logarithmic factor larger than the optimal rate for the corresponding known-parameter setting. In contrast, the best previously known rates for parameter-free SCO are based on online parameter-free regret bounds, which contain unavoidable excess logarithmic terms compared to their known-parameter counterparts. Our algorithm is conceptually simple, has high-probability guarantees, and is also partially adaptive to unknown gradient norms, smoothness, and strong convexity. At the heart of our results is a novel parameter-free certificate for SGD step size choice, and a time-uniform concentration result that assumes no a priori bounds on SGD iterates.
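To make the idea of a "certificate for SGD step size choice" concrete, here is a minimal, hedged Python sketch: it runs plain SGD over a geometric grid of candidate step sizes and keeps the largest step size whose iterate displacement passes a simple movement-based test. The function names (`sgd`, `select_step_size`) and the specific acceptance test are illustrative assumptions, not the paper's actual algorithm or its certificate.

```python
import numpy as np

# Illustrative sketch only: a toy search over SGD step sizes driven by a
# movement-based acceptance test. All names and the test itself are hypothetical
# stand-ins, not the algorithm from the paper.

def sgd(grad_oracle, x0, eta, T, rng):
    """Run T steps of SGD with fixed step size eta.

    Returns the displacement of the final iterate from x0 and the sum of
    squared stochastic gradient norms observed along the way."""
    x = x0.copy()
    grad_norm_sq = 0.0
    for _ in range(T):
        g = grad_oracle(x, rng)
        grad_norm_sq += np.dot(g, g)
        x = x - eta * g
    return np.linalg.norm(x - x0), grad_norm_sq

def select_step_size(grad_oracle, x0, T, etas, rng):
    """Scan candidate step sizes in increasing order and keep the largest one
    whose iterate displacement stays below eta * sqrt(sum of squared gradient
    norms). This acceptance test is purely illustrative."""
    best_eta = etas[0]
    for eta in etas:
        displacement, grad_norm_sq = sgd(grad_oracle, x0, eta, T, rng)
        if displacement <= eta * np.sqrt(grad_norm_sq):
            best_eta = eta
        else:
            break
    return best_eta

# Toy usage: noisy gradients of f(x) = 0.5 * ||x||^2.
rng = np.random.default_rng(0)
grad_oracle = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
x0 = np.ones(5)
etas = [2.0 ** k for k in range(-8, 2)]  # geometric grid of candidate step sizes
print(select_step_size(grad_oracle, x0, T=200, etas=etas, rng=rng))
```

A scheme of this shape needs no prior knowledge of the distance to the optimum or the gradient norm bound; the cost of scanning a geometric grid is only logarithmic in the range of step sizes considered.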