
Revisiting Step-Size Assumptions in Stochastic Approximation (2405.17834v2)

Published 28 May 2024 in math.ST, stat.ML, and stat.TH

Abstract: Many machine learning and optimization algorithms are built upon the framework of stochastic approximation (SA), for which the selection of step-size (or learning rate) is essential for success. For the sake of clarity, this paper focuses on the special case $\alpha_n = \alpha_0 n^{-\rho}$ at iteration $n$, with $\rho \in [0,1]$ and $\alpha_0>0$ design parameters. It is most common in practice to take $\rho=0$ (constant step-size), while in more theoretically oriented papers a vanishing step-size is preferred. In particular, with $\rho \in (1/2, 1)$ it is known that on applying the averaging technique of Polyak and Ruppert, the mean-squared error (MSE) converges at the optimal rate of $O(1/n)$ and the covariance in the central limit theorem (CLT) is minimal in a precise sense. The paper revisits step-size selection in a general Markovian setting. Under readily verifiable assumptions, the following conclusions are obtained provided $0<\rho<1$: $\bullet$ Parameter estimates converge with probability one, and also in $L_p$ for any $p\ge 1$. $\bullet$ The MSE may converge very slowly for small $\rho$, of order $O(\alpha_n^2)$ even with averaging. $\bullet$ For linear stochastic approximation the source of slow convergence is identified: for any $\rho\in (0,1)$, averaging results in estimates for which the error $\textit{covariance}$ vanishes at the optimal rate, and moreover the CLT covariance is optimal in the sense of Polyak and Ruppert. However, necessary and sufficient conditions are obtained under which the $\textit{bias}$ converges to zero at rate $O(\alpha_n)$. This is the first paper to obtain such strong conclusions while allowing for $\rho \le 1/2$. A major conclusion is that the choice of $\rho = 0$ or even $\rho < 1/2$ is justified only in select settings; in general, bias may preclude fast convergence.
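The step-size schedule and averaging technique discussed in the abstract can be illustrated with a short sketch. This is not the paper's analysis, only a minimal linear SA example under assumed parameters: the recursion $\theta_{n+1} = \theta_n + \alpha_n (b_{n+1} - A_{n+1}\theta_n)$ with $\alpha_n = \alpha_0 n^{-\rho}$, combined with a running Polyak-Ruppert average. The matrix `Abar`, noise levels, and the choice $\rho = 0.6$ are arbitrary values chosen for illustration.

```python
import numpy as np

# Minimal sketch (illustrative assumptions throughout): linear stochastic
# approximation with step size alpha_n = alpha_0 * n**(-rho) and
# Polyak-Ruppert averaging. Goal: estimate theta* solving
# E[A] theta* = E[b], i.e. theta* = Abar^{-1} bbar.

rng = np.random.default_rng(0)

Abar = np.array([[2.0, 0.5], [0.0, 1.5]])  # mean of random matrix A_n (assumed)
bbar = np.array([1.0, -1.0])               # mean of random vector b_n (assumed)
theta_star = np.linalg.solve(Abar, bbar)

alpha0, rho = 0.5, 0.6   # rho in (1/2, 1): the classical Polyak-Ruppert regime
N = 50_000

theta = np.zeros(2)
theta_avg = np.zeros(2)
for n in range(1, N + 1):
    # Noisy observations of (Abar, bbar):
    A_n = Abar + 0.1 * rng.standard_normal((2, 2))
    b_n = bbar + 0.1 * rng.standard_normal(2)
    alpha_n = alpha0 * n ** (-rho)
    theta = theta + alpha_n * (b_n - A_n @ theta)  # SA recursion
    theta_avg += (theta - theta_avg) / n           # running PR average

print(np.linalg.norm(theta_avg - theta_star))
```

Per the abstract, the averaged estimate's covariance vanishes at the optimal rate for any $\rho \in (0,1)$; the subtlety the paper addresses is that for small $\rho$ the *bias* may dominate, so a vanishing final error in a toy run like this does not by itself justify aggressive step sizes.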

References (39)
  1. S. Allmeier and N. Gast. Computing the bias of constant-step stochastic approximation with Markovian noise. arXiv preprint arXiv:2405.14285, 2024.
  2. F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate $o(1/n)$. In Proc. Advances in Neural Information Processing Systems, volume 26, pages 773–781, 2013.
  3. A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media, Berlin Heidelberg, 2012.
  4. J. Bhandari, D. Russo, and R. Singal. A finite time analysis of temporal difference learning with linear function approximation. In Conference On Learning Theory, pages 1691–1692, 2018.
  5. V. S. Borkar, S. Chen, A. Devraj, I. Kontoyiannis, and S. P. Meyn. The ODE method for asymptotic statistics in stochastic approximation and reinforcement learning. arXiv e-prints:2110.14427, pages 1–50, 2021.
  6. V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency, Delhi, India, 2nd edition, 2021.
  7. V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000.
  8. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning. Automatica, 146:110623, 2022.
  9. K. L. Chung. On a stochastic approximation method. The Annals of Mathematical Statistics, 25(3):463–483, 1954.
  10. A. M. Devraj and S. Meyn. Zap Q-learning. In Proc. of the Intl. Conference on Neural Information Processing Systems, pages 2232–2241, 2017.
  11. T. T. Doan. Finite-time analysis and restarting scheme for linear two-time-scale stochastic approximation. SIAM Journal on Control and Optimization, 59(4):2798–2819, 2021.
  12. A. Durmus, E. Moulines, A. Naumov, and S. Samsonov. Finite-time high-probability bounds for Polyak–Ruppert averaged iterates of linear stochastic approximation. Mathematics of Operations Research, 2024.
  13. P. W. Glynn and S. P. Meyn. A Liapounov bound for solutions of the Poisson equation. Ann. Probab., 24(2):916–931, 1996.
  14. D. Huo, Y. Chen, and Q. Xie. Bias and extrapolation in Markovian linear stochastic approximation with constant stepsizes. arXiv:2210.00953 (abstract published in Proceedings of ACM SIGMETRICS, pages 81–82, 2023), 2022.
  15. C. Jin, P. Netrapalli, and M. I. Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In International Conference on Machine Learning, pages 4880–4889. PMLR, 2020.
  16. M. Kaledin, E. Moulines, A. Naumov, V. Tadic, and H.-T. Wai. Finite time analysis of linear two-timescale stochastic approximation with Markovian noise. arXiv e-prints, page arXiv:2002.01268, Feb. 2020.
  17. P. Karmakar and S. Bhatnagar. Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Math. Oper. Res., 43(1):130–151, 2018.
  18. D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, and preprint arXiv:1412.6980, 2015.
  19. V. Konda. Actor-critic algorithms. PhD thesis, Massachusetts Institute of Technology, 2002.
  20. V. Konda and J. Tsitsiklis. Actor-critic algorithms. In Proc. Advances in Neural Information Processing Systems, pages 1008–1014, 2000.
  21. I. Kontoyiannis and S. P. Meyn. Large deviations asymptotics and the spectral theory of multiplicatively regular Markov processes. Electron. J. Probab., 10(3):61–123 (electronic), 2005.
  22. C. K. Lauand and S. Meyn. Approaching quartic convergence rates for quasi-stochastic approximation with application to gradient-free optimization. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 15743–15756. Curran Associates, Inc., 2022.
  23. C. K. Lauand and S. Meyn. Markovian foundations for quasi stochastic approximation with applications to extremum seeking control. arXiv 2207.06371, 2022.
  24. C. K. Lauand and S. Meyn. The curse of memory in stochastic approximation. In Proc. IEEE Conference on Decision and Control, pages 7803–7809, 2023.
  25. C. K. Lauand and S. Meyn. Quasi-stochastic approximation: Design principles with applications to extremum seeking control. IEEE Control Systems Magazine, 43(5):111–136, Oct 2023.
  26. C. J. Li, W. Mou, M. Wainwright, and M. Jordan. ROOT-SGD: Sharp nonasymptotics and asymptotic efficiency in a single algorithm. In Conference on Learning Theory, pages 909–981. PMLR, 2022.
  27. M. Metivier and P. Priouret. Theoremes de convergence presque sure pour une classe d’algorithmes stochastiques a pas decroissants. Prob. Theory Related Fields, 74:403–428, 1987.
  28. S. Meyn. Control Systems and Reinforcement Learning. Cambridge University Press, Cambridge, 2022.
  29. S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge University Press, Cambridge, second edition, 2009. Published in the Cambridge Mathematical Library; 1993 edition online.
  30. W. Mou, C. J. Li, M. J. Wainwright, P. L. Bartlett, and M. I. Jordan. On linear stochastic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration. Conference on Learning Theory and arXiv:2004.04719, pages 2947–2997, 2020.
  31. B. T. Polyak. A new method of stochastic approximation type. Avtomatika i telemekhanika (in Russian); translated in Automat. Remote Control, 51 (1991), pages 98–107, 1990.
  32. B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, 1992.
  33. D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro process. Technical Report 781, Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY, 1988.
  34. R. Srikant and L. Ying. Finite-time error bounds for linear stochastic approximation and TD learning. In Conference on Learning Theory, pages 2803–2830, 2019.
  35. R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2nd edition, 2018.
  36. C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
  37. S. Vaswani, F. Bach, and M. Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1195–1204. PMLR, 2019.
  38. M. J. Wainwright. Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for $Q$-learning. CoRR, abs/1905.06265, 2019.
  39. F. Zarin Faizal and V. Borkar. Functional Central Limit Theorem for Two Timescale Stochastic Approximation. arXiv e-prints, page arXiv:2306.05723, June 2023.

