Central Limit Theorem for Two-Timescale Stochastic Approximation with Markovian Noise: Theory and Applications (2401.09339v2)

Published 17 Jan 2024 in stat.ML, cs.LG, and math.OC

Abstract: Two-timescale stochastic approximation (TTSA) is among the most general frameworks for iterative stochastic algorithms. It covers well-known stochastic optimization methods, including SGD variants and methods designed for bilevel or minimax problems, as well as reinforcement learning algorithms such as the family of gradient-based temporal difference (GTD) algorithms. In this paper, we conduct an in-depth asymptotic analysis of TTSA under controlled Markovian noise via a central limit theorem (CLT), uncovering the coupled dynamics of TTSA induced by the underlying Markov chain, a regime not addressed by previous CLT results for TTSA, which assume martingale difference noise. Building upon our CLT, we extend the reach of efficient sampling strategies from vanilla SGD to the wider TTSA context in distributed learning, thus broadening the scope of Hu et al. (2022). In addition, we leverage our CLT to deduce the statistical properties of GTD algorithms with nonlinear function approximation using Markovian samples, and we show that they share identical asymptotic performance, a perspective not evident from current finite-time bounds.
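
In the TTSA template the paper analyzes, a fast iterate and a slow iterate are updated jointly with step sizes that decay at different rates, and the noise comes from an underlying Markov chain rather than an i.i.d. or martingale difference sequence. The sketch below illustrates that structure on a toy problem; the linear drift, the two-state chain, and the step-size exponents are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Minimal sketch of two-timescale stochastic approximation (TTSA) with
# Markovian noise. Everything concrete here (the toy drift, the two-state
# chain, the step-size exponents) is an assumption for illustration only.

rng = np.random.default_rng(0)

# Two-state Markov chain driving the noise; its stationary distribution is
# uniform, so the state-dependent noise below has mean zero on average.
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])
state = 0

x, y = 1.0, 1.0  # slow iterate x, fast iterate y
for n in range(1, 100_000):
    beta = n ** -0.9   # slow step size
    gamma = n ** -0.6  # fast step size; beta / gamma -> 0

    noise = 1.0 if state == 0 else -1.0  # correlated, state-dependent noise

    # The fast iterate tracks the moving target y*(x) = x; the slow iterate
    # then sees (approximately) the drift -x and is driven toward 0.
    y += gamma * ((x - y) + noise)
    x += beta * (-y + noise)

    state = rng.choice(2, p=P[state])  # advance the Markov chain

print(f"x = {x:.4f}, y = {y:.4f}")  # both end up near 0
```

The point of the example is the coupling: the fast iterate chases a target set by the slow one, and the noise at each step depends on the chain's current state, so consecutive noise terms are correlated. This Markovian coupling is what the paper's CLT accounts for and what earlier CLTs under martingale difference noise do not capture.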

References (89)
  1. SGD with shuffling: Optimal rates without component convexity and large epoch requirements. In Advances in Neural Information Processing Systems, volume 33, pages 17526–17535.
  2. Non-backtracking random walks mix faster. Communications in Contemporary Mathematics, 9(04):585–603.
  3. Markovian stochastic approximation with expanding projections. Bernoulli, pages 545–585.
  4. Dynamic social learning under graph constraints. IEEE Transactions on Control of Network Systems, 9(3):1435–1446.
  5. Analysis of a target-based actor-critic algorithm with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pages 991–1040. PMLR.
  6. Comparing mixing times on sparse random graphs. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1734–1740. SIAM.
  7. Strongly vertex-reinforced-random-walk on the complete graph. arXiv preprint arXiv:1208.6375.
  8. Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media.
  9. Borkar, V. (2022). Stochastic Approximation: A Dynamical Systems Viewpoint: Second Edition. Texts and Readings in Mathematics. Hindustan Book Agency.
  10. Concentration bounds for two time scale stochastic approximation. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 504–511. IEEE.
  11. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311.
  12. Brémaud, P. (2013). Markov chains: Gibbs fields, Monte Carlo simulation, and queues, volume 31. Springer Science & Business Media.
  13. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27.
  14. Nonlinear dynamical systems and control: A Lyapunov-based approach. Princeton University Press.
  15. Chen, H.-F. (2006). Stochastic approximation and its applications, volume 64. Springer Science & Business Media.
  16. Explicit mean-square error bounds for Monte-Carlo and linear stochastic approximation. In International Conference on Artificial Intelligence and Statistics, pages 4173–4183. PMLR.
  17. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning. Automatica, 146:110623.
  18. SGDA with shuffling: Faster convergence for nonconvex-PŁ minimax optimization. In The Eleventh International Conference on Learning Representations.
  19. A tale of two-timescale reinforcement learning with the tightest finite-time bound. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3701–3708.
  20. Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. In Conference on Learning Theory, pages 1199–1233. PMLR.
  21. Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research, 15:809–883.
  22. Sampling without replacement leads to faster rates in finite-sum minimax optimization. In Advances in Neural Information Processing Systems.
  23. Davis, B. (1970). On the integrability of the martingale square function. Israel Journal of Mathematics, 8:187–190.
  24. Delyon, B. (2000). Stochastic approximation with decreasing gain: Convergence and asymptotic theory. Technical report, Université de Rennes.
  25. Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics, pages 94–128.
  26. Finite-time analysis of distributed TD(0) with linear function approximation on multi-agent reinforcement learning. In International Conference on Machine Learning, pages 1626–1635. PMLR.
  27. Doan, T. T. (2021a). Finite-time analysis and restarting scheme for linear two-time-scale stochastic approximation. SIAM Journal on Control and Optimization, 59(4):2798–2819.
  28. Doan, T. T. (2021b). Finite-time convergence rates of nonlinear two-time-scale stochastic approximation under Markovian noise. arXiv preprint arXiv:2104.01627.
  29. Doan, T. T. (2022). Nonlinear two-time-scale stochastic approximation convergence and finite-time performance. IEEE Transactions on Automatic Control.
  30. Self-repellent random walks on general graphs: Achieving minimal sampling variance via nonlinear Markov chains. In International Conference on Machine Learning. PMLR.
  31. Duflo, M. (1996). Algorithmes stochastiques, volume 23. Springer.
  32. Even, M. (2023). Stochastic gradient descent under Markovian sampling schemes. In International Conference on Machine Learning.
  33. Fort, G. (2015). Central limit theorems for stochastic approximation with controlled Markov chain dynamics. ESAIM: Probability and Statistics, 19:60–80.
  34. Convergence of Markovian stochastic approximation with discontinuous dynamics. SIAM Journal on Control and Optimization, 54(2):866–893.
  35. Stochastic heavy ball. Electronic Journal of Statistics, 12:461–529.
  36. Gao, H. (2022). Decentralized stochastic gradient descent ascent for finite-sum minimax problems. arXiv preprint arXiv:2212.02724.
  37. On the convergence of distributed stochastic bilevel optimization algorithms over a network. In International Conference on Artificial Intelligence and Statistics, pages 9238–9281. PMLR.
  38. Gradient temporal-difference learning with regularized corrections. In International Conference on Machine Learning, pages 3524–3534. PMLR.
  39. SGD: General analysis and improved rates. In International Conference on Machine Learning, pages 5200–5209. PMLR.
  40. Martingale Limit Theory and Its Application. Communication and Behavior. Elsevier Science.
  41. Hendrikx, H. (2023). A principled framework for the design and analysis of token algorithms. In International Conference on Artificial Intelligence and Statistics, pages 470–489. PMLR.
  42. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30.
  43. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM Journal on Optimization, 33(1):147–180.
  44. Efficiency ordering of stochastic gradient descent. In Advances in Neural Information Processing Systems.
  45. Kinetic Langevin MCMC sampling without gradient Lipschitz continuity: The strongly convex case. arXiv preprint arXiv:2301.08039.
  46. Finite time analysis of linear two-timescale stochastic approximation with Markovian noise. In Conference on Learning Theory, pages 2144–2203. PMLR.
  47. Non-asymptotic analysis of biased stochastic approximation scheme. In Conference on Learning Theory, pages 1944–1974. PMLR.
  48. Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Mathematics of Operations Research, 43(1):130–151.
  49. Finite sample analysis of two-time-scale natural actor-critic algorithm. IEEE Transactions on Automatic Control.
  50. Convergence rate of linear two-time-scale stochastic approximation. The Annals of Applied Probability, 14(2):796–819.
  51. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media.
  52. Beyond random walk and Metropolis-Hastings samplers: Why you should not backtrack for unbiased graph sampling. ACM SIGMETRICS Performance Evaluation Review, 40(1):319–330.
  53. SNAP Datasets: Stanford large network dataset collection.
  54. Sharp high-probability sample complexities for policy evaluation with linear function approximation. arXiv preprint arXiv:2305.19001.
  55. Provable Bregman-divergence based methods for nonconvex and non-Lipschitz problems. arXiv preprint arXiv:1904.09712.
  56. Revisiting the central limit theorems for the SGD-type methods. arXiv preprint arXiv:2207.11755.
  57. Online statistical inference for nonlinear stochastic approximation with Markovian data. arXiv preprint arXiv:2302.07690.
  58. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, pages 6083–6093. PMLR.
  59. Two-timescale stochastic dispatch of smart distribution grids. IEEE Transactions on Smart Grid, 9(5):4282–4292.
  60. Variance-reduced off-policy TDC learning: Non-asymptotic convergence analysis. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 14796–14806.
  61. Convergent temporal-difference learning with arbitrary smooth function approximation. In Proceedings of the 22nd International Conference on Neural Information Processing Systems, pages 1204–1212.
  62. Meyn, S. (2022). Control systems and reinforcement learning. Cambridge University Press.
  63. Mira, A. (2001). Ordering and improving the performance of Monte Carlo Markov chains. Statistical Science, pages 340–350.
  64. The compact law of the iterated logarithm for multivariate stochastic approximation algorithms. Stochastic Analysis and Applications, 23(1):181–203.
  65. Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms. Annals of Applied Probability, 16(3):1671–1702.
  66. On linear stochastic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration. In Conference on Learning Theory, pages 2947–2997. PMLR.
  67. Neal, R. M. (2004). Improving asymptotic variance of MCMC estimators: Non-reversible chains are better. Technical report.
  68. Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation. In International Conference on Artificial Intelligence and Statistics, pages 5438–5448. PMLR.
  69. Pelletier, M. (1998). On the almost sure asymptotic behaviour of stochastic algorithms. Stochastic Processes and their Applications, 78(2):217–244.
  70. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855.
  71. Two-timescale algorithms for learning Nash equilibria in general-sum stochastic games. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1371–1379.
  72. Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering.
  73. How good is SGD with random shuffling? In Conference on Learning Theory, pages 3250–3284. PMLR.
  74. Finite-time error bounds for linear stochastic approximation and TD learning. In Conference on Learning Theory, pages 2803–2830. PMLR.
  75. Reinforcement learning: An introduction. MIT Press.
  76. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993–1000.
  77. FedNest: Federated bilevel, minimax, and compositional optimization. In International Conference on Machine Learning, pages 21146–21179. PMLR.
  78. SMG: A shuffling gradient-based method with momentum. In International Conference on Machine Learning, pages 10379–10389. PMLR.
  79. Decentralized learning with random walks and communication-efficient adaptive optimization. In Workshop on Federated Learning: Recent Advances and New Challenges (in Conjunction with NeurIPS 2022).
  80. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690.
  81. Provably efficient neural GTD algorithm for off-policy learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 10431–10442.
  82. Decentralized TD tracking with linear function approximation and its finite-time analysis. In Advances in Neural Information Processing Systems, volume 33, pages 13762–13772.
  83. Non-asymptotic analysis for two time-scale TDC with general smooth function approximation. In Advances in Neural Information Processing Systems.
  84. Convergence guarantees for stochastic subgradient methods in nonsmooth nonconvex optimization. arXiv preprint arXiv:2307.10053.
  85. Sample complexity bounds for two timescale value-based reinforcement learning algorithms. In International Conference on Artificial Intelligence and Statistics, pages 811–819. PMLR.
  86. Stochastic recursive inclusions in two timescales with nonadditive iterate-dependent Markov noise. Mathematics of Operations Research, 45(4):1405–1444.
  87. Two-timescale voltage control in distribution grids using deep reinforcement learning. IEEE Transactions on Smart Grid, 11(3):2313–2323.
  88. A two-time-scale stochastic optimization framework with applications in control and reinforcement learning. arXiv preprint arXiv:2109.14756.
  89. First-order algorithms without Lipschitz gradient: A sequential local optimization approach. arXiv preprint arXiv:2010.03194.