Stochastic Gradient Succeeds for Bandits (2402.17235v1)

Published 27 Feb 2024 in cs.LG

Abstract: We show that the \emph{stochastic gradient} bandit algorithm converges to a \emph{globally optimal} policy at an $O(1/t)$ rate, even with a \emph{constant} step size. Remarkably, global convergence of the stochastic gradient bandit algorithm has not been previously established, even though it is an old algorithm known to be applicable to bandits. The new result is achieved by establishing two novel technical findings: first, the noise of the stochastic updates in the gradient bandit algorithm satisfies a ``strong growth condition'' property, where the variance diminishes whenever progress becomes small, implying that additional noise control via diminishing step sizes is unnecessary; second, a form of ``weak exploration'' is automatically achieved through the stochastic gradient updates, since they prevent the action probabilities from decaying faster than $O(1/t)$, thus ensuring that every action is sampled infinitely often with probability $1$. These two findings can be used to show that the stochastic gradient update is already ``sufficient'' for bandits in the sense that exploration versus exploitation is automatically balanced in a manner that ensures almost sure convergence to a global optimum. These novel theoretical findings are further verified by experimental results.
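
To make the setting concrete, below is a minimal sketch (not the authors' code) of the softmax stochastic gradient bandit update the abstract refers to: a softmax policy over action preferences, updated with the unbiased likelihood-ratio (REINFORCE-style) gradient estimate and a constant step size. The arm means, the Gaussian reward noise, the step size, and the step count are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over the action preferences theta."""
    z = theta - theta.max()
    e = np.exp(z)
    return e / e.sum()

def gradient_bandit(true_means, eta=0.1, steps=50_000, seed=0):
    """Stochastic gradient bandit with a softmax policy and constant step size.

    true_means, eta, the Gaussian reward noise, and the step count are
    illustrative assumptions, not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(true_means, dtype=float)
    K = means.size
    theta = np.zeros(K)                 # action preferences (softmax logits)
    for _ in range(steps):
        pi = softmax(theta)
        a = rng.choice(K, p=pi)         # sample an action from the current policy
        r = means[a] + rng.normal()     # observe a noisy reward for that action
        # Unbiased stochastic gradient of the expected reward w.r.t. theta:
        # (one-hot of the sampled action minus the policy) scaled by the reward.
        theta += eta * (np.eye(K)[a] - pi) * r
    return softmax(theta)

if __name__ == "__main__":
    final_policy = gradient_bandit([0.2, 0.5, 0.8])
    print(final_policy)  # mass should concentrate on the best arm (index 2)
```

The update uses the score function of the softmax, $\nabla_\theta \log \pi_\theta(a) = e_a - \pi_\theta$, scaled by the observed reward and with no baseline or diminishing step size, which is the regime the paper's constant-step-size convergence result concerns.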

