
Taming "data-hungry" reinforcement learning? Stability in continuous state-action spaces (2401.05233v1)

Published 10 Jan 2024 in cs.LG, cs.IT, cs.SY, eess.SY, math.IT, math.OC, and stat.ML

Abstract: We introduce a novel framework for analyzing reinforcement learning (RL) in continuous state-action spaces, and use it to prove fast rates of convergence in both off-line and on-line settings. Our analysis highlights two key stability properties, relating to how changes in value functions and/or policies affect the Bellman operator and occupation measures. We argue that these properties are satisfied in many continuous state-action Markov decision processes, and demonstrate how they arise naturally when using linear function approximation methods. Our analysis offers fresh perspectives on the roles of pessimism and optimism in off-line and on-line RL, and highlights the connection between off-line RL and transfer learning.
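
As background (standard definitions, not taken from the paper itself), the Bellman optimality operator referenced above, together with a generic linear function-approximation parameterization, can be written as

$$(\mathcal{T}Q)(s,a) \;=\; r(s,a) \;+\; \gamma\,\mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\left[\max_{a'} Q(s',a')\right], \qquad Q_\theta(s,a) \;=\; \langle \phi(s,a),\, \theta \rangle,$$

where $\phi$ is a feature map and $\theta$ a weight vector. The stability properties studied in the paper concern how perturbations of the value function $Q$ (or of the induced policy) propagate through this operator and the resulting occupation measures.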
