Taming "data-hungry" reinforcement learning? Stability in continuous state-action spaces (2401.05233v1)
Published 10 Jan 2024 in cs.LG, cs.IT, cs.SY, eess.SY, math.IT, math.OC, and stat.ML
Abstract: We introduce a novel framework for analyzing reinforcement learning (RL) in continuous state-action spaces, and use it to prove fast rates of convergence in both off-line and on-line settings. Our analysis highlights two key stability properties, relating to how changes in value functions and/or policies affect the Bellman operator and occupation measures. We argue that these properties are satisfied in many continuous state-action Markov decision processes, and demonstrate how they arise naturally when using linear function approximation methods. Our analysis offers fresh perspectives on the roles of pessimism and optimism in off-line and on-line RL, and highlights the connection between off-line RL and transfer learning.
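For context, the objects named in the abstract (the Bellman operator, occupation measures, and linear function approximation) are standard; the sketch below records their usual definitions in a discounted MDP, written in discrete notation for readability. These are generic textbook forms assumed here for illustration, not the paper's specific stability conditions or results.

```latex
% Standard definitions (assumed for illustration; the paper's exact
% stability conditions are not reproduced from the abstract).

% Bellman optimality operator for a discounted MDP with reward r,
% transition kernel P, and discount factor gamma in (0,1):
(\mathcal{T} Q)(s,a) \;=\; r(s,a) \;+\;
  \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}
  \Big[ \max_{a' \in \mathcal{A}} Q(s',a') \Big].

% Linear function approximation with feature map phi and weight vector theta:
Q_{\theta}(s,a) \;=\; \langle \phi(s,a), \theta \rangle.

% Discounted occupation measure of a policy pi started from distribution rho
% (a density/measure over states and actions in the continuous case):
d^{\pi}_{\rho}(s,a) \;=\; (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\,
  \Pr\!\big[(s_t, a_t) = (s,a) \;\big|\; s_0 \sim \rho,\; \pi \big].
```

The stability properties alluded to above concern how perturbations of the value function $Q$ or the policy $\pi$ propagate through $\mathcal{T}$ and $d^{\pi}_{\rho}$; the displayed formulas are only the standard building blocks on which such conditions are stated.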