A Primal-Dual Algorithm for Offline Constrained Reinforcement Learning with Linear MDPs (2402.04493v2)
Abstract: We study offline reinforcement learning (RL) with linear MDPs under the infinite-horizon discounted setting, which aims to learn a policy that maximizes the expected discounted cumulative reward using a pre-collected dataset. Existing algorithms for this setting either require a uniform data coverage assumption or are computationally inefficient for finding an $\epsilon$-optimal policy with $O(\epsilon^{-2})$ sample complexity. In this paper, we propose a primal-dual algorithm for offline RL with linear MDPs in the infinite-horizon discounted setting. Our algorithm is the first computationally efficient algorithm in this setting that achieves a sample complexity of $O(\epsilon^{-2})$ under a partial data coverage assumption. Our work improves upon a recent work that requires $O(\epsilon^{-4})$ samples. Moreover, we extend our algorithm to the offline constrained RL setting, which enforces constraints on additional reward signals.
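For context, primal-dual methods of this kind are typically derived from the linear-programming view of the discounted MDP; the following is a minimal sketch of the resulting constrained saddle-point objective, using generic notation assumed here rather than the paper's own. Here $\mu$ is a state-action occupancy measure, $v$ a value-function variable, $\lambda \ge 0$ a multiplier on a single constraint reward $c$ with threshold $b$, $\nu_0$ the initial state distribution, $P$ the transition kernel, and $E$ the map from a state-action occupancy to its state marginal:

$$
\begin{aligned}
&\max_{\mu \ge 0}\ \langle \mu, r\rangle \quad \text{s.t.}\quad E^{\top}\mu = (1-\gamma)\,\nu_0 + \gamma P^{\top}\mu, \qquad \langle \mu, c\rangle \ge b,\\
&L(\mu, v, \lambda) \;=\; \langle \mu,\ r + \lambda c\rangle \;-\; \lambda b \;+\; \big\langle v,\ (1-\gamma)\,\nu_0 + \gamma P^{\top}\mu - E^{\top}\mu \big\rangle.
\end{aligned}
$$

A primal-dual scheme of this flavor alternates (stochastic) ascent steps in $\mu$ with descent steps in $(v, \lambda)$; in the linear-MDP case, $v$ and the occupancy-measure parameterization are restricted to the span of the known features, which is what makes the updates computationally tractable.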