Safe Reinforcement Learning with Instantaneous Constraints: The Role of Aggressive Exploration (2312.14470v1)
Abstract: This paper studies safe Reinforcement Learning (safe RL) with linear function approximation and under hard instantaneous constraints, where unsafe actions must be avoided at each step. Existing studies have considered safe RL with hard instantaneous constraints, but their approaches rely on several key assumptions: $(i)$ the RL agent knows a safe action set for {\it every} state or knows a {\it safe graph} in which all the state-action-state triples are safe, and $(ii)$ the constraint/cost functions are {\it linear}. In this paper, we consider safe RL with instantaneous hard constraints without assumption $(i)$ and generalize $(ii)$ to cost functions that belong to a Reproducing Kernel Hilbert Space (RKHS). Our proposed algorithm, LSVI-AE, achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^4K})$ regret and $\tilde{\mathcal{O}}(H\sqrt{dK})$ hard constraint violation when the cost function is linear, and $\mathcal{O}(H\gamma_K\sqrt{K})$ hard constraint violation when the cost function belongs to an RKHS. Here $K$ is the learning horizon, $H$ is the length of each episode, and $\gamma_K$ is the information gain w.r.t. the kernel used to approximate cost functions. Our results achieve the optimal dependency on the learning horizon $K$, matching the lower bound we provide in this paper and demonstrating the efficiency of LSVI-AE. Notably, the design of our approach encourages aggressive policy exploration, providing a unique perspective on safe RL with general cost functions and no prior knowledge of safe actions, which may be of independent interest.
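To make the algorithmic template concrete, the sketch below illustrates the generic least-squares value iteration (LSVI) backbone that algorithms of this family build on: a ridge-regression estimate from feature-target pairs plus an elliptical confidence width, used once to form an optimistic Q-value and once to form a confidence interval on the cost that filters actions against a threshold `tau`. This is a minimal illustration under assumed names and parameters (`phi` feature matrices, bonus scale `beta`, regularizer `lam`, threshold `tau`), not the paper's LSVI-AE; in particular, whether the cost filter uses the lower (aggressive) or upper (conservative) end of the interval is a design choice, and the abstract indicates LSVI-AE favors the aggressive side.

```python
import numpy as np

def ridge_with_width(Phi, y, phi_query, lam=1.0, beta=1.0):
    """Ridge regression on observed (feature, target) pairs, plus the
    elliptical confidence width beta * ||phi||_{Lambda^{-1}} at each query."""
    d = Phi.shape[1]
    Lam = Phi.T @ Phi + lam * np.eye(d)        # regularized Gram matrix
    w = np.linalg.solve(Lam, Phi.T @ y)        # ridge point estimate
    Lam_inv = np.linalg.inv(Lam)
    width = beta * np.sqrt(np.einsum("ij,jk,ik->i", phi_query, Lam_inv, phi_query))
    return phi_query @ w, width

def select_action(phi_actions, Phi_r, y_r, Phi_c, y_c, tau,
                  aggressive=True, lam=1.0, beta=1.0):
    """Filter actions by a cost confidence interval against the threshold tau,
    then pick the action with the largest optimistic value estimate.
    All names/parameters here are illustrative assumptions, not the paper's."""
    q_hat, q_width = ridge_with_width(Phi_r, y_r, phi_actions, lam, beta)
    c_hat, c_width = ridge_with_width(Phi_c, y_c, phi_actions, lam, beta)
    # Aggressive: trust the optimistic (lower) end of the cost interval, so
    # more actions pass the filter; conservative: use the upper end.
    c_bound = c_hat - c_width if aggressive else c_hat + c_width
    feasible = np.flatnonzero(c_bound <= tau)
    if feasible.size == 0:
        feasible = np.arange(phi_actions.shape[0])  # fallback, illustration only
    return feasible[np.argmax(q_hat[feasible] + q_width[feasible])]

# Toy usage: d = 2 features, 5 observed transitions, 3 candidate actions.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 2))
a = select_action(rng.normal(size=(3, 2)), Phi, rng.normal(size=5),
                  Phi, rng.uniform(size=5), tau=0.5)
print("chosen action index:", a)
```

In a full episodic algorithm, the regression targets for the Q-estimate would be Bellman backups of the next step's value estimate, computed backwards over steps $h = H, \dots, 1$; the sketch compresses that recursion into a single regression to keep the template visible.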
- Improved Algorithms for Linear Stochastic Bandits. In Advances in Neural Information Processing Systems (NeurIPS), volume 24.
- Linear stochastic bandits under safety constraints. In Advances in Neural Information Processing Systems (NeurIPS), 9256–9266.
- Safe reinforcement learning with linear function approximation. In Int. Conf. Machine Learning (ICML), 243–253. PMLR.
- Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1): 3–20.
- Achieving zero constraint violation for constrained reinforcement learning via primal-dual approach. In AAAI Conf. Artificial Intelligence, 3682–3689.
- Safe exploration for constrained reinforcement learning with provable guarantees. arXiv preprint arXiv:2112.00885.
- Efficient algorithms for budget-constrained Markov decision processes. IEEE Transactions on Automatic Control, 59(10): 2813–2817.
- Prediction, Learning, and Games. Cambridge University Press.
- Learning Infinite-horizon Average-reward Markov Decision Process with Constraints. In Int. Conf. Machine Learning (ICML), 3246–3270. PMLR.
- On kernelized multi-armed bandits. In Int. Conf. Machine Learning (ICML), 844–853. PMLR.
- Provably Efficient Safe Exploration via Primal-Dual Policy Optimization. In Int. Conf. Artificial Intelligence and Statistics (AISTATS), volume 130, 3304–3312. PMLR.
- Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, 8378–8390. Curran Associates, Inc.
- Convergence and optimality of policy gradient primal-dual method for constrained Markov decision processes. In American Control Conference (ACC), 2851–2856. IEEE.
- Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189.
- Provably Efficient Model-Free Constrained RL with Linear Function Approximation. In Advances in Neural Information Processing Systems (NeurIPS).
- Online Convex Optimization with Hard Constraints: Towards the Best of Two Worlds and Beyond. In Advances in Neural Information Processing Systems (NeurIPS).
- Rectified Pessimistic-Optimistic Learning for Stochastic Continuum-armed Bandit with Constraints. arXiv preprint arXiv:2211.14720.
- Nearly minimax optimal reinforcement learning for linear Markov decision processes. In Int. Conf. Machine Learning (ICML), 12790–12822. PMLR.
- Nearly minimax optimal reinforcement learning with linear function approximation. In Int. Conf. Machine Learning (ICML), 8971–9019. PMLR.
- Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, 13406–13418.
- Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, 2137–2143. PMLR.
- Bandit Algorithms. Cambridge University Press.
- Non-stationary bandits with knapsacks. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, 16522–16532.
- Learning policies with zero or bounded constraint violation for constrained MDPs. In Advances in Neural Information Processing Systems (NeurIPS), volume 34.
- An Efficient Pessimistic-Optimistic Algorithm for Stochastic Linear Bandits with General Constraints. In Advances in Neural Information Processing Systems (NeurIPS).
- Stochastic Bandits with Linear Constraints. In Int. Conf. Artificial Intelligence and Statistics (AISTATS).
- A Near-Optimal Algorithm for Safe Reinforcement Learning Under Instantaneous Hard Constraints. In Int. Conf. Machine Learning (ICML). PMLR.
- Mastering the game of Go without human knowledge. Nature, 550(7676): 354–359.
- Learning in Markov decision processes under constraints. arXiv preprint arXiv:2002.12435.
- Safe exploration in finite Markov decision processes with Gaussian processes. In Advances in Neural Information Processing Systems (NeurIPS), volume 29.
- Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.
- Safe exploration and optimization of constrained MDPs using Gaussian processes. In AAAI Conf. Artificial Intelligence, volume 32, 6548–6555. ISBN 978-1-57735-800-8.
- Provably Efficient Model-Free Algorithms for Non-stationary CMDPs. In Int. Conf. Artificial Intelligence and Statistics (AISTATS), 6527–6570. PMLR.
- A Provably-Efficient Model-Free Algorithm for Infinite-Horizon Average-Reward Constrained Markov Decision Processes. In AAAI Conf. Artificial Intelligence.
- Triple-Q: A Model-Free Algorithm for Constrained Reinforcement Learning with Sublinear Regret and Zero Constraint Violation. In Int. Conf. Artificial Intelligence and Statistics (AISTATS).
- Budget constrained bidding by model-free reinforcement learning in display advertising. In Proc. ACM Int. Conf. Information and Knowledge Management (CIKM), 1443–1451.
- On function approximation in reinforcement learning: Optimism in the face of large state spaces. arXiv preprint arXiv:2011.04622.
- Regret and cumulative constraint violation analysis for online convex optimization with long term constraints. In Int. Conf. Machine Learning (ICML), 11998–12008. PMLR.
- Regret and cumulative constraint violation analysis for distributed online constrained convex optimization. IEEE Transactions on Automatic Control.
- A Low Complexity Algorithm with $O(\sqrt{T})$ Regret and $O(1)$ Constraint Violations for Online Convex Optimization with Long Term Constraints. Journal of Machine Learning Research, 21(1): 1–24.
- Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. In Conference on Learning Theory, 4532–4576. PMLR.
- Honghao Wei
- Xin Liu
- Lei Ying