Policy Gradient for Rectangular Robust Markov Decision Processes (2301.13589v2)
Abstract: Policy gradient methods have become a standard for training reinforcement learning agents in a scalable and efficient manner. However, they do not account for transition uncertainty, and learning robust policies can be computationally expensive. In this paper, we introduce robust policy gradient (RPG), a policy-based method that efficiently solves rectangular robust Markov decision processes (MDPs). We provide a closed-form expression for the worst-case occupation measure and, incidentally, find that the worst kernel is a rank-one perturbation of the nominal one. Combining the worst-case occupation measure with robust Q-value estimation yields an explicit form of the robust gradient. The resulting RPG can be estimated from data with the same time complexity as its non-robust counterpart, thus relieving robust policy training of the convex optimization problems that current policy gradient approaches require.
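To make the gradient structure described above concrete, here is a minimal tabular sketch in NumPy: the policy gradient is assembled from a discounted occupation measure and Q-values, evaluated once under a nominal kernel and once under a kernel obtained by a rank-one perturbation. The randomly drawn model, the perturbation direction `w`, and the magnitudes `u` are illustrative placeholders; the paper derives the actual closed-form worst case from the uncertainty set, which is not reproduced here.

```python
import numpy as np

gamma = 0.9
S, A = 4, 2                              # small tabular MDP for illustration
rng = np.random.default_rng(0)

# Nominal model: kernel P[s, a] is a distribution over next states, reward r[s, a].
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.random((S, A))
mu = np.full(S, 1.0 / S)                 # initial state distribution
pi = rng.dirichlet(np.ones(A), size=S)   # stochastic policy pi[s, a]

def policy_gradient(P):
    """Gradient of the return w.r.t. pi (direct parameterization), for a fixed kernel P."""
    P_pi = np.einsum("sa,saz->sz", pi, P)                 # state-to-state kernel under pi
    r_pi = np.einsum("sa,sa->s", pi, r)
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # value function
    q = r + gamma * P @ v                                 # Q-values Q[s, a]
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)   # discounted occupation measure
    return d[:, None] * q                                 # dJ/dpi[s, a] = d(s) * Q(s, a)

grad_nominal = policy_gradient(P)

# Hypothetical rank-one perturbation of the nominal kernel, mimicking only the
# *structure* of the worst kernel claimed in the abstract (direction w and
# magnitudes u below are made-up stand-ins for the paper's closed form).
k = rng.dirichlet(np.ones(S))                      # target next-state distribution
w = k - np.full(S, 1.0 / S)                        # zero-sum direction: rows stay stochastic
eps = 0.5 * P.min() / (np.abs(w).max() + 1e-12)    # small enough to keep rows nonnegative
u = eps * rng.random((S, A))                       # per state-action magnitude
P_worst = P + u[:, :, None] * w[None, None, :]     # rank-one perturbation of P

grad_robust = policy_gradient(P_worst)             # same cost as the nominal gradient
```

Once a worst-case kernel is available in closed form, the robust gradient costs no more than re-running the nominal gradient computation under the perturbed kernel, which is the complexity advantage the abstract claims over convex-optimization-based robust training.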