Policy Gradient for Rectangular Robust Markov Decision Processes (2301.13589v2)

Published 31 Jan 2023 in cs.LG and cs.AI

Abstract: Policy gradient methods have become a standard for training reinforcement learning agents in a scalable and efficient manner. However, they do not account for transition uncertainty, whereas learning robust policies can be computationally expensive. In this paper, we introduce robust policy gradient (RPG), a policy-based method that efficiently solves rectangular robust Markov decision processes (MDPs). We provide a closed-form expression for the worst occupation measure. Incidentally, we find that the worst kernel is a rank-one perturbation of the nominal. Combining the worst occupation measure with a robust Q-value estimation yields an explicit form of the robust gradient. Our resulting RPG can be estimated from data with the same time complexity as its non-robust equivalent. Hence, it relieves the computational burden of convex optimization problems required for training robust policies by current policy gradient approaches.
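The abstract states the result only at a high level; the paper's exact expressions for the worst-case kernel and the robust Q-function are not reproduced here. As a point of reference, the sketch below computes the exact (non-robust) policy gradient of a tabular softmax policy from the two ingredients the abstract names, the occupation measure and the Q-values. Per the abstract, the robust gradient would be obtained by evaluating the same expression with the worst-case kernel (a rank-one perturbation of the nominal kernel) and a robust Q-value estimate, at the same time complexity. All names below (`policy_gradient_tabular`, `P`, `r`, `mu`) are illustrative, not taken from the paper.

```python
import numpy as np

def policy_gradient_tabular(theta, P, r, mu, gamma):
    """Exact policy gradient for a tabular MDP with a softmax policy.

    theta: (S, A) logits of the softmax policy
    P:     (S, A, S) transition kernel P[s, a, s']
    r:     (S, A) rewards
    mu:    (S,) initial state distribution
    gamma: discount factor in (0, 1)
    Returns dJ/dtheta with the same shape as theta.
    """
    S, A = theta.shape
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)                       # softmax policy pi(a|s)

    P_pi = np.einsum("sa,sat->st", pi, P)                      # state-to-state kernel under pi
    r_pi = (pi * r).sum(axis=1)                                # expected one-step reward per state

    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)        # V^pi
    q = r + gamma * (P @ v)                                     # Q^pi(s,a) = r(s,a) + gamma * sum_s' P v
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T,
                        (1.0 - gamma) * mu)                     # normalized occupation measure

    # Policy-gradient theorem with a softmax parameterization:
    # dJ/dtheta[s, a] = d(s)/(1 - gamma) * pi(a|s) * (Q(s,a) - V(s)).
    adv = q - v[:, None]
    return (d[:, None] * pi * adv) / (1.0 - gamma)
```

In the robust setting described by the abstract, `P` (and hence `q`, `v`, and `d`) would be replaced by their worst-case counterparts for the given uncertainty set; since the worst kernel is a rank-one perturbation of the nominal, these quantities remain as cheap to evaluate as in the sketch above.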
