Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization (2309.01107v2)

Published 3 Sep 2023 in cs.LG

Abstract: In robust Markov decision processes (RMDPs), it is assumed that the reward and the transition dynamics lie in a given uncertainty set. By targeting maximal return under the most adversarial model from that set, RMDPs address performance sensitivity to misspecified environments. Yet, to preserve computational tractability, the uncertainty set is traditionally independently structured for each state. This so-called rectangularity condition is solely motivated by computational concerns. As a result, it lacks a practical incentive and may lead to overly conservative behavior. In this work, we study coupled reward RMDPs where the transition kernel is fixed, but the reward function lies within an $\alpha$-radius from a nominal one. We draw a direct connection between this type of non-rectangular reward-RMDPs and applying policy visitation frequency regularization. We introduce a policy-gradient method and prove its convergence. Numerical experiments illustrate the learned policy's robustness and its less conservative behavior when compared to rectangular uncertainty.
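As a rough illustration of the connection the abstract describes (a minimal sketch, not the paper's exact formulation): in a tabular MDP the return is linear in the reward, $J(\pi, r) = \langle d_\pi, r \rangle$, where $d_\pi$ is the policy's discounted state-action visitation frequency. If the reward is only known to lie in a coupled ball $\{r : \|r - r_0\|_q \le \alpha\}$ around a nominal $r_0$, the worst case over that ball reduces by Hölder duality to the nominal return minus $\alpha$ times the dual norm of $d_\pi$, i.e. a visitation-frequency regularizer. The sketch below assumes this tabular setting; the function and variable names (`occupancy`, `robust_return`, `P`, `r0`, `mu0`) are illustrative and not taken from the authors' code.

```python
import numpy as np

def occupancy(P, pi, mu0, gamma):
    """Discounted state-action visitation frequencies d_pi(s, a).

    P:   (S, A, S) transition kernel
    pi:  (S, A) policy, rows sum to 1
    mu0: (S,) initial state distribution
    """
    S, A = pi.shape
    # State-to-state kernel under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    P_pi = np.einsum('sa,saz->sz', pi, P)
    # d_pi solves (I - gamma * P_pi^T) d = mu0  (unnormalized discounted visitation)
    d_state = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu0)
    return d_state[:, None] * pi  # d_pi(s, a) = d_pi(s) * pi(a|s)

def robust_return(P, r0, pi, mu0, gamma, alpha, q=2.0):
    """Worst-case return over rewards in an l_q ball of radius alpha around r0.

    By Holder duality:
        min_{||r - r0||_q <= alpha} <d_pi, r> = <d_pi, r0> - alpha * ||d_pi||_p,
    with 1/p + 1/q = 1, i.e. nominal return minus a visitation-frequency penalty.
    """
    d = occupancy(P, pi, mu0, gamma)
    p = np.inf if q == 1.0 else q / (q - 1.0)  # dual exponent
    return float(np.sum(d * r0) - alpha * np.linalg.norm(d.ravel(), ord=p))
```

The penalty term makes the trade-off explicit: the adversary hurts most where visitation is concentrated, so a robust policy is pushed toward spreading its visitation frequencies, which is the regularization effect the paper exploits in its policy-gradient method.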

Authors (6)
  1. Uri Gadot (6 papers)
  2. Esther Derman (13 papers)
  3. Navdeep Kumar (11 papers)
  4. Maxence Mohamed Elfatihi (2 papers)
  5. Kfir Levy (6 papers)
  6. Shie Mannor (228 papers)
Citations (4)
