First-order Policy Optimization for Robust Markov Decision Process (2209.10579v2)

Published 21 Sep 2022 in cs.LG, cs.AI, and math.OC

Abstract: We consider the problem of solving robust Markov decision processes (MDPs), which involves a set of discounted, finite state, finite action space MDPs with uncertain transition kernels. The goal of planning is to find a robust policy that optimizes the worst-case values against the transition uncertainties, and thus encompasses standard MDP planning as a special case. For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several structural observations on the robust objective, which facilitate the development of a policy-based first-order method, namely the robust policy mirror descent (RPMD). An $\mathcal{O}(\log(1/\epsilon))$ iteration complexity for finding an $\epsilon$-optimal policy is established with linearly increasing stepsizes. We further develop a stochastic variant of the robust policy mirror descent method, named SRPMD, for the setting where first-order information is only available through online interactions with the nominal environment. We show that the optimality gap converges linearly up to the noise level, and consequently establish an $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity by developing a temporal difference learning method for policy evaluation. Both iteration and sample complexities are also discussed for RPMD with a constant stepsize. To the best of our knowledge, all the aforementioned results appear to be new for policy-based first-order methods applied to the robust MDP problem.
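
The abstract describes RPMD only at a high level, so the tabular sketch below is a rough illustration of the general idea rather than the paper's algorithm: it assumes a hypothetical per-(s,a) ℓ1-ball uncertainty set around a nominal transition kernel, approximates the inner worst-case minimization with a simple greedy mass-shifting heuristic, and applies a KL-mirror-map (multiplicative-weights) policy update with a geometrically growing stepsize as a stand-in for the paper's increasing-stepsize schedule. All function names and parameters are illustrative.

```python
import numpy as np

def worst_case_kernel(P_nom, V, radius):
    """Heuristic adversarial kernel inside a per-(s,a) ell_1 ball of the
    given radius around the nominal kernel P_nom: shift probability mass
    from high-value successor states toward the lowest-value one."""
    nS, nA, _ = P_nom.shape
    P_worst = P_nom.copy()
    worst_next = np.argmin(V)
    for s in range(nS):
        for a in range(nA):
            p = P_worst[s, a].copy()
            budget = radius / 2.0          # moving delta changes ell_1 distance by 2*delta
            for s2 in np.argsort(V)[::-1]:  # highest-value states first
                if s2 == worst_next or budget <= 0:
                    continue
                delta = min(p[s2], budget)
                p[s2] -= delta
                p[worst_next] += delta
                budget -= delta
            P_worst[s, a] = p
    return P_worst

def robust_policy_eval(pi, P_nom, R, gamma, radius, iters=200):
    """Estimate the worst-case value of policy pi by alternating value
    backups with the adversarial kernel choice."""
    nS, nA, _ = P_nom.shape
    Q = np.zeros((nS, nA))
    V = np.zeros(nS)
    for _ in range(iters):
        P = worst_case_kernel(P_nom, V, radius)
        Q = R + gamma * (P @ V)            # shape (nS, nA)
        V = np.sum(pi * Q, axis=1)
    return Q, V

def rpmd_sketch(P_nom, R, gamma=0.9, radius=0.1, iters=100, eta0=1.0, growth=1.5):
    """Policy mirror descent with the KL (entropy) mirror map:
    pi_{k+1}(.|s) ∝ pi_k(.|s) * exp(eta_k * Q_rob(s, .)),
    with an increasing stepsize eta_k."""
    nS, nA, _ = P_nom.shape
    pi = np.full((nS, nA), 1.0 / nA)       # start from the uniform policy
    eta = eta0
    for _ in range(iters):
        Q, _ = robust_policy_eval(pi, P_nom, R, gamma, radius)
        logits = np.log(pi) + eta * Q
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
        eta *= growth
    return pi

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    nS, nA = 4, 2
    P_nom = rng.dirichlet(np.ones(nS), size=(nS, nA))  # nominal kernel, shape (nS, nA, nS)
    R = rng.uniform(size=(nS, nA))                     # rewards in [0, 1]
    print("robust policy:\n", np.round(rpmd_sketch(P_nom, R), 3))
```

In this sketch the policy evaluation step uses the nominal kernel directly; the paper's SRPMD variant instead estimates the required first-order information from online interaction with the nominal environment via temporal difference learning.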
