First-order Policy Optimization for Robust Markov Decision Process (2209.10579v2)
Abstract: We consider the problem of solving a robust Markov decision process (MDP), which involves a set of discounted, finite state, finite action space MDPs with uncertain transition kernels. The goal of planning is to find a robust policy that optimizes the worst-case values against the transition uncertainties, and thus encompasses standard MDP planning as a special case. For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several structural observations on the robust objective, which facilitate the development of a policy-based first-order method, namely the robust policy mirror descent (RPMD) method. An $\mathcal{O}(\log(1/\epsilon))$ iteration complexity for finding an $\epsilon$-optimal policy is established with linearly increasing stepsizes. We further develop a stochastic variant of the robust policy mirror descent method, named SRPMD, for the setting where first-order information is available only through online interactions with the nominal environment. We show that the optimality gap converges linearly up to the noise level, and consequently establish an $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity by developing a temporal difference learning method for policy evaluation. Both iteration and sample complexities are also discussed for RPMD with a constant stepsize. To the best of our knowledge, all the aforementioned results appear to be new for policy-based first-order methods applied to the robust MDP problem.
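To make the RPMD update concrete, the sketch below runs a KL-based policy mirror descent step with linearly increasing stepsizes on a tiny tabular robust MDP. The toy problem data, the choice of a finite set of candidate kernels per state-action pair as the $(\mathbf{s},\mathbf{a})$-rectangular uncertainty set, and all function names (`robust_policy_eval`, `rpmd`) are illustrative assumptions, not the paper's exact construction; it is a minimal sketch of the general update form, not the authors' implementation.

```python
# Minimal sketch of robust policy mirror descent (RPMD) on a toy tabular robust MDP.
# Assumptions: a finite collection of candidate transition kernels per (s, a) models the
# (s, a)-rectangular uncertainty set; problem data are random; names are hypothetical.
import numpy as np

gamma = 0.9          # discount factor
n_s, n_a = 3, 2      # finite state / action spaces

rng = np.random.default_rng(0)
r = rng.uniform(size=(n_s, n_a))                                 # rewards r(s, a)

# (s, a)-rectangular uncertainty set: for each (s, a), a finite set of candidate
# transition distributions P(. | s, a); the adversary picks the worst one per (s, a).
n_models = 4
P_set = rng.dirichlet(np.ones(n_s), size=(n_s, n_a, n_models))   # shape (s, a, m, s')

def robust_policy_eval(pi, n_iter=500):
    """Fixed-point iteration on the robust Bellman operator for policy pi."""
    V = np.zeros(n_s)
    for _ in range(n_iter):
        worst_next = (P_set @ V).min(axis=2)      # adversary minimizes over kernels
        Q = r + gamma * worst_next                # robust Q-values, shape (n_s, n_a)
        V = (pi * Q).sum(axis=1)                  # V(s) = E_{a ~ pi(.|s)}[Q(s, a)]
    return V, Q

def rpmd(n_iter=100):
    """Robust policy mirror descent with KL Bregman divergence."""
    pi = np.full((n_s, n_a), 1.0 / n_a)           # uniform initial policy
    for k in range(n_iter):
        _, Q = robust_policy_eval(pi)
        eta = 1.0 * (k + 1)                       # linearly increasing stepsize
        # KL-prox step: pi_{k+1}(a|s) is proportional to pi_k(a|s) * exp(eta * Q(s, a)).
        logits = np.log(pi) + eta * Q
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
    return pi

if __name__ == "__main__":
    pi_star = rpmd()
    V_star, _ = robust_policy_eval(pi_star)
    print("robust values under the learned policy:", np.round(V_star, 3))
```

In this sketch the inner minimization decomposes across state-action pairs because of rectangularity, and the mirror descent step with KL divergence reduces to a multiplicative-weights update of the policy, which is the structure the abstract's linear convergence result exploits.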