On the convex formulations of robust Markov decision processes (2209.10187v2)
Abstract: Robust Markov decision processes (RMDPs) are used for dynamic optimization in uncertain environments and have been studied extensively. Many of the main properties and algorithms for MDPs, such as value iteration and policy iteration, extend directly to RMDPs. Surprisingly, there is no known analog of the MDP convex optimization formulation for solving RMDPs. This work describes the first convex optimization formulation of RMDPs under the classical sa-rectangularity and s-rectangularity assumptions. Using entropic regularization and an exponential change of variables, we derive a convex formulation whose numbers of variables and constraints are polynomial in the numbers of states and actions, but whose constraints have large coefficients. We further simplify the formulation for RMDPs with polyhedral, ellipsoidal, or entropy-based uncertainty sets, showing that, in these cases, RMDPs can be reformulated as conic programs based on exponential cones, quadratic cones, and the nonnegative orthant. Our work opens a new research direction for RMDPs and can serve as a first step toward obtaining a tractable convex formulation of RMDPs.
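For context, the "MDP convex optimization formulation" that the abstract contrasts with the robust case is the classical linear program over the value function: minimize the sum of state values subject to one Bellman inequality per state-action pair. The sketch below illustrates this baseline on a small MDP with made-up transition and reward data (the matrices `P` and `r` are purely illustrative, not from the paper), and cross-checks the LP solution against plain value iteration. It is this LP structure that has no direct RMDP analog until the reformulations the paper derives.

```python
# Classical LP formulation of a discounted MDP:
#   minimize sum_s v(s)
#   subject to v(s) >= r(s,a) + gamma * sum_{s'} P(s'|s,a) v(s')  for all (s,a).
# Illustrative 2-state, 2-action instance with hypothetical data.
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
# P[a, s, s'] = transition probability to s' when taking action a in state s
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
# r[s, a] = immediate reward
r = np.array([[1.0, 0.5],
              [0.0, 2.0]])
n_states, n_actions = r.shape

# One inequality per (s, a), rearranged to linprog's A_ub @ v <= b_ub form:
#   gamma * P(.|s,a) @ v - v(s) <= -r(s,a)
A_ub, b_ub = [], []
for s in range(n_states):
    for a in range(n_actions):
        row = gamma * P[a, s]
        row[s] -= 1.0
        A_ub.append(row)
        b_ub.append(-r[s, a])

res = linprog(c=np.ones(n_states), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)
v_lp = res.x

# Cross-check: value iteration converges to the same optimal value function
v = np.zeros(n_states)
for _ in range(2000):
    q = r + gamma * np.einsum("ast,t->sa", P, v)
    v = q.max(axis=1)
print(np.allclose(v_lp, v, atol=1e-5))
```

Under rectangular uncertainty, the inner maximization over transition kernels makes the analogous constraints nonlinear in the value function, which is why the paper needs entropic regularization and an exponential change of variables to recover convexity.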