Performance Improvement Bounds for Lipschitz Configurable Markov Decision Processes
Abstract: Configurable Markov Decision Processes (Conf-MDPs) have recently been introduced as an extension of traditional Markov Decision Processes (MDPs) to model real-world scenarios in which it is possible to intervene in the environment and configure some of its parameters. In this paper, we focus on a particular subclass of Conf-MDPs that satisfy regularity conditions, namely Lipschitz continuity. We start by providing a bound on the Wasserstein distance between the $\gamma$-discounted stationary distributions induced by changing policy and configuration. This result generalizes the existing bounds for both Conf-MDPs and traditional MDPs. Then, we derive a novel performance improvement lower bound.
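To make the central quantity concrete, the following is a minimal sketch, not the paper's method, of how one might compute the Wasserstein distance between two $\gamma$-discounted stationary distributions in a toy tabular setting. Everything here is an assumption introduced for illustration: the three-state chain, the `slip` parameter standing in for a configuration, the helper names `make_P`, `discounted_state_distribution`, and `wasserstein_1d`, and the choice of states placed on a line so that the 1-Wasserstein distance reduces to an integral of CDF differences.

```python
import numpy as np

def make_P(slip, n_states=3):
    """Hypothetical configurable chain: actions 0/1 try to move left/right;
    with probability `slip` (the configuration) the agent stays put."""
    P = np.zeros((n_states, 2, n_states))
    for s in range(n_states):
        for a, step in enumerate((-1, 1)):
            target = min(max(s + step, 0), n_states - 1)
            P[s, a, target] += 1.0 - slip
            P[s, a, s] += slip
    return P

def discounted_state_distribution(P, pi, mu, gamma):
    """Solve d^T = (1 - gamma) * mu^T (I - gamma * P_pi)^{-1}."""
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel under pi
    n = P_pi.shape[0]
    return (1.0 - gamma) * np.linalg.solve((np.eye(n) - gamma * P_pi).T, mu)

def wasserstein_1d(p, q, positions):
    """1-Wasserstein distance between discrete distributions on sorted
    positions of the real line: sum of |CDF gap| times spacing."""
    cdf_gap = np.abs(np.cumsum(p - q))[:-1]
    return float(np.sum(cdf_gap * np.diff(positions)))

gamma = 0.9
mu = np.array([1.0, 0.0, 0.0])          # start in the leftmost state
positions = np.array([0.0, 1.0, 2.0])   # assumed state embedding on a line

pi_old = np.full((3, 2), 0.5)            # uniform policy
pi_new = np.tile([0.2, 0.8], (3, 1))     # policy that prefers moving right

d_old = discounted_state_distribution(make_P(slip=0.3), pi_old, mu, gamma)
d_new = discounted_state_distribution(make_P(slip=0.1), pi_new, mu, gamma)

print("W1 between discounted distributions:", wasserstein_1d(d_old, d_new, positions))
```

In this sketch both the policy and the configuration change at once, which mirrors the kind of joint change the abstract refers to; the bound itself and its Lipschitz constants are not reproduced here.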