MoMA: Model-based Mirror Ascent for Offline Reinforcement Learning
Abstract: Model-based offline reinforcement learning (RL) methods have achieved state-of-the-art performance in many decision-making problems thanks to their sample efficiency and generalizability. Despite these advancements, existing model-based offline RL approaches either focus on theoretical studies without developing practical algorithms or rely on a restricted parametric policy space, and thus do not fully leverage the unrestricted policy space inherent to model-based methods. To address this limitation, we develop MoMA, a model-based mirror ascent algorithm with general function approximations under partial coverage of offline data. MoMA distinguishes itself from the existing literature by employing an unrestricted policy class. In each iteration, MoMA conservatively estimates the value function by minimizing over a confidence set of transition models in the policy evaluation step, then updates the policy with general function approximations, instead of the commonly used parametric policy classes, in the policy improvement step. Under mild assumptions, we establish theoretical guarantees for MoMA by proving an upper bound on the suboptimality of the returned policy. We also provide a practically implementable, approximate version of the algorithm. The effectiveness of MoMA is demonstrated via numerical studies.
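The iteration described above — pessimistic policy evaluation over a confidence set of transition models, followed by a mirror ascent policy update — can be sketched in a minimal tabular form. This is an illustrative assumption-laden sketch, not the paper's algorithm: a small finite set of candidate models stands in for the statistical confidence set, the mirror map is the negative entropy (yielding an exponentiated-gradient update), and all names are hypothetical.

```python
import numpy as np

def policy_q(P, R, pi, gamma=0.9, iters=200):
    """Q^pi under transition model P (shape S x A x S) via fixed-point iteration."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = (pi * Q).sum(axis=1)   # V(s) = sum_a pi(a|s) Q(s,a)
        Q = R + gamma * P @ V      # Bellman expectation backup
    return Q

def moma_iteration(models, R, pi, eta=1.0, gamma=0.9):
    """One sketch iteration: pessimistic evaluation, then mirror ascent."""
    # Policy evaluation: choose the model in the (finite, assumed) confidence
    # set that minimizes the average value of the current policy.
    values = [(pi * policy_q(P, R, pi, gamma)).sum(axis=1).mean() for P in models]
    P_min = models[int(np.argmin(values))]
    Q = policy_q(P_min, R, pi, gamma)
    # Policy improvement: mirror ascent with a KL (negative-entropy) mirror map,
    # i.e. a multiplicative-weights / exponentiated-gradient update.
    new_pi = pi * np.exp(eta * Q)
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```

In the unrestricted-policy setting of the paper, the update above is exactly what makes the policy class non-parametric: each mirror step reweights the current policy multiplicatively by the (pessimistic) Q-values, with no fixed parametric form imposed.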