MoMA: Model-based Mirror Ascent for Offline Reinforcement Learning

Published 21 Jan 2024 in cs.LG, math.ST, stat.ME, stat.ML, and stat.TH | arXiv:2401.11380v1

Abstract: Model-based offline reinforcement learning (RL) methods have achieved state-of-the-art performance in many decision-making problems thanks to their sample efficiency and generalizability. Despite these advances, existing model-based offline RL approaches either focus on theoretical analysis without developing practical algorithms or rely on a restricted parametric policy class, and thus do not fully exploit the unrestricted policy space inherent to model-based methods. To address this limitation, we develop MoMA, a model-based mirror ascent algorithm with general function approximation under partial coverage of the offline data. MoMA distinguishes itself from the existing literature by employing an unrestricted policy class. In each iteration, MoMA conservatively estimates the value function by minimizing over a confidence set of transition models in the policy evaluation step, then updates the policy via mirror ascent with general function approximation, rather than a commonly used parametric policy class, in the policy improvement step. Under mild assumptions, we establish theoretical guarantees for MoMA by proving an upper bound on the suboptimality of the returned policy. We also provide a practically implementable, approximate version of the algorithm. The effectiveness of MoMA is demonstrated via numerical studies.
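
To make the two-step loop in the abstract concrete, here is a minimal Python sketch under strong simplifying assumptions that are not from the paper: a tiny tabular MDP, a finite confidence set of randomly drawn transition models standing in for the paper's data-driven confidence set, and a negative-entropy mirror map, under which the mirror-ascent update reduces to a softmax-style multiplicative update. All names and constants (n_states, eta, horizon, and so on) are illustrative; the paper's actual algorithm operates with general function approximation.

```python
# A minimal, hypothetical sketch of the pessimistic-evaluation / mirror-ascent
# loop described in the abstract. The finite confidence set and tabular MDP are
# illustrative simplifications, not the paper's general setting.
import numpy as np

n_states, n_actions, gamma, horizon = 4, 2, 0.9, 200
rng = np.random.default_rng(0)
reward = rng.uniform(size=(n_states, n_actions))

def random_model():
    P = rng.uniform(size=(n_states, n_actions, n_states))
    return P / P.sum(axis=-1, keepdims=True)  # normalize to valid transitions

# Hypothetical confidence set: a handful of plausible transition models.
confidence_set = [random_model() for _ in range(5)]

def evaluate(policy, P):
    """Iterative policy evaluation of Q^pi under transition model P."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(horizon):
        V = (policy * Q).sum(axis=1)   # V(s) = E_{a ~ pi}[Q(s, a)]
        Q = reward + gamma * P @ V     # Bellman evaluation update
    return Q

policy = np.full((n_states, n_actions), 1.0 / n_actions)
eta = 0.5  # mirror-ascent step size (illustrative)

for k in range(50):
    # Policy evaluation: pessimistically pick the model in the confidence set
    # that minimizes the policy's value (uniform initial-state distribution).
    values = [(policy * evaluate(policy, P)).sum(axis=1).mean()
              for P in confidence_set]
    Q = evaluate(policy, confidence_set[int(np.argmin(values))])

    # Policy improvement: with a negative-entropy mirror map, mirror ascent
    # becomes a multiplicative update over the probability simplex, with no
    # parametric restriction on the policy.
    policy = policy * np.exp(eta * Q)
    policy /= policy.sum(axis=1, keepdims=True)

print(policy.round(3))
```

With the negative-entropy mirror map, the update pi_{k+1}(a|s) proportional to pi_k(a|s) * exp(eta * Q_k(s, a)) keeps the iterates on the probability simplex without any parametric policy class, which is the "unrestricted policy class" feature the abstract emphasizes.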

Authors (4)
