Optimistic Model Rollouts for Pessimistic Offline Policy Optimization (2401.05899v1)

Published 11 Jan 2024 in cs.LG

Abstract: Model-based offline reinforcement learning (RL) has made remarkable progress, offering a promising avenue for improving generalization with synthetic model rollouts. Existing works primarily focus on incorporating pessimism into policy optimization, usually by constructing a Pessimistic Markov Decision Process (P-MDP). However, the P-MDP discourages policies from learning in out-of-distribution (OOD) regions beyond the support of the offline dataset, which can under-utilize the generalization ability of dynamics models. In contrast, we propose constructing an Optimistic MDP (O-MDP). We first observe the potential benefits of optimism brought by encouraging more OOD rollouts. Motivated by this observation, we present ORPO, a simple yet effective model-based offline RL framework that generates Optimistic model Rollouts for Pessimistic offline policy Optimization. Specifically, we train an optimistic rollout policy in the O-MDP to sample more OOD model rollouts, then relabel the sampled state-action pairs with penalized rewards and optimize the output policy in the P-MDP. Theoretically, we show that the performance of policies trained with ORPO can be lower-bounded in linear MDPs. Experimental results show that our framework outperforms P-MDP baselines by a margin of 30%, achieving state-of-the-art performance on the widely used benchmark. Moreover, ORPO exhibits notable advantages in problems that require generalization.
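
Based only on the abstract's description, the sketch below illustrates the two reward relabelings ORPO relies on: an uncertainty bonus in the O-MDP for training the optimistic rollout policy, and an uncertainty penalty in the P-MDP for optimizing the output policy on the relabeled rollouts. This is a minimal illustration, not the authors' implementation; the ensemble-disagreement uncertainty, the toy linear dynamics, and the coefficients `beta` and `lam` are assumptions made for exposition.

```python
# Minimal sketch of ORPO-style reward relabeling, as described in the abstract.
# NOTE: not the authors' code; the ensemble-disagreement uncertainty, the toy
# linear dynamics, and the coefficients beta / lam are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, N_MODELS = 3, 2, 5

# Toy ensemble of linear dynamics models standing in for a learned model.
ensemble_W = [rng.normal(size=(STATE_DIM, STATE_DIM + ACTION_DIM)) * 0.1
              for _ in range(N_MODELS)]

def predict_next_states(s, a):
    """Each ensemble member predicts the next state for (s, a)."""
    x = np.concatenate([s, a])
    return np.stack([W @ x for W in ensemble_W])

def uncertainty(s, a):
    """Assumed uncertainty estimate: disagreement (std) across the ensemble."""
    return float(predict_next_states(s, a).std(axis=0).max())

def optimistic_reward(r, u, beta=1.0):
    """O-MDP reward: bonus on uncertain (OOD) pairs, for the rollout policy."""
    return r + beta * u

def pessimistic_reward(r, u, lam=1.0):
    """P-MDP reward: penalty on the same uncertainty, for the output policy."""
    return r - lam * u

# Generate a short model rollout with a placeholder random rollout policy,
# then keep both relabelings: optimistic for the rollout policy's updates,
# pessimistic for the output policy's updates.
s = rng.normal(size=STATE_DIM)
optimistic_batch, pessimistic_batch = [], []
for _ in range(5):
    a = rng.normal(size=ACTION_DIM)       # stand-in for the rollout policy
    u = uncertainty(s, a)
    r = float(-np.sum(s ** 2))            # stand-in task reward
    s_next = predict_next_states(s, a).mean(axis=0)
    optimistic_batch.append((s, a, optimistic_reward(r, u), s_next))
    pessimistic_batch.append((s, a, pessimistic_reward(r, u), s_next))
    s = s_next

print(f"collected {len(pessimistic_batch)} pessimistically relabeled transitions")
```

In a full pipeline, both batches would feed off-policy actor-critic learners (e.g. SAC-style updates) rather than the placeholder random policy used here; the sketch only shows how the same model rollouts can be scored optimistically and pessimistically from one uncertainty estimate.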

Authors (8)
  1. Yuanzhao Zhai (10 papers)
  2. Yiying Li (12 papers)
  3. Zijian Gao (22 papers)
  4. Xudong Gong (4 papers)
  5. Kele Xu (62 papers)
  6. Dawei Feng (19 papers)
  7. Ding Bo (1 paper)
  8. Huaimin Wang (37 papers)
Citations (2)
