Optimistic Model Rollouts for Pessimistic Offline Policy Optimization (2401.05899v1)
Abstract: Model-based offline reinforcement learning (RL) has made remarkable progress, offering a promising avenue for improving generalization with synthetic model rollouts. Existing work primarily focuses on incorporating pessimism into policy optimization, usually by constructing a Pessimistic Markov Decision Process (P-MDP). However, the P-MDP discourages policies from learning in out-of-distribution (OOD) regions beyond the support of the offline dataset, which can under-utilize the generalization ability of dynamics models. In contrast, we propose constructing an Optimistic MDP (O-MDP). We first observe the potential benefits of optimism when more OOD rollouts are encouraged. Motivated by this observation, we present ORPO, a simple yet effective model-based offline RL framework that generates Optimistic model Rollouts for Pessimistic offline policy Optimization. Specifically, we train an optimistic rollout policy in the O-MDP to sample more OOD model rollouts. We then relabel the sampled state-action pairs with penalized rewards and optimize the output policy in the P-MDP. Theoretically, we show that the performance of policies trained with ORPO can be lower-bounded in linear MDPs. Experimental results show that our framework significantly outperforms P-MDP baselines by a margin of 30%, achieving state-of-the-art performance on a widely used benchmark. Moreover, ORPO exhibits notable advantages in problems that require generalization.
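The recipe described in the abstract can be read as a reward-relabeling step around a learned dynamics model: add an uncertainty bonus when training the rollout policy in the O-MDP, then relabel the same transitions with an uncertainty penalty when training the output policy in the P-MDP. The sketch below is only an illustration of that idea; the ensemble-disagreement uncertainty, the coefficients `beta` and `lam`, and all function names are assumptions made here, not the authors' implementation.

```python
# Minimal sketch of optimistic vs. pessimistic reward relabeling, assuming an
# ensemble of dynamics models whose disagreement serves as the uncertainty signal.
import numpy as np

def ensemble_disagreement(models, states, actions):
    """Heuristic uncertainty: std. of next-state predictions across an ensemble."""
    preds = np.stack([m(states, actions) for m in models])  # (n_models, batch, state_dim)
    return preds.std(axis=0).mean(axis=-1)                  # (batch,)

def optimistic_rewards(rewards, uncertainty, beta=1.0):
    """O-MDP relabeling: an uncertainty bonus encourages OOD rollouts."""
    return rewards + beta * uncertainty

def pessimistic_rewards(rewards, uncertainty, lam=1.0):
    """P-MDP relabeling: an uncertainty penalty keeps the output policy pessimistic."""
    return rewards - lam * uncertainty

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy "ensemble": four random linear maps from (state, action) to next state.
    models = [
        (lambda s, a, W=rng.normal(size=(5, 3)): np.concatenate([s, a], axis=-1) @ W)
        for _ in range(4)
    ]
    s = rng.normal(size=(8, 3))        # batch of states
    a = rng.normal(size=(8, 2))        # batch of actions
    r = rng.normal(size=8)             # model-predicted rewards
    u = ensemble_disagreement(models, s, a)
    r_opt = optimistic_rewards(r, u)   # used when training the rollout policy (O-MDP)
    r_pes = pessimistic_rewards(r, u)  # used when relabeling for the output policy (P-MDP)
    print(r_opt - r_pes)               # gap = (beta + lam) * uncertainty
```

In this reading, the rollout policy and the output policy see the same model-generated transitions and differ only in the sign of the uncertainty term applied to the reward.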
Authors: Yuanzhao Zhai, Yiying Li, Zijian Gao, Xudong Gong, Kele Xu, Dawei Feng, Bo Ding, Huaimin Wang