Optimistic Model Rollouts for Pessimistic Offline Policy Optimization (2401.05899v1)
Abstract: Model-based offline reinforcement learning (RL) has made remarkable progress, offering a promising avenue for improving generalization with synthetic model rollouts. Existing work primarily focuses on incorporating pessimism into policy optimization, usually by constructing a Pessimistic Markov Decision Process (P-MDP). However, the P-MDP discourages policies from learning in out-of-distribution (OOD) regions beyond the support of the offline dataset, which can under-utilize the generalization ability of dynamics models. In contrast, we propose constructing an Optimistic MDP (O-MDP). We first observe the potential benefits of optimism when more OOD rollouts are encouraged. Motivated by this observation, we present ORPO, a simple yet effective model-based offline RL framework that generates Optimistic model Rollouts for Pessimistic offline policy Optimization. Specifically, we train an optimistic rollout policy in the O-MDP to sample more OOD model rollouts. We then relabel the sampled state-action pairs with penalized rewards and optimize the output policy in the P-MDP. Theoretically, we show that the performance of policies trained with ORPO can be lower-bounded in linear MDPs. Experimental results show that our framework significantly outperforms P-MDP baselines by a margin of 30%, achieving state-of-the-art performance on a widely used benchmark. Moreover, ORPO exhibits notable advantages in problems that require generalization.
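The recipe described in the abstract can be read as a reward-relabeling step around a learned dynamics model: add an uncertainty bonus when training the rollout policy in the O-MDP, then relabel the same transitions with an uncertainty penalty when training the output policy in the P-MDP. The sketch below is only an illustration of that idea; the ensemble-disagreement uncertainty, the coefficients `beta` and `lam`, and all function names are assumptions made here, not the authors' implementation.

```python
# Minimal sketch of optimistic vs. pessimistic reward relabeling, assuming an
# ensemble of dynamics models whose disagreement serves as the uncertainty signal.
import numpy as np

def ensemble_disagreement(models, states, actions):
    """Heuristic uncertainty: std. of next-state predictions across an ensemble."""
    preds = np.stack([m(states, actions) for m in models])  # (n_models, batch, state_dim)
    return preds.std(axis=0).mean(axis=-1)                  # (batch,)

def optimistic_rewards(rewards, uncertainty, beta=1.0):
    """O-MDP relabeling: an uncertainty bonus encourages OOD rollouts."""
    return rewards + beta * uncertainty

def pessimistic_rewards(rewards, uncertainty, lam=1.0):
    """P-MDP relabeling: an uncertainty penalty keeps the output policy pessimistic."""
    return rewards - lam * uncertainty

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy "ensemble": four random linear maps from (state, action) to next state.
    models = [
        (lambda s, a, W=rng.normal(size=(5, 3)): np.concatenate([s, a], axis=-1) @ W)
        for _ in range(4)
    ]
    s = rng.normal(size=(8, 3))        # batch of states
    a = rng.normal(size=(8, 2))        # batch of actions
    r = rng.normal(size=8)             # model-predicted rewards
    u = ensemble_disagreement(models, s, a)
    r_opt = optimistic_rewards(r, u)   # used when training the rollout policy (O-MDP)
    r_pes = pessimistic_rewards(r, u)  # used when relabeling for the output policy (P-MDP)
    print(r_opt - r_pes)               # gap = (beta + lam) * uncertainty
```

In this reading, the rollout policy and the output policy see the same model-generated transitions and differ only in the sign of the uncertainty term applied to the reward.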
Authors: Yuanzhao Zhai, Yiying Li, Zijian Gao, Xudong Gong, Kele Xu, Dawei Feng, Bo Ding, Huaimin Wang