
Corruption-Robust Offline Reinforcement Learning (2106.06630v1)

Published 11 Jun 2021 in cs.LG

Abstract: We study adversarial robustness in offline reinforcement learning. Given a batch dataset consisting of tuples $(s, a, r, s')$, an adversary is allowed to arbitrarily modify an $\epsilon$ fraction of the tuples. From the corrupted dataset, the learner aims to robustly identify a near-optimal policy. We first show that a worst-case $\Omega(d\epsilon)$ optimality gap is unavoidable in linear MDPs of dimension $d$, even if the adversary only corrupts the reward element of a tuple. This contrasts with dimension-free results in robust supervised learning and with the best-known lower bound in the online RL setting with corruption. Next, we propose robust variants of the Least-Squares Value Iteration (LSVI) algorithm that utilize robust supervised learning oracles and achieve near-matching performance both with and without full data coverage. The algorithm requires knowledge of $\epsilon$ to design the pessimism bonus in the no-coverage case. Surprisingly, in this case, knowledge of $\epsilon$ is necessary: we show that being adaptive to an unknown $\epsilon$ is impossible. This again contrasts with recent results on corruption-robust online RL and implies that robust offline RL is a strictly harder problem.
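
The abstract describes an algorithmic template: backward least-squares value iteration in which the regression step is replaced by a robust supervised learning oracle and the Q-estimates are shifted down by a pessimism bonus whose scale depends on $\epsilon$. The sketch below illustrates that template under illustrative assumptions: a finite-horizon, finite-action setup, trimmed least squares as a stand-in robust oracle, and an $\epsilon$-scaled elliptical-norm bonus. The function names and exact bonus form are hypothetical, not the paper's construction.

```python
import numpy as np

def trimmed_least_squares(Phi, y, eps, iters=5, lam=1.0):
    """Stand-in robust regression oracle (assumption): refit ridge regression
    after discarding the eps-fraction of samples with the largest residuals."""
    n, d = Phi.shape
    keep = np.arange(n)
    w = np.zeros(d)
    for _ in range(iters):
        A = Phi[keep].T @ Phi[keep] + lam * np.eye(d)
        w = np.linalg.solve(A, Phi[keep].T @ y[keep])
        resid = np.abs(Phi @ w - y)
        keep = np.argsort(resid)[: max(1, int(np.ceil((1 - eps) * n)))]
    return w

def robust_pessimistic_lsvi(data, phi, actions, H, eps, c_bonus=1.0, lam=1.0):
    """Hypothetical sketch of robust LSVI with pessimism.
    data[h]: list of (s, a, r, s_next) tuples collected at step h.
    phi(s, a): feature map into R^d; actions: finite action set.
    Returns per-step (w_h, Lambda_h_inv) defining pessimistic Q-estimates."""
    d = len(phi(*data[0][0][:2]))
    params = [None] * H

    def Q(w, Lam_inv, s, a):
        x = phi(s, a)
        # Pessimism bonus scaled by the corruption level eps (assumed form).
        bonus = c_bonus * eps * np.sqrt(x @ Lam_inv @ x)
        return float(x @ w) - bonus

    def V(h, s):
        if h == H:          # terminal value is zero
            return 0.0
        w, Lam_inv = params[h]
        return max(Q(w, Lam_inv, s, a) for a in actions)

    for h in reversed(range(H)):
        Phi = np.stack([phi(s, a) for (s, a, r, s_next) in data[h]])
        # Bellman regression targets built from the next-step pessimistic value.
        y = np.array([r + V(h + 1, s_next) for (s, a, r, s_next) in data[h]])
        w = trimmed_least_squares(Phi, y, eps, lam=lam)
        Lam_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(d))
        params[h] = (w, Lam_inv)
    return params
```

Note that the bonus construction above presupposes that $\epsilon$ is known, mirroring the paper's point that adaptivity to an unknown corruption level is impossible in the no-coverage case.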

Authors (4)
  1. Xuezhou Zhang (36 papers)
  2. Yiding Chen (15 papers)
  3. Jerry Zhu (3 papers)
  4. Wen Sun (124 papers)
Citations (36)
