- The paper introduces a corruption-robust exploration strategy that modifies optimism with action elimination to ensure reliable performance in episodic RL.
- It presents a hierarchical framework of base learners that maintains efficient regret bounds in both tabular and linear MDP settings despite adversarial corruption.
- The approach offers practical insights for robust decision-making in real-world applications like robotics and healthcare, paving the way for future research in non-stationary environments.
Corruption-Robust Exploration in Episodic Reinforcement Learning
Episodic Reinforcement Learning (RL) is a fundamental paradigm in which learning occurs over discrete episodes of interaction with an environment. The typical assumption is that the transitions and rewards are drawn from stationary, i.i.d. distributions. However, real-world applications frequently violate these assumptions due to various corrupting influences such as fraud, adversarial attacks, or non-compliance in experimental settings, leading to degraded performance of standard algorithms. The paper "Corruption-robust exploration in episodic reinforcement learning" focuses on establishing robust methods for RL that can handle adversarial corruptions in both the rewards and the state-transition dynamics.
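To make the corruption model concrete, here is a minimal, hypothetical sketch of an episodic interaction loop in which an adversary may perturb the observed reward on some episodes (the paper also allows transition corruption, which could be modeled analogously). The function names and structure are illustrative assumptions, not the paper's formalism.

```python
def run_episode(policy, env_step, horizon, corrupt=None):
    """Interact with the environment for one episode of length `horizon`.

    `corrupt`, if given, is an adversary that may perturb the observed
    reward; transitions could be corrupted in the same way.
    """
    state, total = 0, 0.0
    for h in range(horizon):
        action = policy(state, h)
        state, reward = env_step(state, action)
        if corrupt is not None:
            reward = corrupt(state, action, reward)  # adversarial perturbation
        total += reward
    return total
```

A learner that trusts the observed rewards blindly can be steered arbitrarily by such an adversary, which is what motivates the robust design below.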
Problem and Objectives
The paper addresses the challenge of developing corruption-robust algorithms for episodic RL. These algorithms need to maintain performance close to optimal in the uncontaminated setting and degrade gracefully in the presence of adversarial corruption. The authors aim to offer a framework that incorporates robustness into the optimistic exploration strategies prevalent in RL, adapting them to handle unknown levels of corruption.
Methodology
The methodology revolves around modifying the "optimism in the face of uncertainty" principle, which traditionally selects actions based on optimistic estimates of their value. The paper augments this principle with strategies from "action elimination," in which unpromising actions are systematically removed once confidence bounds rule them out.
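The action-elimination idea can be sketched in a simple multi-armed bandit setting (this is an illustration of the general technique with Hoeffding-style confidence radii, not the paper's episodic algorithm):

```python
import math

def eliminate_actions(means, counts, t, delta=0.05):
    """Keep only actions whose upper confidence bound still exceeds the
    best lower confidence bound; all others are provably suboptimal
    with probability at least 1 - delta (Hoeffding-style radius)."""
    k = len(means)
    radius = lambda n: math.sqrt(math.log(2 * k * t / delta) / max(n, 1))
    lcb = [m - radius(n) for m, n in zip(means, counts)]
    ucb = [m + radius(n) for m, n in zip(means, counts)]
    best_lcb = max(lcb)
    return [a for a in range(k) if ucb[a] >= best_lcb]
```

For example, after 100 pulls per arm, an arm with empirical mean 0.1 is eliminated once another arm's lower bound exceeds its upper bound, while arms with overlapping confidence intervals are all retained.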
Algorithmic Framework
The paper introduces a hierarchical approach involving multiple layers of base learners. Each learner operates with progressively increasing tolerance to corruption, and its dataset is subsampled to match this tolerance. The essential feature is a supervision hierarchy, where more robust base learners maintain confidence bounds that help less robust ones avoid corruption-induced error propagation, while ensuring efficient regret bounds even under major corruptions.
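The subsampling idea behind the hierarchy can be sketched as follows: if layer l processes each episode only with probability 2**-l, then a layer designed to tolerate corruption level around 2**l sees only O(1) corrupted episodes in expectation. This is a minimal sketch of that intuition under assumed naming, not the paper's exact scheduling scheme.

```python
import random

def assign_layers(num_episodes, num_layers, seed=0):
    """Assign episodes to the datasets of the base learners.

    Layer l subsamples each episode independently with probability
    2**-l, so more robust (higher) layers see geometrically fewer
    episodes -- and hence geometrically fewer corrupted ones.
    """
    rng = random.Random(seed)
    datasets = [[] for _ in range(num_layers)]
    for ep in range(num_episodes):
        for l in range(num_layers):
            if rng.random() < 2 ** -l:
                datasets[l].append(ep)
    return datasets
```

Layer 0 keeps every episode (fast learning, fragile), while higher layers trade statistical efficiency for robustness; the supervision hierarchy lets the robust layers veto corrupted conclusions of the fast ones.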
Analysis
The authors provide a comprehensive analytical framework that formalizes the relationship between optimistic and supervised (action-elimination) algorithms. This includes establishing confidence-admissibility conditions that enable safe action selection even in corrupted environments. The analysis provides statistical guarantees in both the tabular and the linear-function-approximation setting, ensuring efficient regret rates for episodic RL with non-i.i.d. transitions.
Results
The proposed algorithms achieve the following results:
- Tabular Setting: The modified algorithm attains regret that scales with the number of states, actions, and episodes and degrades gracefully with the corruption level C. Importantly, the work establishes logarithmic gap-dependent regret bounds, which had not previously been obtained for corruption-robust RL.
- Linear MDP Setting: Regret depends polynomially on the feature dimension d rather than on the size of the state/action space, enabling efficient exploration and learning even when the linear representation is perturbed by adversarial influences.
Implications and Future Directions
The paper has significant implications for understanding exploration in adversarial environments and for building resilience into RL frameworks. Practically, this strengthens the case for RL in multiple domains, from robotics (where environments can be adversarial or change dynamically) to healthcare (where patient non-compliance can constitute a form of corruption).
Theoretically, this work calls for further investigation into algorithmic strategies for handling non-stationarity and adversarial corruption in broader settings beyond episodic RL.
Future directions may include:
- Extending the models to infinite-horizon frameworks or continuous-space environments where corruptions, dynamic changes, and adversarial inputs are prevalent.
- Exploring the integration of multi-task and transfer learning to leverage learned robustness across various corrupted settings.
This paper thus represents a significant step in making reinforcement learning robust to real-world challenges, contributing both practical algorithms and theoretical insights into the field of RL under non-stationary and adversarial conditions.