Safe Policy Improvement with Baseline Bootstrapping (1712.06924v5)

Published 19 Dec 2017 in cs.LG, cs.AI, and stat.ML

Abstract: This paper considers Safe Policy Improvement (SPI) in Batch Reinforcement Learning (Batch RL): from a fixed dataset and without direct access to the true environment, train a policy that is guaranteed to perform at least as well as the baseline policy used to collect the data. Our approach, called SPI with Baseline Bootstrapping (SPIBB), is inspired by the knows-what-it-knows paradigm: it bootstraps the trained policy with the baseline when the uncertainty is high. Our first algorithm, $\Pi_b$-SPIBB, comes with SPI theoretical guarantees. We also implement a variant, $\Pi_{\leq b}$-SPIBB, that is even more efficient in practice. We apply our algorithms to a motivational stochastic gridworld domain and further demonstrate on randomly generated MDPs the superiority of SPIBB with respect to existing algorithms, not only in safety but also in mean performance. Finally, we implement a model-free version of SPIBB and show its benefits on a navigation task with deep RL implementation called SPIBB-DQN, which is, to the best of our knowledge, the first RL algorithm relying on a neural network representation able to train efficiently and reliably from batch data, without any interaction with the environment.

Citations (190)

Summary

  • The paper proposes Safe Policy Improvement with Baseline Bootstrapping (SPIBB), a novel methodology ensuring learned policies in Batch Reinforcement Learning perform at least as well as the baseline policy.
  • SPIBB offers theoretical guarantees of safe policy improvement through methods like $\Pi_b$-SPIBB and $\Pi_{\leq b}$-SPIBB, balancing safety constraints with empirical efficiency.
  • Empirical evaluations demonstrate that SPIBB methods achieve competitive performance while consistently maintaining safety guarantees, extending to a model-free deep reinforcement learning variant, SPIBB-DQN.

Safe Policy Improvement with Baseline Bootstrapping

The paper "Safe Policy Improvement with Baseline Bootstrapping" addresses the challenge of achieving safe policy improvements in Batch Reinforcement Learning (Batch RL). Batch RL is characterized by operating in offline settings, where interactions with the environment during training are not allowed. This presents unique challenges, as the training must proceed based solely on a fixed dataset collected under a known baseline policy. The authors propose a novel methodology named Safe Policy Improvement with Baseline Bootstrapping (SPIBB), designed to ensure that any new policy trained from this dataset performs at least as well as the baseline policy.

Key to the SPIBB framework is the notion of bootstrapping on the baseline when the uncertainty about the learned policy's performance is high. This approach follows the knows-what-it-knows paradigm: in parts of the state-action space where data are scarce, the learned policy simply reproduces the baseline, so it only deviates where the dataset provides enough evidence to do so safely.

The authors introduce two main implementations of SPIBB. The first, $\Pi_b$-SPIBB, comes with theoretical guarantees of Safe Policy Improvement (SPI). It constrains the learned policy to copy the baseline's probabilities on the bootstrapped set, the state-action pairs whose dataset counts fall below a threshold $N_\wedge$, and to deviate from the baseline only where the data are sufficiently informative. The second, $\Pi_{\leq b}$-SPIBB, relaxes this constraint: on bootstrapped pairs the learned policy may assign any probability up to (but not exceeding) the baseline's, which proves even more efficient in empirical tests.
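
As a concrete illustration, here is a minimal sketch of the greedy step used by $\Pi_b$-SPIBB in a tabular setting. It assumes that action-value estimates in the estimated MDP, the baseline policy, and dataset visit counts are available as arrays; the function and variable names (e.g., `spibb_greedy_step`, `n_wedge`) are illustrative, not taken from the authors' code.

```python
import numpy as np

def spibb_greedy_step(q_hat, pi_b, counts, n_wedge):
    """One greedy improvement step in the spirit of Pi_b-SPIBB (sketch).

    q_hat:   (S, A) action-value estimates of the current policy in the estimated MDP
    pi_b:    (S, A) baseline policy probabilities
    counts:  (S, A) state-action visit counts in the batch
    n_wedge: bootstrapping threshold N_wedge
    """
    S, A = q_hat.shape
    pi = np.zeros((S, A))
    bootstrapped = counts < n_wedge              # low-count pairs: trust the baseline
    for s in range(S):
        b = bootstrapped[s]
        pi[s, b] = pi_b[s, b]                    # copy baseline mass on bootstrapped actions
        free_mass = 1.0 - pi[s, b].sum()         # probability mass the learner may reallocate
        non_boot = np.flatnonzero(~b)
        if non_boot.size > 0:
            best = non_boot[np.argmax(q_hat[s, non_boot])]
            pi[s, best] += free_mass             # greedy only among trusted actions
    return pi
```

Alternating this projection with policy evaluation in the estimated MDP yields the $\Pi_b$-SPIBB policy; by construction, every row of `pi` sums to one and matches the baseline wherever the counts are too low.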

Empirical evaluations were conducted in simulated environments, including a benchmark gridworld setup and randomly generated MDPs, demonstrating that the SPIBB approaches achieve competitive mean performance while consistently providing safety guarantees. Notably, the SPIBB algorithms maintained at least baseline-level performance and outperformed the baseline in expected return wherever the data coverage was adequate.

Additionally, SPIBB extends into deep reinforcement learning with a model-free variant, SPIBB-DQN, which the authors present as the first neural-network-based RL algorithm able to train efficiently and reliably from batch data without any interaction with the environment.
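
To make the idea concrete, the following is a hedged sketch of how a SPIBB-style Bellman target can be formed for a single transition, assuming per-action (pseudo-)counts for the next state are available (e.g., from a count or density model); names such as `spibb_dqn_target` are illustrative. Bootstrapped next-state actions contribute through the baseline's probabilities, and only the remaining probability mass is assigned greedily.

```python
import numpy as np

def spibb_dqn_target(r, gamma, q_next, pi_b_next, counts_next, n_wedge, done):
    """SPIBB-style target for one transition (illustrative sketch).

    q_next:      (A,) target-network Q-values at the next state
    pi_b_next:   (A,) baseline action probabilities at the next state
    counts_next: (A,) (pseudo-)counts of the next state's actions in the batch
    """
    if done:
        return r
    boot = counts_next < n_wedge
    # bootstrapped actions are weighted by the baseline's probabilities
    boot_term = float(np.sum(pi_b_next[boot] * q_next[boot]))
    # the remaining probability mass goes to the best non-bootstrapped action
    free_mass = float(np.sum(pi_b_next[~boot]))
    greedy_term = free_mass * float(np.max(q_next[~boot])) if np.any(~boot) else 0.0
    return r + gamma * (boot_term + greedy_term)
```

The standard DQN target is recovered when no next-state action is bootstrapped, and the target collapses to the baseline's expected value when every action is.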

The theoretical underpinning of SPIBB provides a finite-sample guarantee in a PAC (Probably Approximately Correct) style: the policy improvement is approximately safe, with any shortfall relative to the baseline's performance provably bounded. In practice, the trade-off between safety and potential improvement is governed by the count threshold $N_\wedge$ that defines the bootstrapped set: a larger threshold bootstraps more pairs and is more conservative, while a smaller threshold permits larger, riskier deviations from the baseline. This makes the method not only theoretically sound but also practically tunable across various domains.
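
Stated loosely (the paper gives the exact constants and conditions), the guarantee has the following shape: with probability at least $1-\delta$,
$$\rho(\pi_{\text{SPIBB}}, M^*) \;\ge\; \rho(\pi_b, M^*) \;-\; \mathcal{O}\!\left(\frac{V_{\max}}{1-\gamma}\sqrt{\frac{1}{N_\wedge}\log\frac{|\mathcal{S}||\mathcal{A}|}{\delta}}\right),$$
where $\rho(\cdot, M^*)$ denotes the expected return in the true MDP and $N_\wedge$ is the count threshold defining the bootstrapped set, so raising $N_\wedge$ tightens the safety bound at the cost of a more conservative policy.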

Future research directions proposed include further exploration of model-free implementations, scaling to larger and more complex state-action spaces, and applying robust estimation techniques to improve the generalization from batch data. The SPIBB methodology charts a promising path toward broader deployment of reinforcement learning in real-world applications where safety and reliability are paramount.