
Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences (2002.09089v4)

Published 21 Feb 2020 in cs.LG and stat.ML

Abstract: Bayesian reward learning from demonstrations enables rigorous safety and uncertainty analysis when performing imitation learning. However, Bayesian reward learning methods are typically computationally intractable for complex control problems. We propose Bayesian Reward Extrapolation (Bayesian REX), a highly efficient Bayesian reward learning algorithm that scales to high-dimensional imitation learning problems by pre-training a low-dimensional feature encoding via self-supervised tasks and then leveraging preferences over demonstrations to perform fast Bayesian inference. Bayesian REX can learn to play Atari games from demonstrations, without access to the game score and can generate 100,000 samples from the posterior over reward functions in only 5 minutes on a personal laptop. Bayesian REX also results in imitation learning performance that is competitive with or better than state-of-the-art methods that only learn point estimates of the reward function. Finally, Bayesian REX enables efficient high-confidence policy evaluation without having access to samples of the reward function. These high-confidence performance bounds can be used to rank the performance and risk of a variety of evaluation policies and provide a way to detect reward hacking behaviors.

Authors (4)
  1. Daniel S. Brown (46 papers)
  2. Russell Coleman (1 paper)
  3. Ravi Srinivasan (4 papers)
  4. Scott Niekum (67 papers)
Citations (96)

Summary

  • The paper presents Bayesian REX, a novel method that utilizes pairwise preferences to efficiently sample from the posterior reward distribution.
  • It uses pre-training to encode demonstrations into a low-dimensional latent space, enabling rapid Bayesian inference and safe policy evaluation.
  • Experimental results on Atari games show that Bayesian REX achieves performance that can exceed that of the human demonstrators, with posterior sampling running on standard hardware.

Review of "Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences"

The paper, "Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences," presents a novel approach to address computational challenges associated with Bayesian reward learning in imitation learning. This approach, termed Bayesian Reward Extrapolation (Bayesian REX), adapts Bayesian inference techniques to high-dimensional imitation learning tasks by utilizing pairwise preferences over demonstrations. The proposed methodology is distinctive in its ability to efficiently generate samples from the posterior distribution over reward functions, which is central to implementing safety and uncertainty analysis in autonomous systems learning from human demonstrations.

Methodology and Results

Bayesian REX combines a self-supervised pre-training stage, which encodes demonstrations into a low-dimensional latent feature space, with fast Bayesian inference conditioned on preferences over those demonstrations. The algorithm pre-trains a neural network to capture relevant latent encodings of the demonstrated behavior, allowing the final Bayesian inference to operate predominantly over linear reward weights in this reduced feature space. This circumvents the computational bottleneck of traditional Bayesian IRL approaches, which must repeatedly solve an MDP in their inner loop, a step that is computationally prohibitive in high-dimensional environments.
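
As a minimal sketch of this fast inference stage (reusing the preference log-likelihood from the previous snippet), a random-walk Metropolis-Hastings sampler over linear reward weights in the pretrained latent space might look as follows. The proposal scheme is a simplification and all names are hypothetical, not the paper's exact sampler.

```python
import numpy as np

def mcmc_reward_posterior(prefs, phis, dim, n_samples=100_000, step=0.05, seed=0):
    """Random-walk Metropolis over unit-norm reward weights in the pretrained
    latent feature space. `prefs` holds (i, j) pairs meaning trajectory j was
    preferred to trajectory i; `phis[k]` is the summed latent feature vector
    of trajectory k. Illustrative sketch only."""
    rng = np.random.default_rng(seed)

    def log_posterior(w):
        # Uniform prior over the unit sphere; likelihood from pairwise preferences.
        return sum(preference_log_likelihood(w, phis[i], phis[j]) for i, j in prefs)

    w = rng.standard_normal(dim)
    w /= np.linalg.norm(w)
    log_p = log_posterior(w)
    samples = []
    for _ in range(n_samples):
        proposal = w + step * rng.standard_normal(dim)
        proposal /= np.linalg.norm(proposal)            # keep weights on the unit sphere
        log_p_new = log_posterior(proposal)
        if np.log(rng.random()) < log_p_new - log_p:    # Metropolis accept/reject
            w, log_p = proposal, log_p_new
        samples.append(w.copy())
    return np.asarray(samples)
```

Because each likelihood evaluation reduces to a handful of dot products in the low-dimensional latent space, drawing on the order of 100,000 posterior samples is inexpensive, which is the source of the laptop-scale runtimes reported in the paper.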

The paper demonstrates that Bayesian REX learns reward functions supporting imitation learning performance on par with or exceeding existing state-of-the-art methods that rely on point-estimate reward learning. Numerical results include learning to play Atari games solely from preferences over demonstrations, with the system achieving better-than-demonstrator performance. These results also exemplify the method's robustness and computational efficiency: Bayesian REX can draw 100,000 samples from the posterior over reward functions in roughly five minutes on a personal laptop.

Implications

By facilitating efficient high-confidence policy evaluation without direct access to samples of the reward function, Bayesian REX brings explicit consideration of safety and reward-hacking detection into imitation learning. The use of high-confidence performance bounds also has significant implications for ranking evaluation policies and ensuring alignment with human intent, a critical concern when deploying autonomous systems in real-world scenarios.
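
A minimal sketch of how such a bound can be computed from posterior samples, assuming a linear reward over latent features so that a policy's expected return is the dot product of the reward weights with the policy's expected cumulative feature counts; the names are illustrative.

```python
import numpy as np

def high_confidence_return_bound(weight_samples, policy_feature_counts, alpha=0.05):
    """Given posterior reward-weight samples (one per row) and a policy's
    expected cumulative latent feature counts, return the alpha-quantile of
    the induced return distribution: a (1 - alpha) high-confidence lower
    bound on the policy's performance. Illustrative names."""
    returns = weight_samples @ policy_feature_counts   # one return per posterior sample
    return np.quantile(returns, alpha)
```

Policies can then be ranked by this lower bound rather than by a single point estimate, and a large gap between a policy's mean return and its bound can flag potential reward hacking.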

Theoretical and Practical Significance

Theoretically, Bayesian REX strengthens the bridge between Bayesian inference and practical imitation learning, especially where sample-efficient and computationally feasible reward learning is critical. Practically, it could significantly reduce the operational overhead of deploying and training AI systems in high-dimensional settings such as real-time video game control and robotics, thereby enabling broader applications in environments that demand real-time decision-making.

Future Directions

Further exploration into enhancing the interpretability of learned reward features, augmenting the robustness of Bayesian inference through adaptive pre-training strategies, and integrating active learning frameworks could propel the evolution of methodologies such as Bayesian REX. Additionally, examining the scalability of Bayesian REX across broader, more diverse application domains, coupled with reinforcement learning, could widen its scope and utility.

Bayesian REX stands as a promising contribution to imitation learning, leveraging preferences in a principled Bayesian framework to offer a scalable and computationally efficient approach to safe learning from demonstrations in complex environments.
