An Expert Overview of PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training
The paper introduces PEBBLE, an algorithm designed to improve feedback efficiency in interactive reinforcement learning (RL). It combines unsupervised pre-training with off-policy, preference-based learning so that the limited feedback a human teacher can provide is used as effectively as possible.
PEBBLE addresses a common challenge in RL: effectively communicating complex objectives to an agent. Traditional approaches often rely on precisely designed reward functions, which can be intricate to construct and prone to exploitation by the agent. Human-in-the-loop (HiL) methods, where humans provide direct feedback to guide the agent, have emerged as a promising alternative. Yet, these methods are limited by the labor-intensive nature of continuous human feedback, making scalability an issue. PEBBLE proposes a more efficient process by integrating unsupervised exploration during the pre-training phase and relabeling experiences following any updates to the reward model.
The algorithm proceeds in two phases: unsupervised pre-training followed by ongoing preference-based feedback. During pre-training, the agent explores diverse states autonomously, driven by an intrinsic reward based on state entropy, which PEBBLE estimates with a particle-based (k-nearest-neighbor) entropy estimator and which requires no supervision. As a result, by the time human interaction begins, the agent already exhibits varied, informative behaviors over which preferences can be queried effectively.
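To make the pre-training step concrete, here is a minimal sketch of a particle-based (k-nearest-neighbor) entropy intrinsic reward, assuming states are collected as a batch of vectors; the function name knn_intrinsic_reward, the choice of k, and the +1 inside the logarithm are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def knn_intrinsic_reward(states: np.ndarray, k: int = 5) -> np.ndarray:
    """Sketch of a particle-based entropy intrinsic reward.

    Each state's reward grows with the distance to its k-th nearest
    neighbor in the batch, so states far from everything already seen
    (novel states) receive higher intrinsic reward.

    states: array of shape (N, state_dim), with N > k.
    """
    # Pairwise Euclidean distances between all states in the batch.
    diffs = states[:, None, :] - states[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)          # shape (N, N)
    # Distance to the k-th nearest neighbor (index 0 is the state itself).
    knn_dists = np.sort(dists, axis=-1)[:, k]
    # log(distance + 1) keeps the reward finite for duplicate states.
    return np.log(knn_dists + 1.0)
```

In spirit, maximizing this reward pushes the agent toward a uniform coverage of the state space before any human feedback is collected.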
In practice, PEBBLE collects feedback by showing a human teacher pairs of behavior clips and asking for a binary preference. These preferences are used to train a reward model, which in turn shapes the agent's learning trajectory. Crucially, every time the reward model is updated, PEBBLE relabels the experiences stored in the replay buffer with the new reward estimates. This stabilizes learning, makes sample-efficient off-policy RL practical, and mitigates the adverse effects of a non-stationary reward signal.
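The sketch below illustrates one common way to realize these two ingredients: a Bradley-Terry-style cross-entropy loss over a pair of clips, and a relabeling pass over the replay buffer. The names reward_model, replay_buffer, and the transition fields are hypothetical placeholders, not the paper's code, and details such as soft preference labels are omitted.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, seg_a, seg_b, pref: int):
    """Bradley-Terry-style loss sketch for one pair of behavior clips.

    seg_a, seg_b: tensors of shape (T, obs_dim + act_dim), one row per step.
    pref: 0 if the teacher preferred clip A, 1 if the teacher preferred clip B.
    """
    # Score each clip by summing the predicted per-step rewards.
    score_a = reward_model(seg_a).sum()
    score_b = reward_model(seg_b).sum()
    # Treat the two segment scores as logits of a two-way classification:
    # the preferred clip should receive the higher score.
    logits = torch.stack([score_a, score_b]).unsqueeze(0)   # shape (1, 2)
    target = torch.tensor([pref])                            # shape (1,)
    return F.cross_entropy(logits, target)

def relabel_replay_buffer(reward_model, replay_buffer):
    """After each reward-model update, overwrite the stored rewards so the
    off-policy learner always trains against the current reward estimate.
    The transition fields (obs, action, reward) are assumed placeholders."""
    with torch.no_grad():
        for transition in replay_buffer:
            sa = torch.cat([transition.obs, transition.action])
            transition.reward = reward_model(sa).item()
```

Relabeling is what lets the underlying off-policy learner (SAC in the paper) keep reusing old transitions even as the learned reward changes.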
The empirical evaluations demonstrate PEBBLE's efficacy across a range of tasks from the DeepMind Control Suite and Meta-world. PEBBLE outperforms prior preference-based RL methods such as Preference PPO, particularly in complex settings, training agents more efficiently in terms of both human feedback and environment interactions. Given a limited budget of feedback queries, PEBBLE reaches performance comparable to canonical algorithms like SAC and PPO trained with access to the ground-truth reward.
On simpler tasks, unsupervised pre-training improves both sample and feedback efficiency, highlighting the value of generating diverse experience before feedback sessions begin. The paper also reports experiments with actual human teachers, in which PEBBLE trains agents to perform novel behaviors that are not captured by the predefined task specifications, demonstrating its applicability beyond standard benchmarks.
The paper also discusses broader implications of PEBBLE for real-world applications, particularly in robotics, where designing reward functions can be an arduous task. The algorithm's capacity to elicit meaningful behaviors from limited human input could enable applications ranging from industrial robotics to more complex human-interactive systems. Future work could extend PEBBLE to more intricate real-world environments or incorporate more sophisticated unsupervised pre-training techniques to further improve feedback and sample efficiency, and better human-machine interfaces may ease the practical deployment of interactive RL in diverse operational contexts.