- The paper introduces T-REX, which extrapolates reward functions from ranked suboptimal trajectories to achieve policies that exceed demonstrator performance.
- It trains a neural network reward model with a pairwise cross-entropy (ranking) loss so that higher-ranked trajectories receive higher predicted returns, enabling effective policy optimization in high-dimensional tasks.
- Empirical results on Atari and MuJoCo show T-REX can more than double demonstrator performance, demonstrating robustness even with ranking noise.
Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations
The paper introduces Trajectory-ranked Reward EXtrapolation (T-REX), a novel technique for extrapolating reward functions from ranked demonstrations in inverse reinforcement learning (IRL). Traditional IRL methods often struggle to outperform the demonstrator because they seek a reward function under which the demonstrated behavior is optimal. In contrast, T-REX addresses this limitation by leveraging rankings over trajectories to infer a reward function that can lead to policies better than those demonstrated.
The motivation for T-REX stems from the observation that demonstrators, particularly non-experts or those performing complex tasks, often provide suboptimal demonstrations. In high-dimensional tasks such as Atari games and MuJoCo robotics simulations, designing explicit reward functions or providing optimal demonstrations can be difficult. T-REX therefore uses ranked suboptimal demonstrations to extrapolate the demonstrator's intention and learn a reward function that can guide policies beyond demonstrator performance.
T-REX models the reward function with a neural network trained by supervised learning on ranked trajectory pairs. A cross-entropy loss over the summed predicted rewards of each trajectory in a pair trains the network to assign higher returns to higher-ranked trajectories, effectively separating more optimal from less optimal behavior. The learned reward function is then used with deep reinforcement learning to optimize a policy that maximizes the inferred reward.
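A minimal PyTorch sketch of this pairwise ranking objective is shown below. It assumes feature-vector observations and a small MLP reward model (the paper uses convolutional networks over Atari pixels and subsamples trajectory snippets in practice); names such as `RewardNet` and `trex_pairwise_loss` are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Small MLP mapping an observation to a scalar reward (illustrative architecture)."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def predicted_return(self, traj):
        # traj: (T, obs_dim) tensor of observations; sum per-state rewards over the trajectory.
        return self.net(traj).sum()

def trex_pairwise_loss(reward_net, traj_lo, traj_hi):
    """Cross-entropy (Bradley-Terry style) loss on one ranked pair,
    where traj_hi is ranked strictly better than traj_lo."""
    returns = torch.stack([
        reward_net.predicted_return(traj_lo),
        reward_net.predicted_return(traj_hi),
    ])
    # Label 1 marks the higher-ranked trajectory; softmax is taken over the summed returns.
    label = torch.tensor(1)
    return nn.functional.cross_entropy(returns.unsqueeze(0), label.unsqueeze(0))

# Usage sketch: one gradient step on a single ranked pair (placeholder trajectories).
obs_dim = 8
reward_net = RewardNet(obs_dim)
optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-4)
traj_lo, traj_hi = torch.randn(50, obs_dim), torch.randn(60, obs_dim)
loss = trex_pairwise_loss(reward_net, traj_lo, traj_hi)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After training on many such pairs, the frozen reward network replaces the environment reward, and a standard deep RL algorithm (e.g., PPO) is run against it to obtain the final policy.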
The approach is validated empirically on standard Atari and MuJoCo benchmark tasks. The results show that T-REX consistently learns policies that significantly outperform both the provided demonstrations and state-of-the-art imitation-learning baselines such as Behavioral Cloning from Observations (BCO) and Generative Adversarial Imitation Learning (GAIL). Notably, T-REX policies often achieve more than double the score of the best demonstration on several tasks, suggesting that T-REX successfully extrapolates the demonstrator's intended objective to regions of the state space not covered by the observed demonstrations.
An interesting property of T-REX is its robustness to ranking noise: experiments with artificially induced noise show only modest degradation in performance. Moreover, the method can learn effective policies from automatically inferred rankings based on the temporal order in which demonstrations were generated, without explicit performance labels.
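As an illustration of the time-based ranking idea, the sketch below builds training pairs purely from the order in which demonstrations were produced (later is assumed better) and can flip a fraction of labels to mimic ranking noise. The function `make_ranked_pairs` and its parameters are hypothetical, not the paper's code.

```python
import random

def make_ranked_pairs(demos, num_pairs=1000, noise=0.0, seed=0):
    """demos: list of trajectories ordered by generation time, so demos[j] is
    assumed at least as good as demos[i] when j > i. Returns (worse, better)
    index pairs, with a fraction `noise` of them flipped to simulate noisy rankings."""
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < num_pairs:
        i, j = rng.sample(range(len(demos)), 2)
        lo, hi = (i, j) if i < j else (j, i)  # earlier demo is ranked lower
        if rng.random() < noise:
            lo, hi = hi, lo                   # inject ranking noise
        pairs.append((lo, hi))
    return pairs
```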
Theoretically, this work suggests that ranking information over suboptimal trajectories can reveal the demonstrator's implicit intention, even in the absence of perfect demonstrations. Practically, the technique enables applications in domains with high-dimensional inputs and variably skilled human operators by removing the need for optimal demonstrations and hand-designed reward functions.
Future research could build on these findings by further exploring the theoretical underpinnings of reward extrapolation and its relation to generalization across diverse tasks. Extensions to tasks with more nuanced reward structures, such as multiple objectives or intricate temporal dependencies, are another promising direction. Additionally, integrating T-REX with interactive learning systems that dynamically refine rankings based on feedback could improve adaptability and performance when deploying autonomous agents in the real world.