- The paper introduces T-REX, which extrapolates reward functions from ranked suboptimal trajectories to achieve policies that exceed demonstrator performance.
- It trains a neural network reward model with a pairwise cross-entropy (ranking) loss so that higher-ranked trajectories receive higher predicted returns, enabling effective policy optimization in high-dimensional tasks.
- Empirical results on Atari and MuJoCo show T-REX can more than double demonstrator performance, demonstrating robustness even with ranking noise.
Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations
The paper introduces Trajectory-ranked Reward EXtrapolation (T-REX), a novel technique for extrapolating reward functions from ranked demonstrations in inverse reinforcement learning (IRL). Traditional IRL methods often struggle to outperform the demonstrator because they seek a reward function under which the demonstrated behavior is optimal. In contrast, T-REX addresses this limitation by leveraging rankings over trajectories to infer a reward function that can lead to policies better than those demonstrated.
The motivation for T-REX stems from the observation that demonstrators, particularly non-experts or those performing complex tasks, often provide suboptimal demonstrations. In high-dimensional tasks such as Atari games and MuJoCo robotics simulations, designing explicit reward functions or providing optimal demonstrations can be difficult. T-REX therefore uses ranked suboptimal demonstrations to extrapolate the demonstrator's intention and learn a reward function that can guide policies beyond demonstrator performance.
T-REX models the reward function with a neural network trained by supervised learning on ranked trajectory pairs. A cross-entropy loss over the summed predicted rewards of each trajectory in a pair trains the network to assign higher returns to higher-ranked trajectories, effectively separating more optimal from less optimal behavior. The learned reward function is then used with deep reinforcement learning to optimize a policy that maximizes the inferred reward.
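A minimal PyTorch sketch of this pairwise ranking objective is shown below. It assumes feature-vector observations and a small MLP reward model (the paper uses convolutional networks over Atari pixels and subsamples trajectory snippets in practice); names such as `RewardNet` and `trex_pairwise_loss` are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Small MLP mapping an observation to a scalar reward (illustrative architecture)."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def predicted_return(self, traj):
        # traj: (T, obs_dim) tensor of observations; sum per-state rewards over the trajectory.
        return self.net(traj).sum()

def trex_pairwise_loss(reward_net, traj_lo, traj_hi):
    """Cross-entropy (Bradley-Terry style) loss on one ranked pair,
    where traj_hi is ranked strictly better than traj_lo."""
    returns = torch.stack([
        reward_net.predicted_return(traj_lo),
        reward_net.predicted_return(traj_hi),
    ])
    # Label 1 marks the higher-ranked trajectory; softmax is taken over the summed returns.
    label = torch.tensor(1)
    return nn.functional.cross_entropy(returns.unsqueeze(0), label.unsqueeze(0))

# Usage sketch: one gradient step on a single ranked pair (placeholder trajectories).
obs_dim = 8
reward_net = RewardNet(obs_dim)
optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-4)
traj_lo, traj_hi = torch.randn(50, obs_dim), torch.randn(60, obs_dim)
loss = trex_pairwise_loss(reward_net, traj_lo, traj_hi)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After training on many such pairs, the frozen reward network replaces the environment reward, and a standard deep RL algorithm (e.g., PPO) is run against it to obtain the final policy.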
The approach is validated empirically on standard Atari and MuJoCo benchmark tasks. The results show that T-REX consistently learns policies that significantly outperform both the provided demonstrations and state-of-the-art imitation-learning baselines such as Behavioral Cloning from Observations (BCO) and Generative Adversarial Imitation Learning (GAIL). Notably, T-REX policies often achieve more than double the score of the best demonstration on several tasks, suggesting that T-REX successfully extrapolates the demonstrator's intended objective to regions of the state space not covered by the observed demonstrations.
An interesting property of T-REX is its robustness to ranking noise: experiments with artificially induced noise show only modest degradation in performance. Moreover, the method can learn effective policies from automatically inferred rankings based on the temporal order in which demonstrations were generated, without explicit performance labels.
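As an illustration of the time-based ranking idea, the sketch below builds training pairs purely from the order in which demonstrations were produced (later is assumed better) and can flip a fraction of labels to mimic ranking noise. The function `make_ranked_pairs` and its parameters are hypothetical, not the paper's code.

```python
import random

def make_ranked_pairs(demos, num_pairs=1000, noise=0.0, seed=0):
    """demos: list of trajectories ordered by generation time, so demos[j] is
    assumed at least as good as demos[i] when j > i. Returns (worse, better)
    index pairs, with a fraction `noise` of them flipped to simulate noisy rankings."""
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < num_pairs:
        i, j = rng.sample(range(len(demos)), 2)
        lo, hi = (i, j) if i < j else (j, i)  # earlier demo is ranked lower
        if rng.random() < noise:
            lo, hi = hi, lo                   # inject ranking noise
        pairs.append((lo, hi))
    return pairs
```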
Theoretically, this work suggests that ranking information over suboptimal trajectories can reveal the demonstrator's implicit intention, even in the absence of perfect demonstrations. Practically, the technique enables applications in domains with high-dimensional inputs and variably skilled human operators by removing the need for optimal demonstrations and hand-designed reward functions.
Future research could build on these findings by further exploring the theoretical underpinnings of reward extrapolation and its relation to generalization across diverse tasks. Extensions to tasks with more nuanced reward structures, such as multiple objectives or intricate temporal dependencies, are another promising direction. Additionally, integrating T-REX with interactive learning systems that dynamically refine rankings based on feedback could improve adaptability and performance when deploying autonomous agents in the real world.