Overview of Reinforcement Learning from Imperfect Demonstrations
Reinforcement Learning (RL) is a machine learning approach in which agents learn to make decisions by interacting with an environment. Traditional RL methods often require vast amounts of interaction data before the agent performs well, which is costly when the agent must improve from poor initial behavior. One way to mitigate this is to learn from demonstrations, letting the algorithm 'observe' expert behavior before or during its training process.
However, challenges arise when these demonstrations contain noise or are suboptimal. This calls for algorithms that can handle imperfect demonstrations effectively.
Normalized Actor-Critic (NAC)
The Normalized Actor-Critic (NAC) method was developed to address the shortcomings of existing approaches that combine supervised learning on demonstration data with RL to further improve from environmental feedback. Such methods typically optimize two disparate losses and can be particularly sensitive to noisy demonstrations.
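To make "disparate losses" concrete, here is a minimal PyTorch-style sketch of the kind of combined objective used by prior methods such as DQfD: a large-margin supervised term on demonstration data added to a one-step TD term on environment transitions. The function and batch-key names are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def combined_demo_rl_loss(q_net, target_q_net, demo_batch, env_batch,
                          gamma=0.99, margin=0.8, lam=1.0):
    # Supervised large-margin term on demonstration data: push the
    # demonstrated action's Q-value above all others by at least `margin`.
    s_d, a_d = demo_batch["states"], demo_batch["actions"]   # a_d: long tensor (B,)
    q_d = q_net(s_d)                                          # (B, num_actions)
    margins = torch.full_like(q_d, margin)
    margins.scatter_(1, a_d.unsqueeze(1), 0.0)                # no margin for the demo action
    supervised = (torch.max(q_d + margins, dim=1).values
                  - q_d.gather(1, a_d.unsqueeze(1)).squeeze(1)).mean()

    # One-step TD term on environment transitions (standard Q-learning target).
    s, a, r, s2, done = (env_batch[k] for k in
                         ("states", "actions", "rewards", "next_states", "dones"))
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_q_net(s2).max(dim=1).values
    td_loss = F.mse_loss(q_sa, target)

    # Two separate objectives glued together with a weighting coefficient.
    return td_loss + lam * supervised
```

The supervised term blindly prefers whatever action the demonstrator took, which is exactly where noisy or suboptimal demonstrations become a problem.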
NAC, by contrast, is a unified reinforcement learning algorithm that normalizes the Q-function, reducing the Q-values of actions unseen in the demonstration data. It learns an initial policy network from demonstrations and then refines that policy in the environment, aiming to surpass the demonstrator's performance. Notably, both learning from demonstrations and interactive refinement use the same objective, unlike prior methods that rely on distinct supervised and RL losses.
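One way to picture this normalization, assuming the entropy-regularized (soft Q-learning) setting the method builds on, is through the soft state value V(s) = alpha * log sum_a exp(Q(s,a)/alpha) and the implied policy pi(a|s) = exp((Q(s,a) - V(s)) / alpha). Because V(s) is subtracted from every Q-value, only relative preferences matter: pushing up the Q-value of a demonstrated action automatically pushes down the normalized preference for actions never seen in the data. The NumPy toy example below illustrates this effect on a single state; it is not the paper's implementation.

```python
import numpy as np

def soft_value(q, alpha=1.0):
    """Soft state value V(s) = alpha * logsumexp(Q(s, .) / alpha), computed stably."""
    q = np.asarray(q, dtype=np.float64)
    m = q.max()
    return m + alpha * np.log(np.sum(np.exp((q - m) / alpha)))

def normalized_policy(q, alpha=1.0):
    """Policy implied by a Q-function: pi(a|s) = exp((Q(s,a) - V(s)) / alpha)."""
    q = np.asarray(q, dtype=np.float64)
    return np.exp((q - soft_value(q, alpha)) / alpha)

# A single toy state with three actions, where only action 0 appears in the demos.
q_before = np.array([1.0, 1.0, 1.0])
q_after = np.array([2.0, 1.0, 1.0])    # demonstrated action's Q-value is pushed up

print(normalized_policy(q_before))  # [0.333 0.333 0.333] -> uniform preferences
print(normalized_policy(q_after))   # [0.576 0.212 0.212] -> unseen actions are
                                    # implicitly down-weighted without touching their Q
```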
Uniqueness of NAC
The unified nature of NAC makes it robust to suboptimal demonstration data. Because it is not required to mimic every example in the dataset, it is less sensitive to demonstration quality than competing methods. Moreover, NAC can learn from demonstrations whose state, action, and reward signals are imperfect or even adversarial, since it makes no assumption about the optimality of the data, a significant step forward for learning from demonstrations.
Experimental Results
The researchers validated the proposed method in several environments, including a toy Minecraft game and two realistic 3D driving simulators, Torcs and Grand Theft Auto V. The findings are compelling: the unified reinforcement learning algorithm learned robustly and outperformed existing baselines in evaluation.
The key contributions of this paper are the NAC method itself and its demonstrated practical advantages across these environments. NAC uses a single unified objective that handles both demonstrations and environmental interaction, without an explicit supervised imitation loss, and it is shown to be robust to noisy demonstrations, setting it apart from methods that lean heavily on supervised learning. A rough sketch of this training flow follows.
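The sketch below applies one and the same update rule to batches drawn from the demonstration set and, later, to batches drawn from the agent's own replay buffer. All names and interfaces (`agent.update`, `demo_buffer.sample`, an `env.step` that returns `(next_state, reward, done)`) are hypothetical placeholders, not the paper's API.

```python
def train(agent, demo_buffer, replay_buffer, env,
          pretrain_steps=50_000, interact_steps=500_000, batch_size=64):
    """Unified training sketch: the same objective is optimized in both phases."""
    # Phase 1: learn an initial policy purely from (possibly imperfect) demonstrations.
    for _ in range(pretrain_steps):
        agent.update(demo_buffer.sample(batch_size))        # unified objective

    # Phase 2: refine the same policy through environment interaction.
    state = env.reset()
    for _ in range(interact_steps):
        action = agent.act(state)
        next_state, reward, done = env.step(action)         # assumed step convention
        replay_buffer.add(state, action, reward, next_state, done)
        agent.update(replay_buffer.sample(batch_size))       # same objective, no imitation loss
        state = env.reset() if done else next_state
```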
Conclusion
In summary, the Normalized Actor-Critic method introduced in this paper offers an innovative solution to learning from imperfect demonstrations, a setting that challenges traditional RL approaches. It bridges the gap between demonstration data and environmental interaction and achieves robust performance even when the input data are suboptimal or noisy. This positions NAC as a practical tool for real-world applications where less-than-perfect data is a given.