Improving Policy Gradient by Exploring Under-appreciated Rewards (1611.09321v3)

Published 28 Nov 2016 in cs.LG and cs.AI

Abstract: This paper presents a novel form of policy gradient for model-free reinforcement learning (RL) with improved exploration properties. Current policy-based methods use entropy regularization to encourage undirected exploration of the reward landscape, which is ineffective in high dimensional spaces with sparse rewards. We propose a more directed exploration strategy that promotes exploration of under-appreciated reward regions. An action sequence is considered under-appreciated if its log-probability under the current policy under-estimates its resulting reward. The proposed exploration strategy is easy to implement, requiring small modifications to an implementation of the REINFORCE algorithm. We evaluate the approach on a set of algorithmic tasks that have long challenged RL methods. Our approach reduces hyper-parameter sensitivity and demonstrates significant improvements over baseline methods. Our algorithm successfully solves a benchmark multi-digit addition task and generalizes to long sequences. This is, to our knowledge, the first time that a pure RL method has solved addition using only reward feedback.
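
The exploration strategy described in the abstract (often referred to as UREX) can be viewed as a small change to the per-sample weights in a REINFORCE update: in addition to the usual centered-reward weight, each sampled action sequence receives a self-normalized importance weight that grows when its log-probability under-estimates its scaled reward. The sketch below is illustrative only; the function name, the mean-reward baseline, and the temperature value are assumptions for this example, not details taken from the paper.

```python
import numpy as np

def urex_sample_weights(rewards, log_probs, tau=0.1):
    """Sketch of UREX-style per-sample weights (illustrative, not the paper's code).

    Each of the K sampled action sequences contributes
        w_k * grad log pi(a_k)
    to the policy-gradient estimate. w_k mixes the usual REINFORCE term
    with a self-normalized weight that is large when log pi(a_k)
    under-estimates the scaled reward r(a_k) / tau.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    log_probs = np.asarray(log_probs, dtype=np.float64)
    k = len(rewards)

    # REINFORCE part: rewards centered by a baseline (mean reward here,
    # an assumed choice), averaged over the K samples.
    baseline = rewards.mean()
    reinforce_w = (rewards - baseline) / k

    # Under-appreciated-reward part: softmax over r/tau - log pi across
    # the K samples, so sequences whose log-probability lags behind their
    # reward receive extra weight.
    scores = rewards / tau - log_probs
    scores -= scores.max()  # for numerical stability
    urex_w = tau * np.exp(scores) / np.exp(scores).sum()

    return reinforce_w + urex_w

# Hypothetical usage: three sampled sequences with their rewards and
# current log-probabilities under the policy.
weights = urex_sample_weights(rewards=[1.0, 0.0, 0.2],
                              log_probs=[-5.0, -1.0, -2.0])
```

In this toy example the first sequence, whose reward is high relative to its log-probability, receives the largest combined weight, which is the directed-exploration effect the abstract describes.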

Authors (3)
  1. Ofir Nachum (64 papers)
  2. Mohammad Norouzi (81 papers)
  3. Dale Schuurmans (112 papers)
Citations (40)
