An Analysis of "On Learning Intrinsic Rewards for Policy Gradient Methods"
In "On Learning Intrinsic Rewards for Policy Gradient Methods," the authors address a fundamental problem in reinforcement learning (RL): the difficulty of crafting reward functions that drive efficient learning. The difficulty is particularly pronounced when a useful reward signal is not inherent to the task or environment, forcing a design process that is often iterative and error-prone. The authors build on the Optimal Rewards Framework of Singh et al., which proposes intrinsically motivated rewards that complement an agent's extrinsic objective.
An RL agent's performance depends heavily on the reward signals it learns from. Standard practice relies on extrinsic rewards, the primary task-defined signals, but these are often sparse or difficult to specify in complex tasks, leading to inefficient learning. To address this challenge, the authors introduce a stochastic-gradient-based algorithm that learns intrinsic rewards which, when combined with the extrinsic rewards, improve the performance of policy-gradient RL methods.
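Concretely, the approach can be summarized as two coupled objectives. The notation below, including the mixing coefficient λ and the parameter names θ and η, is shorthand introduced here for exposition rather than the paper's exact formulation:

```latex
% Policy parameters \theta are trained on the mixed return, while
% intrinsic-reward parameters \eta are trained on the extrinsic return alone.
\begin{aligned}
\max_{\theta}\;\; & J^{\mathrm{ex+in}}(\theta)
  = \mathbb{E}_{\pi_{\theta}}\Big[\sum_{t} \gamma^{t}
    \big(r^{\mathrm{ex}}_{t} + \lambda\, r^{\mathrm{in}}_{\eta}(s_{t}, a_{t})\big)\Big], \\
\max_{\eta}\;\; & J^{\mathrm{ex}}(\eta)
  = \mathbb{E}_{\pi_{\theta'(\eta)}}\Big[\sum_{t} \gamma^{t}\, r^{\mathrm{ex}}_{t}\Big],
\end{aligned}
```

where θ′(η) denotes the policy parameters after an update that uses the combined reward, so the intrinsic reward is judged only by how much it improves extrinsic performance.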
Core Contributions and Algorithmic Innovations
The central contribution is the Learning Intrinsic Rewards for Policy Gradient (LIRPG) algorithm, which integrates intrinsic reward learning into policy gradient methods, namely A2C and PPO. LIRPG learns intrinsic rewards whose combination with the extrinsic rewards improves policy learning in the given environment: it updates the policy parameters to maximize the sum of intrinsic and extrinsic rewards, while concurrently updating the intrinsic reward parameters to maximize the extrinsic rewards alone, by differentiating extrinsic performance through the policy update itself. This decoupling ensures that the learned intrinsic rewards serve the core task rather than distract from it.
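To make this bilevel update concrete, below is a minimal NumPy sketch of the same structure on a toy three-armed bandit. It is not the authors' implementation, which builds on A2C and PPO with neural-network policies and a learned state-action intrinsic reward; the per-arm intrinsic parameters `eta`, the scaling `lam`, and the learning rates are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-armed bandit: only arm 2 yields extrinsic reward.
N_ARMS = 3
def extrinsic_reward(arm):
    return 1.0 if arm == 2 else 0.0

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

theta = np.zeros(N_ARMS)   # policy logits (policy parameters)
eta = np.zeros(N_ARMS)     # per-arm intrinsic reward (intrinsic reward parameters)
alpha, beta, lam = 0.5, 0.5, 1.0   # policy lr, intrinsic lr, intrinsic scaling

for step in range(2000):
    # --- Policy update: REINFORCE-style gradient on extrinsic + intrinsic reward ---
    pi = softmax(theta)
    a = rng.choice(N_ARMS, p=pi)
    score = np.eye(N_ARMS)[a] - pi                   # grad_theta log pi(a)
    g_theta = (extrinsic_reward(a) + lam * eta[a]) * score
    theta_new = theta + alpha * g_theta

    # --- Intrinsic reward update: chain rule through the policy update ---
    # Only the intrinsic term of g_theta depends on eta, so
    # d theta_new / d eta[k] = alpha * lam * 1[a == k] * score.
    pi_new = softmax(theta_new)
    a2 = rng.choice(N_ARMS, p=pi_new)
    score_new = np.eye(N_ARMS)[a2] - pi_new          # grad_{theta_new} log pi'(a2)
    g_ex = extrinsic_reward(a2) * score_new          # extrinsic gradient at theta_new
    eta_grad = np.zeros(N_ARMS)
    eta_grad[a] = alpha * lam * (g_ex @ score)       # meta-gradient w.r.t. eta
    eta += beta * eta_grad

    theta = theta_new

print("policy:", np.round(softmax(theta), 3))
print("intrinsic rewards:", np.round(eta, 3))
```

The key step is the chain rule in the second half of the loop: the intrinsic parameters receive credit only insofar as the policy update they induced increases the extrinsic return, which is exactly the decoupling described above.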
Experimental Evaluation and Results
In empirical evaluations, LIRPG is tested across several domains, including Atari video games and MuJoCo control tasks, and shows notable improvements over baseline agents that rely on extrinsic rewards alone. In 9 out of 15 Atari games, LIRPG-augmented agents improved by more than 10% over standard A2C agents. In the MuJoCo domains, where extrinsic rewards were made sparse by delaying feedback, LIRPG significantly outperformed the PPO baseline in 4 out of 5 tasks. These results illustrate the potential of LIRPG to accelerate learning and reach higher task performance, particularly in environments with delayed or sparse reward structures.
Theoretical and Practical Implications
The research expands the theoretical landscape by illustrating how learned intrinsic rewards can mitigate the practical limitations of bounded RL agents, such as restricted computational and representational capacity. By learning auxiliary rewards that compensate for these limitations, intrinsically motivated RL offers a promising path toward more efficient and scalable RL systems.
Future Research Directions
The paper leaves significant room for future work, such as further robustness tests of the LIRPG algorithm across additional RL environments, particularly transfer learning setups where intrinsic rewards may need to be re-optimized rapidly for new tasks. Another promising avenue is multi-agent settings, where agents' interactions may call for more complex intrinsic reward structures.
In sum, "On Learning Intrinsic Rewards for Policy Gradient Methods" provides a substantive step forward in the design and use of intrinsic rewards in RL, offering a solid foundation for future research on the reward-design challenges posed by complex, real-world environments.