Analysis of "What Can Learned Intrinsic Rewards Capture?"
This paper investigates a shift in reinforcement learning (RL) from fixed, hand-specified reward functions to learned intrinsic rewards. It proposes that the reward function itself can serve as a locus of learned knowledge, adaptable across multiple environments and agent types, in contrast to the conventional view of rewards as immutable and predetermined.
Summary of Key Contributions
- Intrinsic Reward Learning Framework: The authors introduce a scalable meta-gradient framework that learns an intrinsic reward function across multiple agent lifetimes. This allows the reward function to capture both exploratory and exploitative knowledge about the environment's dynamics and task structure, rather than only immediate task feedback.
- Meta-Gradient Descent Implementation: They develop a gradient-based approach that improves upon the exhaustive search historically used to identify optimal reward functions. The intrinsic reward is parameterized by a recurrent neural network conditioned on the agent's lifetime history, allowing it to capture complex dependencies that span an agent's entire lifetime (a minimal sketch of this setup appears after this list).
- Exploration and Exploitation Balance: Through a series of controlled experiments, the paper showcases the ability of learned intrinsic rewards to balance long-term exploration (such as seeking novel states) and effective exploitation (like capitalizing on known rewarding states) across episodes, something not typically possible with static reward structures.
- Generalization Across Agents and Environments: A noteworthy contribution is the demonstration that the learned rewards can transfer across different types of learning agents, including agents with different action semantics or different learning algorithms. This adaptability contrasts with policy transfer methods, which typically encode "how" behaviors should be executed and do not adapt well when the agent-environment interaction changes.
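To make the framework above concrete, the following is a minimal sketch (not the authors' released code) of how such a meta-gradient setup could look, assuming a PyTorch implementation: an RNN-based intrinsic reward module is trained across lifetimes so that a policy updated with its rewards earns a high extrinsic return. The environment stand-in, network sizes, learning rates, and loop lengths are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a meta-gradient intrinsic-reward setup, assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_STATES, N_ACTIONS, HIDDEN = 16, 4, 32       # illustrative sizes

class IntrinsicReward(nn.Module):
    """Maps the lifetime history (state, action, extrinsic reward) to a scalar
    intrinsic reward; the GRU hidden state carries information across episodes."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(N_STATES + N_ACTIONS + 1, HIDDEN)
        self.head = nn.Linear(HIDDEN, 1)

    def forward(self, s, a, ext_r, h):
        x = torch.cat([F.one_hot(s, N_STATES).float(),
                       F.one_hot(a, N_ACTIONS).float(),
                       ext_r.view(1)]).unsqueeze(0)
        h = self.rnn(x, h)
        return self.head(h).squeeze(), h

def env_step(s, a):
    """Toy stand-in for an environment: random transitions, sparse reward."""
    s_next = torch.randint(N_STATES, ())
    ext = torch.tensor(1.0) if s_next.item() == N_STATES - 1 else torch.tensor(0.0)
    return s_next, ext

def returns(r, gamma=0.99):
    """Discounted return-to-go for a 1-D reward sequence."""
    g, out = torch.zeros(()), []
    for x in r.flip(0):
        g = x + gamma * g
        out.append(g)
    return torch.stack(out).flip(0)

def rollout(theta, reward_net, h, steps=64):
    """One episode with a linear-softmax policy; returns log-probs and rewards."""
    s = torch.tensor(0)
    logps, r_in, r_ex = [], [], []
    for _ in range(steps):
        logits = F.one_hot(s, N_STATES).float() @ theta
        a = torch.distributions.Categorical(logits=logits).sample()
        logps.append(torch.log_softmax(logits, -1)[a])
        s_next, ext = env_step(s, a)
        ri, h = reward_net(s, a, ext, h)
        r_in.append(ri)
        r_ex.append(ext)
        s = s_next
    return torch.stack(logps), torch.stack(r_in), torch.stack(r_ex), h

reward_net = IntrinsicReward()
meta_opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)
alpha = 0.1                                   # inner-loop (policy) step size

for lifetime in range(200):                   # outer loop: a new lifetime each time
    theta = torch.zeros(N_STATES, N_ACTIONS, requires_grad=True)  # fresh policy
    h = torch.zeros(1, HIDDEN)                # fresh lifetime history
    all_logps, all_ext = [], []
    for episode in range(5):                  # inner loop: episodes within a lifetime
        logps, r_in, r_ex, h = rollout(theta, reward_net, h)
        # Inner update: policy gradient on *intrinsic* returns, kept differentiable
        # (create_graph=True) so gradients can later flow into the reward module.
        inner_loss = -(logps * returns(r_in)).sum()
        grad_theta = torch.autograd.grad(inner_loss, theta, create_graph=True)[0]
        theta = theta - alpha * grad_theta
        all_logps.append(logps)
        all_ext.append(r_ex)
    # Outer (meta) objective: REINFORCE-style loss crediting the evolving policy's
    # log-probs with the *extrinsic* return-to-go over the whole lifetime.
    life_logps = torch.cat(all_logps)
    life_returns = returns(torch.cat(all_ext))
    meta_loss = -(life_logps * life_returns).sum()
    meta_opt.zero_grad()
    meta_loss.backward()                      # reaches reward_net via the inner updates
    meta_opt.step()
```

The two mechanics that matter here are the differentiable inner update (`create_graph=True`) and the outer loss measured against the extrinsic lifetime return, which is the path by which gradients reach the intrinsic reward parameters.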
Detailed Analysis of Numerical Results
The experiments, conducted in various grid-world domains, compared the efficacy of learned intrinsic rewards against conventional extrinsic-reward baselines. Across the reported test scenarios, agents trained with learned intrinsic rewards achieved stronger long-term task performance, particularly in domains requiring sustained exploration across episodes or adaptation to non-stationary conditions (e.g., changing goal locations or reward structures).
- Lifetime Return as the Objective: The research optimizes the intrinsic reward against the lifetime return rather than the episodic return. This choice proved crucial for capturing behavior that spans episode boundaries, including exploratory actions that are penalized within an episode but beneficial over the lifetime (see the toy example after this list).
- Generalization Capabilities: Remarkably, the learned intrinsic rewards generalize "what" knowledge across different learning agents and algorithms. This is a significant finding, contrasting with conventional knowledge transfer methods such as RL² and MAML, which are less robust to changes in action-space semantics or in the learning algorithm itself.
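As a small illustration of the lifetime-versus-episodic point above, the toy calculation below contrasts the two objectives; the reward values and discount factor are invented for illustration and do not come from the paper's experiments.

```python
# Toy illustration: episodic return resets at episode boundaries, while the
# lifetime return credits a time step with everything that follows in the
# agent's lifetime. Rewards and gamma are made up for illustration.
GAMMA = 0.9

def discounted(rewards, gamma=GAMMA):
    """Discounted return-to-go for one reward sequence."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

# Two consecutive episodes in one lifetime: the agent pays an exploration cost
# (-1) at the end of episode 1, which reveals a goal it exploits in episode 2.
episode_1 = [0.0, 0.0, -1.0]
episode_2 = [0.0, 10.0]

episodic = discounted(episode_1) + discounted(episode_2)   # resets at the boundary
lifetime = discounted(episode_1 + episode_2)               # spans the boundary

print("episodic return-to-go:", [round(g, 2) for g in episodic])
print("lifetime return-to-go:", [round(g, 2) for g in lifetime])
# The exploratory step (-1) has value -1.0 under the episodic objective but a
# positive value (7.1) under the lifetime objective, so a reward learned against
# the lifetime return can encourage it.
```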
Theoretical and Practical Implications
Theoretically, this work suggests a paradigm shift in RL, illustrating that capturing task knowledge at the level of the reward function can be more versatile than binding it into policies. Practically, it offers a framework applicable to a wide array of RL problems in which environment dynamics vary or change over time, highlighting the potential for intrinsic rewards to be tuned automatically, without human reward engineering.
Future Directions
The paper points to several avenues for future research. The scalability of the proposed framework could be refined further, perhaps by integrating alternative meta-learning architectures or by applying the approach to larger and more dynamic environments. Moreover, understanding the interplay between the various loci of knowledge within an RL agent (policies, value functions, and reward functions) presents an intriguing challenge for developing more sophisticated and adaptable RL systems.
In conclusion, this paper provides a comprehensive exploration of intrinsic rewards, revealing their capacity to encapsulate rich, adaptable knowledge within RL frameworks and challenging traditional views that primarily leverage fixed external rewards for agent training.