Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation
The paper "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation" by Kulkarni et al. introduces hierarchical-DQN (h-DQN), a novel framework that synergizes hierarchical value functions with intrinsically motivated deep reinforcement learning. This work addresses the significant challenge of learning goal-directed behavior in environments characterized by sparse feedback. Standard RL approaches often falter in such scenarios due to insufficient exploration, leading to suboptimal policies and poor value function estimates.
Overview of h-DQN
The h-DQN framework integrates two components: hierarchical value functions operating over different temporal scales and intrinsic motivation that encourages robust exploratory behavior. The hierarchy consists of a meta-controller and a controller operating at coarse and fine temporal resolutions, respectively: the meta-controller selects high-level goals, the controller executes atomic actions to achieve them, and each level learns its own value function.
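To make this two-level structure concrete, the following is a minimal sketch of the two value networks in PyTorch. It is an illustrative approximation rather than the authors' architecture: in the paper the Atari agent uses convolutional DQNs over game frames, whereas the MLPs, hidden sizes, and vector-valued state/goal encodings below are simplifying assumptions.

```python
import torch
import torch.nn as nn

class MetaController(nn.Module):
    """Q2(s, g): scores candidate goals given the current state."""
    def __init__(self, state_dim: int, num_goals: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_goals),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # one Q-value per candidate goal

class Controller(nn.Module):
    """Q1(s, g, a): scores primitive actions given the state and the active goal."""
    def __init__(self, state_dim: int, goal_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, goal], dim=-1))  # one Q-value per action
```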
Methodology
The hierarchical decomposition of the value function enables better exploration through intrinsic goals. These goals are parameterized over entities and their relations in the environment, which mitigates the sparse-feedback problem: the agent can construct and pursue subgoals even when extrinsic reward is rare, improving data efficiency and policy learning.
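As an illustration of parameterizing goals over entities and relations, the sketch below implements an internal critic that pays an intrinsic reward when an assumed "agent reaches entity" relation holds. The Goal dataclass, entity names, and proximity threshold are hypothetical choices made for this example, not the paper's exact definitions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Goal:
    entity: str          # e.g. "key", "door", "ladder" (illustrative names)
    relation: str = "reaches"

def intrinsic_reward(agent_pos, entity_positions, goal: Goal, threshold: float = 1.0) -> float:
    """Return 1.0 when the goal relation is satisfied, else 0.0 (binary internal critic)."""
    gx, gy = entity_positions[goal.entity]
    ax, ay = agent_pos
    reached = abs(ax - gx) + abs(ay - gy) <= threshold  # simple proximity check
    return 1.0 if reached else 0.0
```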
Controller and Meta-Controller
- Controller: Operates at the finer temporal scale, selecting atomic actions based on the current state and the active goal. It learns to satisfy goals from intrinsic rewards, optimizing its value function Q1 with deep Q-learning.
- Meta-Controller: Operates at the coarser temporal scale, selecting goals based on the current state. It receives extrinsic rewards from the environment and optimizes its value function Q2 (schematic update targets for both levels are sketched just after this list).
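The following schematic writes out the two temporal-difference targets under standard Q-learning conventions: the controller's target is driven by the per-step intrinsic reward for a fixed goal, while the meta-controller's target uses the extrinsic reward accumulated over the whole period during which its chosen goal was active. The function names and discount value are illustrative, not taken from the paper's code.

```python
def controller_target(r_intrinsic, q1_next_max, gamma=0.99, done=False):
    """y1 = r_int + gamma * max_a Q1(s', g, a), with the goal g held fixed."""
    return r_intrinsic + (0.0 if done else gamma * q1_next_max)

def meta_controller_target(extrinsic_rewards, q2_next_max, gamma=0.99, done=False):
    """y2 = (sum of extrinsic rewards collected while goal g was active)
            + gamma * max_g' Q2(s', g')."""
    return sum(extrinsic_rewards) + (0.0 if done else gamma * q2_next_max)
```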
Training proceeds in two phases: the controller is first pre-trained on intrinsic rewards so that it learns to achieve individual goals, and then the controller and meta-controller are trained jointly to refine the overall policy.
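A hedged sketch of that schedule is shown below. The callables for sampling goals and running episodes, as well as the episode-count defaults, are placeholders standing in for the Q-learning updates and epsilon-greedy goal/action selection described in the paper; they are not part of the paper's published code.

```python
def train(sample_random_goal, run_controller_episode, run_joint_episode,
          pretrain_episodes=1000, joint_episodes=10000):
    """Two-phase schedule: controller pre-training, then joint training."""
    # Phase 1: the controller alone learns to reach randomly sampled goals,
    # driven purely by intrinsic rewards from the internal critic.
    for _ in range(pretrain_episodes):
        run_controller_episode(goal=sample_random_goal())

    # Phase 2: the meta-controller begins choosing goals; both levels are
    # updated from the same experience, the controller on intrinsic rewards
    # and the meta-controller on accumulated extrinsic rewards.
    for _ in range(joint_episodes):
        run_joint_episode()
```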
Numerical Results and Findings
The paper demonstrates the efficacy of h-DQN through experiments in two domains: a discrete stochastic decision process and the ATARI game Montezuma's Revenge.
- Discrete Stochastic Decision Process: Here, h-DQN significantly outperforms standard Q-learning, exploring the state space efficiently and consistently achieving higher average rewards. This indicates that intrinsic goals drive better exploration, which is critical in decision processes where the reward depends on the history of states visited (a toy sketch of such a history-dependent reward follows this list).
- Montezuma's Revenge: In this high-dimensional, sparse-reward environment, the h-DQN agent learns to pursue long-horizon objectives such as picking up the key and opening the door, behaviors that a baseline DQN fails to discover. The meta-controller's choice of meaningful intrinsic goals guides exploration in this challenging setting.
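The toy chain environment below (with made-up transition probabilities and rewards, only loosely modeled on the paper's description) illustrates why the payoff depends on the visit history: the terminal reward is large only if the far end of the chain was visited before returning to the start state.

```python
import random

class ChainEnv:
    """Toy chain of states 1..n; episode ends at state 1."""
    def __init__(self, n_states=6, big_reward=1.0, small_reward=0.01):
        self.n, self.big, self.small = n_states, big_reward, small_reward

    def reset(self):
        self.state, self.visited_end = 2, False    # start near the left end
        return self.state

    def step(self, action):                        # 0 = left, 1 = right
        if action == 1 and random.random() < 0.5:  # "right" succeeds with prob 0.5
            self.state = min(self.state + 1, self.n)
        else:
            self.state = max(self.state - 1, 1)
        self.visited_end |= (self.state == self.n)
        done = (self.state == 1)
        reward = (self.big if self.visited_end else self.small) if done else 0.0
        return self.state, reward, done
```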
Implications and Future Directions
The hierarchical structure of h-DQN offers a principled way to handle tasks requiring long-range planning and helps address the exploration problem in sparse-reward settings. The approach not only improves the agent's ability to learn complex tasks but also shows the value of combining deep RL with cognitive and developmental insights, such as parameterizing goals over entities and relations.
Future work could enhance h-DQN by incorporating the following:
- Object Detection and Representation: Integration of advanced object detection and representation techniques directly from raw pixels to automate the identification of entities and relations used in goal specification.
- Memory Augmentation: Incorporation of recurrent neural networks and episodic memory modules to handle non-Markovian settings and longer-range dependencies.
- Generalization to Other Domains: Extending the framework to other RL environments and tasks, potentially involving more sophisticated forms of intrinsic motivation and hierarchical decomposition.
The h-DQN framework, with its hierarchical and intrinsically motivated approach, presents a significant advancement in the field of deep reinforcement learning, paving the way for more sophisticated AI agents capable of tackling complex real-world challenges.
Conclusion
Kulkarni et al.'s contribution with h-DQN represents a substantial step forward in reinforcement learning by effectively integrating hierarchical temporal abstraction and intrinsic motivation, offering a promising direction for future research and applications in AI.