- The paper's main contribution is a novel framework that uses latent space policies to enable unconstrained training of hierarchical RL layers.
- It employs a maximum entropy objective with latent random variables to promote diverse strategies and enhanced exploration.
- Experimental results demonstrate the method's effectiveness, achieving state-of-the-art performance on continuous control benchmarks, including a humanoid robot with 21 actuators.
Latent Space Policies for Hierarchical Reinforcement Learning
In the paper "Latent Space Policies for Hierarchical Reinforcement Learning," the authors Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine introduce a novel framework for building hierarchical reinforcement learning (RL) policies from latent space policies (LSPs). This approach contrasts with prior methods, which typically force hierarchical behavior to emerge by constraining lower layers to respond to modulating signals from higher layers. Instead, the proposed framework trains each layer separately to tackle the given task via a maximum entropy reinforcement learning objective, while encouraging the acquisition of diverse strategies.
The proposed framework employs latent random variables within each layer of a deep neural network, sampled during training from a prior distribution. By maintaining an unconstrained, invertible mapping from latent variables to actions, higher layers retain full expressivity without being bottlenecked by lower-level limitations. These latent random variables become part of the policy itself, allowing higher layers to directly influence the behavior of lower layers through this latent space. The novelty of this work lies in stacking layers so that each layer's latent space serves as the action space of the layer above. This permits each layer to attempt the task on its own, while higher layers can refine the behavior whenever the lower layer cannot solve the task fully.
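To make the stacking concrete, here is a minimal sketch, not the authors' code: each layer is a state-conditioned invertible map from a latent input to an action, so the layer above steers the layer below simply by choosing that latent. The paper uses real NVP-style bijective transforms; the simple conditional affine map, class name, and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConditionalAffineLayer(nn.Module):
    """Invertible map a = h * exp(log_scale(s)) + shift(s), conditioned on state s.
    (Hypothetical simplification of the paper's real NVP-style bijector.)"""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * action_dim),  # outputs [log_scale, shift]
        )

    def forward(self, state, latent):
        log_scale, shift = self.net(state).chunk(2, dim=-1)
        return latent * torch.exp(log_scale) + shift

    def inverse(self, state, action):
        log_scale, shift = self.net(state).chunk(2, dim=-1)
        return (action - shift) * torch.exp(-log_scale)

# Stacking: the latent input of the lower layer is the action space of the layer above.
state_dim, action_dim = 10, 4
lower = ConditionalAffineLayer(state_dim, action_dim)
upper = ConditionalAffineLayer(state_dim, action_dim)

state = torch.randn(1, state_dim)
z = torch.randn(1, action_dim)   # sampled from the top-level prior
h = upper(state, z)              # higher layer emits a latent ...
a = lower(state, h)              # ... which the lower layer maps to an action
```

Because every map is invertible, any action remains reachable from some latent, which is what keeps the higher layers fully expressive.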
Through experimental results, the paper demonstrates robust improvements over single-layer policies on traditional continuous control benchmark tasks. Notably, it shows that additional hierarchical layers can resolve complex sparse-reward tasks by assigning simplified objectives to the lower layers. For instance, a humanoid robot with 21 actuators achieved state-of-the-art results across several environments, illustrating gains in both sample efficiency and the complexity of tasks that can be solved.
The methodology builds upon a probabilistic graphical model view of reinforcement learning, embedded within a maximum entropy framework to ensure richer exploration and diversity. The approach balances reward maximization against the entropy of the stochastic policy, avoiding the limitations of earlier approaches that induced hierarchy only by imposing constraints on the lower layers.
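Concretely, the maximum entropy objective underlying this view augments the expected return with the policy's entropy at every visited state, with a temperature coefficient $\alpha$ governing the trade-off (written here in its standard form, which may differ slightly in notation from the paper):

$$\pi^{*} = \arg\max_{\pi} \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}} \Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]$$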
Critically, the proposed hierarchical design addresses a long-standing challenge in hierarchical policy construction: automating the hierarchy-building process without compromising task effectiveness or generality. The authors assert that their method constructs operationally effective hierarchies "bottom-up," with training proceeding iteratively, one latent space policy at a time. Each latent variable policy remains unconstrained, so it can attempt to solve the task directly and is then further refined by the policy layers trained on top of it, as sketched below.
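A minimal sketch of that bottom-up schedule, under the assumption that each layer is trained with an off-the-shelf maximum entropy RL algorithm such as soft actor-critic; the function and parameter names (`wrap_env`, `train_layer`, and so on) are illustrative stand-ins, not the authors' API:

```python
def train_hierarchy(base_env, layer_builders, reward_fns, wrap_env, train_layer):
    """Bottom-up schedule: train one latent space policy at a time.

    layer_builders : zero-argument callables, each returning a fresh policy layer
    reward_fns     : per-layer reward functions (lower layers may use simpler,
                     shaped rewards; the top layer uses the true task reward)
    wrap_env       : callable (env, frozen_layers, reward_fn) -> environment whose
                     action space is the latent space of the topmost frozen layer
    train_layer    : callable (policy, env) running max-entropy RL (e.g. SAC)
    """
    frozen = []
    for build_layer, reward_fn in zip(layer_builders, reward_fns):
        layer = build_layer()
        env = wrap_env(base_env, frozen, reward_fn)  # actions route through frozen layers
        train_layer(layer, env)                      # ordinary single-layer training
        frozen.append(layer)                         # freeze; expose its latent space upward
    return frozen                                    # full hierarchy, bottom to top
```

Each iteration is just standard single-layer RL on a re-parameterized action space, which is why no layer needs hand-designed constraints to participate in the hierarchy.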
This paper's contribution is multifaceted. It provides a stable algorithm for integrating latent variable training into reinforcement learning, supporting both layer-by-layer (stagewise) and end-to-end training. Structurally, all hierarchy layers share a uniform policy parameterization, permitting uniformity in training mechanisms and expressivity across layers. From a practical RL standpoint, this approach alleviates the manual burden of hierarchy design commonly required in earlier frameworks.
By hypothesizing that specialized sub-policies can expedite learning for complex decision-making, the authors provide an analytical and empirical foundation for the advancement of multi-layered reinforcement learning frameworks. Looking forward, the potential theoretical and practical implications for scalable AI systems are broad, encompassing improved modularization in RL and facilitating the emulation of higher-order cognition in artificial systems. As such, continued exploration of temporal abstraction, latent-space design, and expressivity within broader contexts could yield enhanced optimization strategies for future intelligent systems.