- The paper presents FeUdal Networks that decouple goal-setting from execution using a Manager and Worker to tackle long-term credit assignment.
- The Manager employs a dilated LSTM with a novel Transition Policy Gradient to set directional goals in a latent state space.
- Experimental results on Atari and DeepMind Lab demonstrate FuN’s superior performance and transfer learning capabilities over standard RL baselines.
This paper introduces FeUdal Networks (FuN), a hierarchical reinforcement learning architecture designed to tackle long-term credit assignment and memory challenges in complex environments. FuN consists of two main modules: a Manager and a Worker, operating at different temporal resolutions.
Core Architecture and Mechanism:
- Manager: Operates at a lower temporal resolution (e.g., makes decisions or sets goals every c steps). Its role is to learn and set abstract goals for the Worker.
- It processes observations via a shared perceptual module (f^percept) to get z_t.
- It computes its own latent state representation s_t = f^Mspace(z_t).
- It uses a recurrent network (f^Mrnn), specifically a dilated LSTM (dLSTM), to process s_t and output a raw goal vector \hat{g}_t.
- The final goal g_t is the normalized version: g_t = \hat{g}_t / ||\hat{g}_t||. This normalization enforces that goals represent directions in the latent state space s.
- Worker: Operates at a higher temporal resolution (every environment tick). Its role is to execute low-level actions to achieve the goals set by the Manager.
- It also receives the shared perceptual representation z_t.
- It uses its own recurrent network (f^Wrnn), typically a standard LSTM.
- It receives the Manager's goals, pooled over the last c steps and linearly embedded: w_t = ϕ(∑_{i=t-c}^{t} g_i), where ϕ is a learned linear projection without bias.
- The Worker's LSTM output U_t (representing action embeddings) is combined with the goal embedding w_t via a multiplicative interaction to produce the final action policy: π_t = SoftMax(U_t w_t) (see the sketch after this list).
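A minimal PyTorch sketch of one forward tick through this architecture. It uses a plain LSTMCell as a stand-in for the dilated LSTM (sketched later), and all names and sizes (FuNCell, d, k, num_actions) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuNCell(nn.Module):
    """Illustrative one-tick forward pass of FuN (not the official code).

    d: dimensionality of the Manager's latent space s_t and goals g_t.
    k: dimensionality of the goal embedding w_t (k << d).
    """
    def __init__(self, z_dim, d=256, k=16, num_actions=18, c=10):
        super().__init__()
        self.c, self.num_actions, self.k = c, num_actions, k
        self.f_Mspace = nn.Linear(z_dim, d)            # s_t = f^Mspace(z_t)
        self.manager_rnn = nn.LSTMCell(d, d)           # stand-in for the dilated LSTM
        self.worker_rnn = nn.LSTMCell(z_dim, num_actions * k)
        self.phi = nn.Linear(d, k, bias=False)         # goal embedding phi, no bias

    def forward(self, z_t, manager_state, worker_state, goal_history):
        # Manager: latent state, raw goal, then normalisation to a direction
        s_t = torch.relu(self.f_Mspace(z_t))
        h_M, c_M = self.manager_rnn(s_t, manager_state)
        g_t = F.normalize(h_M, dim=-1)                 # g_t = g_hat_t / ||g_hat_t||

        # Worker: pool the last c goals, embed, and modulate action logits
        goal_history = (goal_history + [g_t])[-self.c:]
        pooled = torch.stack(goal_history).sum(0).detach()  # Worker treats goals as constants
        w_t = self.phi(pooled)                               # w_t = phi(sum_i g_i)
        h_W, c_W = self.worker_rnn(z_t, worker_state)
        U_t = h_W.view(-1, self.num_actions, self.k)         # per-action embeddings
        pi_t = F.softmax(torch.einsum('bak,bk->ba', U_t, w_t), dim=-1)  # pi_t = SoftMax(U_t w_t)
        return pi_t, g_t, s_t, (h_M, c_M), (h_W, c_W), goal_history
```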
Training and Learning:
FuN employs decoupled training objectives for the Manager and Worker, which is a key aspect of its design:
- Manager Training (Transition Policy Gradient - TPG): The Manager is trained to predict advantageous directions of state change over a horizon c. It learns by maximizing the cosine similarity between its proposed goal direction g_t and the actual observed state change direction s_{t+c} - s_t, weighted by the Manager's advantage function A^M_t. The gradient update is approximately:
∇g_t = A^M_t ∇_θ d_cos(s_{t+c} - s_t, g_t(θ))
This update is derived as an approximation to policy gradients applied directly to the state transitions induced by the sub-policies (goals), assuming the Worker's transitions follow a von Mises-Fisher distribution around the goal direction. The Manager's learning signal comes solely from the environment reward R_t (via A^M_t). A loss-form sketch follows this list.
- Worker Training (Intrinsic + Extrinsic Reward): The Worker is trained using a standard RL algorithm (like A3C) to maximize a combination of the environment reward R_t and an intrinsic reward R^I_t. The intrinsic reward encourages the Worker to follow the Manager's goals:
r^I_t = (1/c) ∑_{i=1}^{c} d_cos(s_t - s_{t-i}, g_{t-i})
This rewards the Worker for moving in the direction specified by the Manager's recent goals. The total objective for the Worker is R_t + α R^I_t, where α is a hyperparameter balancing intrinsic and extrinsic motivation (also sketched after this list).
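A hedged, loss-form sketch of the Manager update above, assuming latent states s_t, goals g_t(θ), and Manager advantages A^M_t have already been collected along a trajectory; d_cos is ordinary cosine similarity, and function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def manager_tpg_loss(s, g, adv_M, c=10):
    """Transition policy gradient loss (sketch, not the authors' implementation).

    s:     latent states s_t,        shape (T, d)  -- treated as constants
    g:     Manager goals g_t(theta), shape (T, d)  -- must carry gradients
    adv_M: Manager advantages A^M_t, shape (T,)
    c:     goal horizon
    """
    T = s.shape[0]
    delta_s = (s[c:] - s[:-c]).detach()                  # state-change directions s_{t+c} - s_t
    cos = F.cosine_similarity(delta_s, g[:T - c], dim=-1)
    # Ascend A^M_t * d_cos(s_{t+c} - s_t, g_t), i.e. minimise its negative
    return -(adv_M[:T - c].detach() * cos).mean()
```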
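And a matching sketch of the Worker's intrinsic reward r^I_t for a single time step t, again on precollected s and g; mixing with the environment return via α is left to the surrounding RL algorithm (e.g., A3C).

```python
import torch
import torch.nn.functional as F

def worker_intrinsic_reward(s, g, t, c=10):
    """r^I_t = (1/c) * sum_{i=1..c} d_cos(s_t - s_{t-i}, g_{t-i})  (sketch).

    s: latent states, shape (T, d); g: Manager goals, shape (T, d); requires t >= c.
    """
    terms = [
        F.cosine_similarity(s[t] - s[t - i], g[t - i], dim=-1)
        for i in range(1, c + 1)
    ]
    return torch.stack(terms).mean()   # the Worker then maximises R_t + alpha * r^I_t
```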
Key Innovations and Implementation Details:
- Hierarchical Structure: Explicit separation of goal-setting (Manager) and goal-achieving (Worker) at different time scales.
- Directional Goals: Goals g_t are unit vectors representing directions in a learned latent space s_t, promoting generalization compared to absolute goals.
- Decoupled Learning: Manager learns from extrinsic reward via TPG; Worker learns from intrinsic + extrinsic reward via standard RL. No gradients flow directly from Worker actions to Manager goals, giving goals semantic meaning related to state transitions.
- Transition Policy Gradient (TPG): A novel update rule for the Manager that bypasses the need to differentiate through the Worker's policy, directly linking goals to desirable state transitions.
- Dilated LSTM (dLSTM): A custom RNN for the Manager. It has r internal LSTM cores, and at time step t only core t % r is updated: \hat{h}^{t % r}_t, \hat{g}_t = LSTM(s_t, \hat{h}^{t % r}_{t-1}). This allows different parts of the state to retain information over longer timescales (effectively r times longer) while still processing input at every step. r = 10 and c = 10 were used in experiments (a sketch follows this list).
- Goal Embedding: Goals influence the Worker's policy via pooling (sum over c steps) and multiplicative interaction (U_t w_t), ensuring smooth conditioning. The linear embedding ϕ has no bias, preventing the Worker from ignoring the Manager.
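A minimal sketch of the dilated LSTM described above: one set of LSTM weights with r rotating groups of hidden state, so each group is read and written only every r ticks while the module still emits an output at every tick. Class and method names are illustrative.

```python
import torch
import torch.nn as nn

class DilatedLSTM(nn.Module):
    """Dilated LSTM sketch (illustrative, not the authors' code)."""
    def __init__(self, input_size, hidden_size, r=10):
        super().__init__()
        self.r, self.hidden_size = r, hidden_size
        self.cell = nn.LSTMCell(input_size, hidden_size)

    def init_state(self, batch_size, device=None):
        zeros = lambda: torch.zeros(batch_size, self.hidden_size, device=device)
        return [(zeros(), zeros()) for _ in range(self.r)]  # r groups of (h, c)

    def forward(self, s_t, state, t):
        # h_hat^{t % r}_t, g_hat_t = LSTM(s_t, h_hat^{t % r}_{t-1})
        idx = t % self.r
        h, c = self.cell(s_t, state[idx])
        state[idx] = (h, c)
        return h, state   # h is the raw goal g_hat_t, normalised downstream
```

With r = 10 and c = 10 as in the experiments, each state group is updated roughly once per Manager horizon.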
Experimental Results and Applications:
- Atari Games: FuN significantly outperformed a strong LSTM baseline (A3C) on challenging Atari games requiring long-term planning and memory, most notably Montezuma's Revenge, where it achieved much higher scores and learned faster. It also showed strong performance on Ms. Pacman, Amidar, Gravitar, Enduro, and Frostbite. Visualizations showed FuN learned interpretable sub-goals (waypoints in Montezuma's Revenge) and distinct sub-policies (spatial navigation in Seaquest).
- DeepMind Lab: FuN outperformed the LSTM baseline on 3D visual memorisation tasks (Water maze, T-maze, Non-match), demonstrating its effectiveness in environments requiring memory and spatial reasoning.
- Ablation Studies: Confirmed the importance of key components: TPG was superior to alternatives (end-to-end gradients, standard PG for goals, absolute goals); dLSTM was crucial for Manager performance; directional goals outperformed absolute goals; intrinsic reward (α>0) was generally beneficial.
- Transfer Learning: Showed potential for transfer learning by successfully transferring a FuN agent trained with action repeat 4 to an environment with action repeat 1, outperforming baselines trained from scratch or transferred similarly. This suggests the learned hierarchical structure (especially the Manager's transition policy) can generalize across changes in low-level dynamics.
Practical Considerations:
- Implementation: Requires implementing the dLSTM, the TPG update rule for the Manager, the intrinsic reward calculation for the Worker, and managing the two separate training loops within an RL framework like A3C.
- Computational Cost: Likely higher than a flat LSTM baseline due to the two RNNs and potentially longer BPTT unrolls (e.g., 400 steps for FuN vs. 40/100 for LSTM in experiments). However, it may be more sample efficient on complex tasks.
- Hyperparameters: Introduces new hyperparameters like the goal horizon c, dLSTM dilation r, goal embedding dimension k, and intrinsic reward weight α, which may require tuning (collected in the config sketch below).
- Applicability: Well-suited for tasks with sparse rewards, long time horizons, requirements for temporal abstraction, or where hierarchical decomposition is natural (e.g., complex navigation, manipulation, strategy games). It might be less beneficial or overly complex for purely reactive tasks where long-term planning is not critical.
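For reference, a hedged configuration sketch collecting these knobs; only c = 10 and r = 10 come from the experiments summarised above, and the remaining defaults are placeholders to tune.

```python
from dataclasses import dataclass

@dataclass
class FuNConfig:
    """Hyperparameters introduced by FuN (only c and r are taken from the text above)."""
    c: int = 10          # Manager goal horizon (from the experiments)
    r: int = 10          # dLSTM dilation (from the experiments)
    k: int = 16          # goal embedding dimension -- placeholder, tune per task
    d: int = 256         # Manager latent/goal dimensionality -- placeholder, tune per task
    alpha: float = 0.5   # intrinsic reward weight in R_t + alpha * R^I_t -- placeholder, tune per task
```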