- The paper presents FeUdal Networks that decouple goal-setting from execution using a Manager and Worker to tackle long-term credit assignment.
- The Manager employs a dilated LSTM with a novel Transition Policy Gradient to set directional goals in a latent state space.
- Experimental results on Atari and DeepMind Lab demonstrate FuN’s superior performance and transfer learning capabilities over standard RL baselines.
This paper introduces FeUdal Networks (FuN), a hierarchical reinforcement learning architecture designed to tackle long-term credit assignment and memory challenges in complex environments. FuN consists of two main modules: a Manager and a Worker, operating at different temporal resolutions.
Core Architecture and Mechanism:
- Manager: Operates at a lower temporal resolution (e.g., makes decisions or sets goals every c steps). Its role is to learn and set abstract goals for the Worker.
- It processes observations via a shared perceptual module (f^percept) to get z_t.
- It computes its own latent state representation s_t = f^Mspace(z_t).
- It uses a recurrent network (f^Mrnn), specifically a dilated LSTM (dLSTM), to process s_t and output a raw goal vector \hat{g}_t.
- The final goal g_t is the normalized version: g_t = \hat{g}_t / ||\hat{g}_t||. This normalization enforces that goals represent directions in the latent state space s.
- Worker: Operates at a higher temporal resolution (every environment tick). Its role is to execute low-level actions to achieve the goals set by the Manager.
- It also receives the shared perceptual representation z_t.
- It uses its own recurrent network (f^Wrnn), typically a standard LSTM.
- It receives the Manager's goals, pooled over the last c steps and linearly embedded: w_t = ϕ(∑_{i=t-c}^{t} g_i), where ϕ is a learned linear projection without bias.
- The Worker's LSTM output U_t (representing action embeddings) is combined with the goal embedding w_t via a multiplicative interaction to produce the final action policy: π_t = SoftMax(U_t w_t) (see the sketch after this list).
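A minimal PyTorch sketch of one forward tick through this architecture. It uses a plain LSTMCell as a stand-in for the dilated LSTM (sketched later), and all names and sizes (FuNCell, d, k, num_actions) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuNCell(nn.Module):
    """Illustrative one-tick forward pass of FuN (not the official code).

    d: dimensionality of the Manager's latent space s_t and goals g_t.
    k: dimensionality of the goal embedding w_t (k << d).
    """
    def __init__(self, z_dim, d=256, k=16, num_actions=18, c=10):
        super().__init__()
        self.c, self.num_actions, self.k = c, num_actions, k
        self.f_Mspace = nn.Linear(z_dim, d)            # s_t = f^Mspace(z_t)
        self.manager_rnn = nn.LSTMCell(d, d)           # stand-in for the dilated LSTM
        self.worker_rnn = nn.LSTMCell(z_dim, num_actions * k)
        self.phi = nn.Linear(d, k, bias=False)         # goal embedding phi, no bias

    def forward(self, z_t, manager_state, worker_state, goal_history):
        # Manager: latent state, raw goal, then normalisation to a direction
        s_t = torch.relu(self.f_Mspace(z_t))
        h_M, c_M = self.manager_rnn(s_t, manager_state)
        g_t = F.normalize(h_M, dim=-1)                 # g_t = g_hat_t / ||g_hat_t||

        # Worker: pool the last c goals, embed, and modulate action logits
        goal_history = (goal_history + [g_t])[-self.c:]
        pooled = torch.stack(goal_history).sum(0).detach()  # Worker treats goals as constants
        w_t = self.phi(pooled)                               # w_t = phi(sum_i g_i)
        h_W, c_W = self.worker_rnn(z_t, worker_state)
        U_t = h_W.view(-1, self.num_actions, self.k)         # per-action embeddings
        pi_t = F.softmax(torch.einsum('bak,bk->ba', U_t, w_t), dim=-1)  # pi_t = SoftMax(U_t w_t)
        return pi_t, g_t, s_t, (h_M, c_M), (h_W, c_W), goal_history
```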
Training and Learning:
FuN employs decoupled training objectives for the Manager and Worker, which is a key aspect of its design:
- Manager Training (Transition Policy Gradient - TPG): The Manager is trained to predict advantageous directions of state change over a horizon c. It learns by maximizing the cosine similarity between its proposed goal direction g_t and the actual observed state change direction s_{t+c} - s_t, weighted by the Manager's advantage function A^M_t. The gradient update is approximately:
∇g_t = A^M_t ∇_θ d_cos(s_{t+c} - s_t, g_t(θ))
This update is derived as an approximation to policy gradients applied directly to the state transitions induced by the sub-policies (goals), assuming the Worker's transitions follow a von Mises-Fisher distribution around the goal direction. The Manager's learning signal comes solely from the environment reward R_t (via A^M_t). A loss-form sketch follows this list.
- Worker Training (Intrinsic + Extrinsic Reward): The Worker is trained using a standard RL algorithm (like A3C) to maximize a combination of the environment reward R_t and an intrinsic reward R^I_t. The intrinsic reward encourages the Worker to follow the Manager's goals:
r^I_t = (1/c) ∑_{i=1}^{c} d_cos(s_t - s_{t-i}, g_{t-i})
This rewards the Worker for moving in the direction specified by the Manager's recent goals. The total objective for the Worker is R_t + α R^I_t, where α is a hyperparameter balancing intrinsic and extrinsic motivation (also sketched after this list).
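A hedged, loss-form sketch of the Manager update above, assuming latent states s_t, goals g_t(θ), and Manager advantages A^M_t have already been collected along a trajectory; d_cos is ordinary cosine similarity, and function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def manager_tpg_loss(s, g, adv_M, c=10):
    """Transition policy gradient loss (sketch, not the authors' implementation).

    s:     latent states s_t,        shape (T, d)  -- treated as constants
    g:     Manager goals g_t(theta), shape (T, d)  -- must carry gradients
    adv_M: Manager advantages A^M_t, shape (T,)
    c:     goal horizon
    """
    T = s.shape[0]
    delta_s = (s[c:] - s[:-c]).detach()                  # state-change directions s_{t+c} - s_t
    cos = F.cosine_similarity(delta_s, g[:T - c], dim=-1)
    # Ascend A^M_t * d_cos(s_{t+c} - s_t, g_t), i.e. minimise its negative
    return -(adv_M[:T - c].detach() * cos).mean()
```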
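And a matching sketch of the Worker's intrinsic reward r^I_t for a single time step t, again on precollected s and g; mixing with the environment return via α is left to the surrounding RL algorithm (e.g., A3C).

```python
import torch
import torch.nn.functional as F

def worker_intrinsic_reward(s, g, t, c=10):
    """r^I_t = (1/c) * sum_{i=1..c} d_cos(s_t - s_{t-i}, g_{t-i})  (sketch).

    s: latent states, shape (T, d); g: Manager goals, shape (T, d); requires t >= c.
    """
    terms = [
        F.cosine_similarity(s[t] - s[t - i], g[t - i], dim=-1)
        for i in range(1, c + 1)
    ]
    return torch.stack(terms).mean()   # the Worker then maximises R_t + alpha * r^I_t
```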
Key Innovations and Implementation Details:
- Hierarchical Structure: Explicit separation of goal-setting (Manager) and goal-achieving (Worker) at different time scales.
- Directional Goals: Goals g_t are unit vectors representing directions in a learned latent space s_t, promoting generalization compared to absolute goals.
- Decoupled Learning: Manager learns from extrinsic reward via TPG; Worker learns from intrinsic + extrinsic reward via standard RL. No gradients flow directly from Worker actions to Manager goals, giving goals semantic meaning related to state transitions.
- Transition Policy Gradient (TPG): A novel update rule for the Manager that bypasses the need to differentiate through the Worker's policy, directly linking goals to desirable state transitions.
- Dilated LSTM (dLSTM): A custom RNN for the Manager. It has r internal LSTM cores, and at time step t only core t % r is updated: \hat{h}^{t % r}_t, \hat{g}_t = LSTM(s_t, \hat{h}^{t % r}_{t-1}). This allows different parts of the state to retain information over longer timescales (effectively r times longer) while still processing input at every step. r = 10 and c = 10 were used in experiments (a sketch follows this list).
- Goal Embedding: Goals influence the Worker's policy via pooling (sum over c steps) and multiplicative interaction (U_t w_t), ensuring smooth conditioning. The linear embedding ϕ has no bias, preventing the Worker from ignoring the Manager.
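A minimal sketch of the dilated LSTM described above: one set of LSTM weights with r rotating groups of hidden state, so each group is read and written only every r ticks while the module still emits an output at every tick. Class and method names are illustrative.

```python
import torch
import torch.nn as nn

class DilatedLSTM(nn.Module):
    """Dilated LSTM sketch (illustrative, not the authors' code)."""
    def __init__(self, input_size, hidden_size, r=10):
        super().__init__()
        self.r, self.hidden_size = r, hidden_size
        self.cell = nn.LSTMCell(input_size, hidden_size)

    def init_state(self, batch_size, device=None):
        zeros = lambda: torch.zeros(batch_size, self.hidden_size, device=device)
        return [(zeros(), zeros()) for _ in range(self.r)]  # r groups of (h, c)

    def forward(self, s_t, state, t):
        # h_hat^{t % r}_t, g_hat_t = LSTM(s_t, h_hat^{t % r}_{t-1})
        idx = t % self.r
        h, c = self.cell(s_t, state[idx])
        state[idx] = (h, c)
        return h, state   # h is the raw goal g_hat_t, normalised downstream
```

With r = 10 and c = 10 as in the experiments, each state group is updated roughly once per Manager horizon.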
Experimental Results and Applications:
- Atari Games: FuN significantly outperformed a strong LSTM baseline (A3C) on challenging Atari games requiring long-term planning and memory, most notably Montezuma's Revenge, where it achieved much higher scores and learned faster. It also showed strong performance on Ms. Pacman, Amidar, Gravitar, Enduro, and Frostbite. Visualizations showed FuN learned interpretable sub-goals (waypoints in Montezuma's Revenge) and distinct sub-policies (spatial navigation in Seaquest).
- DeepMind Lab: FuN outperformed the LSTM baseline on 3D visual memorisation tasks (Water maze, T-maze, Non-match), demonstrating its effectiveness in environments requiring memory and spatial reasoning.
- Ablation Studies: Confirmed the importance of key components: TPG was superior to alternatives (end-to-end gradients, standard PG for goals, absolute goals); dLSTM was crucial for Manager performance; directional goals outperformed absolute goals; intrinsic reward (α>0) was generally beneficial.
- Transfer Learning: Showed potential for transfer learning by successfully transferring a FuN agent trained with action repeat 4 to an environment with action repeat 1, outperforming baselines trained from scratch or transferred similarly. This suggests the learned hierarchical structure (especially the Manager's transition policy) can generalize across changes in low-level dynamics.
Practical Considerations:
- Implementation: Requires implementing the dLSTM, the TPG update rule for the Manager, the intrinsic reward calculation for the Worker, and managing the two separate training loops within an RL framework like A3C.
- Computational Cost: Likely higher than a flat LSTM baseline due to the two RNNs and potentially longer BPTT unrolls (e.g., 400 steps for FuN vs. 40/100 for LSTM in experiments). However, it may be more sample efficient on complex tasks.
- Hyperparameters: Introduces new hyperparameters like the goal horizon c, dLSTM dilation r, goal embedding dimension k, and intrinsic reward weight α, which may require tuning (collected in the config sketch below).
- Applicability: Well-suited for tasks with sparse rewards, long time horizons, requirements for temporal abstraction, or where hierarchical decomposition is natural (e.g., complex navigation, manipulation, strategy games). It might be less beneficial or overly complex for purely reactive tasks where long-term planning is not critical.
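For reference, a hedged configuration sketch collecting these knobs; only c = 10 and r = 10 come from the experiments summarised above, and the remaining defaults are placeholders to tune.

```python
from dataclasses import dataclass

@dataclass
class FuNConfig:
    """Hyperparameters introduced by FuN (only c and r are taken from the text above)."""
    c: int = 10          # Manager goal horizon (from the experiments)
    r: int = 10          # dLSTM dilation (from the experiments)
    k: int = 16          # goal embedding dimension -- placeholder, tune per task
    d: int = 256         # Manager latent/goal dimensionality -- placeholder, tune per task
    alpha: float = 0.5   # intrinsic reward weight in R_t + alpha * R^I_t -- placeholder, tune per task
```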