MLP-VAE in Reinforcement Learning
- MLP-VAE is a neural generative model that combines MLP-based encoding/decoding with VAE principles to learn compact, probabilistic latent representations.
- It employs a bottleneck architecture for direct subgoal prediction, achieving lower latent prediction error compared to structured conditional VAEs.
- Empirical evaluations in visual navigation and RL tasks demonstrate notable improvements in success rates and efficient policy optimization.
A Multi-Layer Perceptron Variational Autoencoder (MLP-VAE) is a neural generative model that combines a multi-layer perceptron (MLP) encoder and decoder architecture with the variational autoencoder (VAE) probabilistic framework. In the context of RL and imagination-based policy optimization, MLP-VAEs are frequently used to learn compact, denoised latent representations of states, with the MLP serving both as the deterministic function approximator for mappings between observation space and latent space, and as the core generative network responsible for “imagination” or generation of plausible future states. While the term MLP-VAE is used generically, a canonical modern instantiation is found in the ForeSIT (Foresight Imagination) module for visual navigation, wherein the VAE structure is reduced to a single MLP bottleneck for subgoal prediction in latent space (Moghaddam et al., 2021).
1. Core Principles of the MLP-VAE
The foundational mechanism underpinning the MLP-VAE framework is the variational autoencoder, which models a probabilistic generative process:
- An encoder parameterized by an MLP maps input data (e.g., pixel features, state vectors) to a posterior over latent variables.
- A decoder, also an MLP, inverts this mapping to reconstruct the input or generate new samples from latent code.
- The training objective is the maximization of the evidence lower bound (ELBO), balancing a reconstruction loss and a Kullback–Leibler divergence regularization enforcing the latent space prior.
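The ELBO computation described above can be sketched numerically. The following is a minimal sketch with untrained, randomly initialized single-hidden-layer MLPs and toy dimensions (all sizes and weights here are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W, b, act=np.tanh):
    return act(x @ W + b)

# Toy dimensions (hypothetical)
x_dim, h_dim, z_dim = 8, 16, 4

# Encoder: x -> hidden -> (mu, log_var)
W_e, b_e = rng.normal(0, 0.1, (x_dim, h_dim)), np.zeros(h_dim)
W_mu, b_mu = rng.normal(0, 0.1, (h_dim, z_dim)), np.zeros(z_dim)
W_lv, b_lv = rng.normal(0, 0.1, (h_dim, z_dim)), np.zeros(z_dim)

# Decoder: z -> hidden -> x_hat
W_d, b_d = rng.normal(0, 0.1, (z_dim, h_dim)), np.zeros(h_dim)
W_o, b_o = rng.normal(0, 0.1, (h_dim, x_dim)), np.zeros(x_dim)

def elbo(x):
    h = mlp(x, W_e, b_e)
    mu, log_var = h @ W_mu + b_mu, h @ W_lv + b_lv
    eps = rng.normal(size=mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps          # reparameterization trick
    x_hat = mlp(mlp(z, W_d, b_d), W_o, b_o, act=lambda v: v)
    recon = -np.sum((x - x_hat) ** 2, axis=-1)    # Gaussian log-lik. up to a constant
    # Closed-form KL of diagonal Gaussian N(mu, exp(log_var)) against N(0, I)
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)
    return recon - kl, kl

x = rng.normal(size=(5, x_dim))
bound, kl = elbo(x)
```

Training would ascend the ELBO by gradient methods, which this sketch omits; note that the KL regularizer is nonnegative by construction.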
In imagination-based RL modules such as ForeSIT, the MLP-VAE instantiates the imagination network: a deep fully-connected MLP that predicts a future "success-relevant" latent state given the current (or initial) latent state and the goal embedding. The bottleneck structure of the MLP maps this input to a compact latent space, paralleling the information bottleneck enforced by classic VAE architectures.
2. Mathematical Formulation
Let s_t denote the input observation or feature vector at time t, and g a goal embedding. The ForeSIT instantiation uses:
- Encoder (shared with the RL policy): a single-layer LSTM with 512 hidden units, processing ResNet-18 visual features together with g to yield the latent state s_t.
- Imagination MLP: a 6-layer, 512-unit fully-connected network with tanh activations, mapping the initial latent state s_0 and g to an imagined subgoal.
- Output: an imagined goal latent ŝ_g in the shared latent state space.
The imagination (subgoal prediction) loss is the squared regression error

L_im(φ) = E_{(s_0, g, s*) ∼ B} [ ‖ f_φ(s_0, g) − s* ‖² ],

where f_φ is the imagination MLP, s_0 is the initial latent state, g is the goal embedding, s* is the most "attended" (impactful) past state within a successful trajectory, and B is a replay buffer of successful episodes.
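A minimal numerical sketch of the imagination MLP and its regression loss, assuming the 6-layer, 512-unit tanh architecture described above; the weight initialization, goal-embedding size, and buffer contents are illustrative stand-ins for trained components:

```python
import numpy as np

rng = np.random.default_rng(1)
s_dim, g_dim, h_dim = 512, 32, 512   # state dim matches the 512-unit LSTM; g_dim is assumed

# 6-layer, 512-unit tanh MLP f_phi: (s_0, g) -> imagined subgoal latent
dims = [s_dim + g_dim] + [h_dim] * 5 + [s_dim]
params = [(rng.normal(0, 0.02, (i, o)), np.zeros(o))
          for i, o in zip(dims[:-1], dims[1:])]

def imagine(s0, g):
    h = np.concatenate([s0, g], axis=-1)
    for k, (W, b) in enumerate(params):
        h = h @ W + b
        if k < len(params) - 1:          # tanh on hidden layers, linear output
            h = np.tanh(h)
    return h

def imagination_loss(buffer):
    """Mean squared error between imagined and attended subgoal states."""
    losses = [np.mean((imagine(s0, g) - s_star) ** 2)
              for s0, g, s_star in buffer]
    return float(np.mean(losses))

# Toy replay buffer of (s_0, g, s*) triples from "successful" episodes
buffer = [(rng.normal(size=s_dim), rng.normal(size=g_dim), rng.normal(size=s_dim))
          for _ in range(4)]
loss = imagination_loss(buffer)
```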
Notably, direct MLP-based imagination outperforms conditional-structured VAEs when the latter are substituted, as the direct mapping exhibits lower prediction error (0.012 vs. 0.018; see ablation in (Moghaddam et al., 2021)), apparently due to lower variance and tighter bottleneck compression for goal-relevant subgoal states.
3. Workflow and Training Algorithm
The typical MLP-VAE (as in ForeSIT) training procedure is tightly coupled with on-policy reinforcement learning:
- Initialize the actor-critic (policy and value) parameters, the attention parameters, and the imagination-MLP parameters.
- At episode onset, encode the initial observation into a latent state, compute the imagined subgoal via the imagination MLP, and inject noise whose variance is governed by the task success rate.
- At each step, condition the actor-critic on the imagined subgoal, the current latent state, and the goal embedding; execute the action and collect the transition.
- Compute attention over the trajectory to isolate the most impactful subgoal state.
- Upon a successful episode, store the (initial state, goal, subgoal) triple in the replay buffer; periodically update the imagination MLP by minimizing the imagination loss over this buffer.
- Iterate until convergence. Complete pseudocode for this on-policy joint training process is specified in (Moghaddam et al., 2021).
Noise variance injected into is annealed with the moving average task success rate to regularize the policy’s dependence on imagined futures.
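The success-rate-coupled noise annealing can be sketched as follows. The exact schedule here (a linear coupling between the noise standard deviation and an exponential moving average of success) is an illustrative assumption, since the source only states that the variance is annealed with the moving-average success rate:

```python
import numpy as np

rng = np.random.default_rng(2)

class NoiseSchedule:
    """Anneal imagination-noise std with a moving average of task success.

    The coupling std = base_std * (1 - success_ema) is an assumption
    for illustration, not the paper's exact schedule.
    """
    def __init__(self, base_std=0.5, momentum=0.95):
        self.base_std = base_std
        self.momentum = momentum
        self.success_ema = 0.0

    def update(self, success: bool):
        # Exponential moving average of the binary episode outcome
        self.success_ema = (self.momentum * self.success_ema
                            + (1 - self.momentum) * float(success))

    def std(self):
        return self.base_std * (1.0 - self.success_ema)

sched = NoiseSchedule()
for _ in range(200):
    sched.update(success=True)   # policy keeps succeeding -> noise shrinks

# Noisy imagined subgoal fed to the policy (toy 8-dim latent)
noisy_subgoal = rng.normal(size=8) + rng.normal(scale=max(sched.std(), 1e-8), size=8)
```

As the agent succeeds more often, the injected noise shrinks, so the policy gradually relies more heavily on the imagined subgoal.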
4. Empirical Evaluation and Structured VAE Ablation
Empirical evaluation on the AI2-THOR target-object navigation task shows that augmenting policy learning with the MLP-VAE-based imagination network yields consistent success-rate (SR) improvements:
- A3C baseline SR ≈ 33%, with ForeSIT SR ≈ 36.8% (+3.8 pts).
- A3C+MAML baseline SR ≈ 40.9%, with ForeSIT SR ≈ 45.5% (+4.6 pts).
- A3C+ORG SR ≈ 65.3%, with ForeSIT SR ≈ 68.0% (+2.7 pts).
Replacing the direct MLP imagination with a structured conditional VAE increases latent prediction error (0.018 vs. 0.012), indicating that a shallow MLP bottleneck can be preferable to a more complex probabilistic VAE when the learning problem is dominated by subgoal compression and the generative task is tightly structured by episodic RL goals (Moghaddam et al., 2021).
5. Generalization and Extension to Other Domains
The MLP-VAE pattern—an MLP mapping from a current or initial latent state and goal embedding to a success-relevant latent future—generalizes beyond visual navigation. The architecture is directly extensible to:
- Robotic manipulation: imagination of key grasp/configuration subgoals in object-centric latent space.
- Multi-agent coordination: jointly imagined subgoals for coordination/safety.
- Goal-conditioned locomotion: prediction of intermediate contact or foothold representations.
- Hierarchical RL: use as a learned high-level subgoal proposer for lower-level policies.
- Continuous control (e.g., autonomous driving): imagination of latent keyframes for state transitions such as lane changes.
In each case, the structural role of the MLP-VAE is to enforce a bottleneck that encodes the information necessary for robust reachability and credit assignment within an on-policy, goal-conditioned RL loop (Moghaddam et al., 2021).
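As a minimal illustration of the hierarchical-RL use case above, the sketch below pairs a hypothetical (untrained, randomly weighted) subgoal proposer with a toy goal-conditioned low-level controller that contracts the latent state toward the proposed subgoal; every component is an illustrative stand-in:

```python
import numpy as np

rng = np.random.default_rng(3)
s_dim, g_dim = 16, 4

# Hypothetical trained imagination MLP (random weights stand in for training)
W = rng.normal(0, 0.1, (s_dim + g_dim, s_dim))

def propose_subgoal(s, g):
    """High-level proposer: map (latent state, goal) to a latent subgoal."""
    return np.tanh(np.concatenate([s, g]) @ W)

def low_level_step(s, subgoal):
    """Toy goal-conditioned controller: move the latent state toward the subgoal."""
    return s + 0.1 * (subgoal - s)

s, g = rng.normal(size=s_dim), rng.normal(size=g_dim)
subgoal = propose_subgoal(s, g)
for _ in range(100):                 # low-level policy pursues the fixed subgoal
    s = low_level_step(s, subgoal)
```

Because each low-level step contracts the gap geometrically, the latent state converges to the proposed subgoal, mirroring the "high-level proposer, low-level executor" division of labor.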
6. Scope, Limitations, and Interpretive Notes
While the MLP-VAE (as instantiated by ForeSIT) is highly effective for structured tasks with episodic rewards and learnable compact subgoal representations, its expressivity is limited by the capacity of the MLP and the suitability of direct regression for future subgoal states. Attempts to increase expressivity via full conditional VAEs may increase error due to overparameterization and weak priors for localized subgoal structure in latent space. A plausible implication is that, for imagination modules whose targets remain highly concentrated around low-dimensional goal manifolds, a simple MLP bottleneck may be empirically optimal compared to sampling-based variational methods.
Broader application of the MLP-VAE approach requires:
- Careful alignment of the latent state space shared by the policy and the imagination MLP.
- Sufficient sample efficiency for reliable identification of impactful subgoal states via attention mechanisms.
- Task structures amenable to clear success/failure delineation enabling the episodic recording of successful subgoal states (Moghaddam et al., 2021).
In summary, the MLP-VAE—embodied by deep, fully-connected bottleneck networks applied to latent state subgoal imagination—provides a computationally efficient, empirically effective alternative to classical structured VAEs for imagination-based policy optimization in vision-based control and beyond.