MLP-Policy Network in RL & MPC

Updated 29 December 2025
  • MLP-Policy Network is a deep feedforward model that maps high-dimensional observations to probability distributions over actions, serving as a foundational component in reinforcement learning.
  • It employs multiple fully connected layers with nonlinear activations and policy gradient techniques, including variance reduction, to optimize decision-making in complex environments.
  • Integration with model predictive control enables these networks to provide high-level, adaptive strategies that improve control accuracy and execution smoothness.

A Multi-Layer Perceptron Policy Network (MLP-Policy Network) is a parameterized policy architecture in which a deep feedforward neural network, specifically an MLP, maps structured or high-dimensional observations to probability distributions over actions or to continuous decision variables. MLP-Policy Networks are widely used in reinforcement learning (RL), both for direct policy gradient methods and as high-level controllers interfacing with model predictive control (MPC) and related hybrid paradigms. These architectures are characterized by multiple fully connected layers with nonlinear activations, typically ReLU, that transform input features into actionable outputs parameterizing the learned policy π(a|s;θ) (Phon-Amnuaisuk, 2018; Song et al., 2020).

1. Network Architectures and Policy Parameterization

MLP-Policy Networks operate on preprocessed or raw state observations, such as high-dimensional pixel data or low-dimensional task features. For example, in deep RL for Atari Pong, the network processes 80×80 grayscale frames, flattened to 6,400-dimensional vectors with temporal differencing (x_t = frame_t – frame_{t–1}) to encode motion. The MLP architecture variants include:

  • Single hidden layer (H ∈ {100, 200, 400} units), ReLU activation.
  • Two hidden layers (“100:10”, i.e., 100 units followed by 10 units), ReLU activation.
  • Output layer: number of units corresponds to discrete action classes or output dimensionality, with Softmax or Sigmoid for discrete actions (e.g., Up, Down, Still in Pong).
  • Weights: W¹ ∈ ℝ^{H×D}, W² ∈ ℝ^{C×H} (D = input dimension, C = number of actions).

Policy parameterization is typically implemented as π(a=k|s;θ) = Softmax_k(z_k), or Sigmoid(W²·ReLU(W¹·x)) with one-hot action encoding (Phon-Amnuaisuk, 2018). In hierarchical control settings, the MLP may output continuous variables (e.g., predicted traversal time z_t = f_Φ(o_t)) as high-level parameters for a lower-level controller (Song et al., 2020).
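
A minimal Python sketch of this parameterization, assuming the single-hidden-layer Pong configuration (D = 6400, H = 200, C = 3); the function name mlp_policy_forward and the random initialization are illustrative, not taken from the cited papers:

    import numpy as np

    D, H, C = 6400, 200, 3                            # input dim, hidden units, discrete actions
    rng = np.random.default_rng(0)
    W1 = rng.normal(0.0, 1.0 / np.sqrt(D), (H, D))    # W1 in R^{H x D}
    W2 = rng.normal(0.0, 1.0 / np.sqrt(H), (C, H))    # W2 in R^{C x H}

    def mlp_policy_forward(x):
        """Map a flattened difference frame x (length D) to pi(a|s; theta)."""
        h = np.maximum(0.0, W1 @ x)                   # ReLU hidden activations
        z = W2 @ h                                    # action logits z_k
        z = z - z.max()                               # shift logits for numerical stability
        p = np.exp(z) / np.exp(z).sum()               # Softmax_k(z_k)
        return p, h

    # Example: sample one action for a preprocessed observation.
    x = rng.random(D)                                 # stands in for frame_t - frame_{t-1}
    probs, _ = mlp_policy_forward(x)
    action = rng.choice(C, p=probs)                   # e.g. 0 = Up, 1 = Down, 2 = Still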

2. Policy Gradient Formulations and Variance Reduction

The optimization objective for directly parameterized MLP policies is the maximization of expected return over episodic interactions, J(θ) = E_{τ∼π_θ} [R(τ)], commonly optimized via policy gradient methods. The REINFORCE surrogate objective is formulated as:

J(θ) = E_{π_θ} [ ∑_{t=0}^{T–1} log π(a_t|s_t; θ) · G_t ]

with G_t = ∑_{k=t}^{T–1} γ^{k–t} r_k denoting the discounted sum of future rewards. The likelihood-ratio trick is applied for gradient computation:

∇_θ J(θ) = E_{π_θ} [ ∑_{t=0}^{T–1} ∇_θ log π(a_t|s_t; θ) · G_t ]

Variance reduction is achieved by introducing a baseline b (e.g., the episode-average return), yielding the advantage estimate (G_t – b). Subtracting the baseline leaves the gradient estimator unbiased while reducing its variance, which improves sample efficiency. In asynchronous advantage actor-critic (A3C) extensions, actor and critic losses are combined with an entropy bonus for exploration:

C(θ) = β₁ L_π + β₂ L_v – β₃ H

(Phon-Amnuaisuk, 2018).
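
Continuing the Section 1 sketch (reusing W1, W2, and mlp_policy_forward), the following illustrative code computes the REINFORCE gradient with the episode-average return as baseline; the derivative expressions assume the Softmax parameterization above:

    def discounted_returns(rewards, gamma=0.9):
        """G_t = sum_{k=t}^{T-1} gamma^(k-t) * r_k, computed backwards in O(T)."""
        G, out = 0.0, np.zeros(len(rewards))
        for t in reversed(range(len(rewards))):
            G = rewards[t] + gamma * G
            out[t] = G
        return out

    def reinforce_gradients(episode, gamma=0.9):
        """Accumulate grad_theta log pi(a_t|s_t; theta) * (G_t - b) over one episode.

        episode: list of (x_t, a_t, r_t) tuples collected on-policy.
        """
        xs, acts, rews = zip(*episode)
        G = discounted_returns(list(rews), gamma)
        adv = G - G.mean()                            # baseline b: episode-average return
        dW1, dW2 = np.zeros_like(W1), np.zeros_like(W2)
        for x, a, A in zip(xs, acts, adv):
            p, h = mlp_policy_forward(x)
            dz = -p
            dz[a] += 1.0                              # grad_z log Softmax = onehot(a) - p
            dW2 += A * np.outer(dz, h)
            dh = (W2.T @ dz) * (h > 0)                # backpropagate through the ReLU layer
            dW1 += A * np.outer(dh, x)
        return dW1, dW2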

3. Training Regimes and Hyperparameter Configurations

Training MLP-Policy Networks involves on-policy or distributed paradigms, with hyperparameters chosen according to architecture depth and problem specifics. For policy-gradient updates with the feedforward policy, the learning rate is set to α = 0.001 (single hidden layer) or 0.0001 (two layers), with discount factor γ = 0.9. Training proceeds for tens of thousands of episodes, with one episode per gradient update (on-policy). The optimizer is typically vanilla SGD with backpropagation. Label biasing is employed so that sampled one-hot action labels are biased towards the current policy's recommendation while still maintaining exploration.
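
A minimal on-policy loop matching these settings (α = 0.001, γ = 0.9, one update per episode), building on the gradient sketch in Section 2; run_episode is a hypothetical placeholder for the environment rollout, and the label-biasing detail is omitted:

    ALPHA, GAMMA, EPISODES = 0.001, 0.9, 20000

    def run_episode(env):
        """Placeholder: roll out the current policy and return [(x_t, a_t, r_t), ...]."""
        raise NotImplementedError                     # environment-specific (e.g. a Pong wrapper)

    def train(env):
        global W1, W2
        for _ in range(EPISODES):
            episode = run_episode(env)                # one on-policy rollout
            dW1, dW2 = reinforce_gradients(episode, GAMMA)
            W1 += ALPHA * dW1                         # vanilla SGD ascent on J(theta),
            W2 += ALPHA * dW2                         # one gradient update per episode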

A3C implementations utilize multiple asynchronous agents (N ∈ {2, 4, 8}) with shared parameters, each accumulating gradients for up to t_{max} steps and atomically updating θ. This significantly reduces wall-clock training time and improves data efficiency. CNN front-ends can be added to the MLP to enhance performance on raw visual tasks (Phon-Amnuaisuk, 2018).
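
A sketch of the combined objective C(θ) = β₁L_π + β₂L_v – β₃H from Section 2, evaluated on one n-step rollout; the value head and the coefficient values are illustrative assumptions, since they are not specified here:

    def a3c_loss_terms(log_probs, values, returns, action_probs):
        """Actor, critic, and entropy terms for C = b1*L_pi + b2*L_v - b3*H.

        log_probs:    log pi(a_t|s_t) of the actions taken, shape (T,)
        values:       critic estimates V(s_t), shape (T,)
        returns:      n-step bootstrapped returns, shape (T,)
        action_probs: full pi(.|s_t) rows, shape (T, C)
        """
        adv = returns - values                        # advantage estimates
        L_pi = -np.mean(log_probs * adv)              # actor (policy-gradient) loss
        L_v = np.mean(adv ** 2)                       # critic regression loss
        H = -np.mean(np.sum(action_probs * np.log(action_probs + 1e-8), axis=1))
        return L_pi, L_v, H

    # Combined objective to minimize, with illustrative coefficients beta_1..beta_3:
    # L_pi, L_v, H = a3c_loss_terms(...); C = 1.0 * L_pi + 0.5 * L_v - 0.01 * H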

For high-level policy learning in MPC, the MLP is trained in a supervised fashion on (observation, z*) pairs, where z* are optimal decision variables determined via reward-weighted regression and probabilistic policy search. Mean squared error between predicted and optimal values is minimized, and backpropagation is restricted to the MLP module (Song et al., 2020).
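
A hedged sketch of this supervised stage: a two-layer, 32-unit MLP (matching the architecture reported in Section 4) fitted to (observation, z*) pairs by minimizing mean squared error with plain gradient descent; the helper name fit_high_level_mlp and the optimization hyperparameters are illustrative:

    def fit_high_level_mlp(obs, z_star, hidden=32, lr=1e-3, epochs=500, seed=0):
        """Fit f_Phi: o_t -> z_t by regression on optimal decision variables z*.

        obs:    (N, 10) observations (e.g. quadrotor/pendulum state differences)
        z_star: (N, 1) optimal decision variables from probabilistic policy search
        """
        rng = np.random.default_rng(seed)
        N, d = obs.shape
        A1 = rng.normal(0, 1 / np.sqrt(d), (hidden, d)); b1 = np.zeros(hidden)
        A2 = rng.normal(0, 1 / np.sqrt(hidden), (hidden, hidden)); b2 = np.zeros(hidden)
        A3 = rng.normal(0, 1 / np.sqrt(hidden), (1, hidden)); b3 = np.zeros(1)
        for _ in range(epochs):
            h1 = np.maximum(0, obs @ A1.T + b1)       # first hidden layer, ReLU
            h2 = np.maximum(0, h1 @ A2.T + b2)        # second hidden layer, ReLU
            pred = h2 @ A3.T + b3                     # linear output z_t
            g3 = (pred - z_star) / N                  # gradient of 0.5 * MSE w.r.t. pred
            dA3, db3 = g3.T @ h2, g3.sum(0)
            g2 = (g3 @ A3) * (h2 > 0)                 # backprop through second ReLU
            dA2, db2 = g2.T @ h1, g2.sum(0)
            g1 = (g2 @ A2) * (h1 > 0)                 # backprop through first ReLU
            dA1, db1 = g1.T @ obs, g1.sum(0)
            for W, dW in ((A1, dA1), (b1, db1), (A2, dA2),
                          (b2, db2), (A3, dA3), (b3, db3)):
                W -= lr * dW                          # plain gradient-descent step
        return A1, b1, A2, b2, A3, b3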

4. Empirical Results and Comparative Analyses

Empirical findings demonstrate that MLP-Policy Networks are effective for high-dimensional end-to-end control, though architectural choices affect convergence speed and final performance:

  • In the Pong domain, single-hidden-layer MLPs converge in ~20K episodes (40+ hours), with larger networks (200/400 units) accelerating early learning but not altering final performance substantially.
  • Two-layer architectures (100:10) require a lower learning rate for stable training.
  • A3C with an MLP policy achieves a near-linear speedup as actors are added; 8 agents learn ~4–5× faster than a single-agent setup.
  • CNN-augmented variants outperform pure MLPs both in convergence time and final policy quality (Phon-Amnuaisuk, 2018).

In the High-MPC context, a two-layer, 32-unit-per-layer MLP trained on 40,000 examples requires less than 5 minutes to fit, and predicts high-level hyperparameters for each observation, resulting in reduced traversal error (mean y-z error 0.24 m vs. 0.30 m for standard MPC) and smoother actuator usage (lower saturation time) (Song et al., 2020).

5. Internal Representation Analysis and Feature Visualization

Post-training analysis of MLP-Policy Networks reveals interpretable activation and weight patterns. Hidden-unit activations cluster by action type, and subclusters within each action group correspond to distinct game contexts (e.g., paddle behaviors in Pong). Visualizing first-layer weights by reshaping each row W_i ∈ ℝ^{6400} into an 80×80 image shows that, after learning, the filters attend to task-relevant regions such as ball or paddle trajectories. Before training the weights exhibit no spatial structure, whereas the trained filters highlight spatial and temporal dependencies learned end-to-end (Phon-Amnuaisuk, 2018). This suggests that, even with dense fully connected networks, latent features relevant to policy control can emerge without hand-crafted inputs.
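
A short sketch of this visualization, assuming the W1 matrix from the Section 1 example (each row lies in ℝ^{6400} and reshapes to an 80×80 image); the matplotlib usage is illustrative:

    import matplotlib.pyplot as plt

    def show_first_layer_filters(W1, n_rows=4, n_cols=4, side=80):
        """Display the first n_rows*n_cols rows of W1 as side x side grayscale images."""
        fig, axes = plt.subplots(n_rows, n_cols, figsize=(8, 8))
        for ax, w in zip(axes.ravel(), W1[: n_rows * n_cols]):
            ax.imshow(w.reshape(side, side), cmap="gray")   # spatial layout of one filter
            ax.axis("off")
        fig.tight_layout()
        plt.show()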

6. Integration with Model Predictive Control

MLP-Policy Networks are not restricted to direct action selection. In model-based scenarios, the MLP can act as a high-level policy that produces a continuous hyperparameter z_t, which then parameterizes a trajectory optimizer such as MPC at each timestep. This approach, as in "Learning High-Level Policies for Model Predictive Control" (Song et al., 2020), demonstrates that:

  • The MLP ingests local environment observations (e.g., 10-dimensional difference in quadrotor/pendulum system states) and outputs z_t (e.g., predicted traversal time).
  • The output is immediately used as a parameter for constrained nonlinear MPC, with the low-level controller generating optimal actions based on both the learned variable and the environment state.
  • This integration yields greater accuracy and smoother control compared to fixed-parameter MPC while allowing real-time online adaptation (a sketch of this interface follows the list).
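
A hedged sketch of this per-timestep interface, reusing the parameters returned by fit_high_level_mlp from Section 3; solve_mpc is a hypothetical caller-supplied solver, not an actual library call:

    def high_mpc_step(params, o_t, x_t, solve_mpc):
        """One control step: high-level MLP prediction followed by low-level MPC.

        params:    (A1, b1, A2, b2, A3, b3) weights of the trained high-level MLP
        o_t:       10-dim local observation (e.g. state difference to the gate)
        x_t:       full system state handed to the trajectory optimizer
        solve_mpc: caller-supplied solver, solve_mpc(state, z) -> first control input
        """
        A1, b1, A2, b2, A3, b3 = params
        h1 = np.maximum(0, A1 @ o_t + b1)             # forward pass of the high-level MLP
        h2 = np.maximum(0, A2 @ h1 + b2)
        z_t = (A3 @ h2 + b3).item()                   # e.g. predicted traversal time
        u_t = solve_mpc(x_t, z_t)                     # constrained nonlinear MPC, parameterized by z_t
        return u_t, z_t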

7. Implementation Practices and Considerations

Best practices for realizing performant MLP-Policy Networks include:

  • Preprocess input data (cropping, downsampling, grayscale conversion, temporal differencing) to reduce input dimensionality and encode motion cues; a preprocessing sketch follows this list.
  • Employ ReLU activations for stable gradient propagation.
  • Use output parameterizations that faithfully model the action space: Softmax for categorical, linear outputs for continuous variables.
  • Baseline subtraction for variance reduction is essential.
  • Learning rate selection is nontrivial: excessively high values yield instability; overly conservative rates slow progress.
  • Parallelized training via asynchronous actors (A3C) offers significant efficiency gains and decorrelated data.
  • Visualization of weights and activations provides diagnostic insight during training.
  • Systems targeting visual domains may benefit from convolutional preprocessing before the MLP.
  • In hybrid systems, treat the optimal high-level decision variables (e.g., z*) as fixed regression targets for the MLP, keeping the supervised fit independent of the trajectory optimizer (Phon-Amnuaisuk, 2018; Song et al., 2020).
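
As referenced in the first bullet above, a minimal sketch of Pong-style preprocessing (crop, downsample, grayscale conversion, temporal differencing); the 210×160 frame size and crop offsets follow the common Atari convention and are illustrative:

    def preprocess(frame, prev_x=None):
        """Convert a 210x160x3 RGB frame into a 6400-dim difference vector."""
        gray = frame.mean(axis=2)                      # grayscale conversion
        cropped = gray[35:195]                         # crop away the score bar and bottom border
        small = cropped[::2, ::2]                      # downsample to 80x80
        x = (small / 255.0).astype(np.float32).ravel() # normalize and flatten to 6400 dims
        diff = x if prev_x is None else x - prev_x     # temporal differencing encodes motion
        return diff, x                                 # feed diff to the policy; keep x for next step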
