
Deep GP Proximal Policy Optimization

Updated 29 November 2025
  • The paper introduces GPPO, extending PPO with deep Gaussian processes to jointly approximate policy and value functions for improved uncertainty estimation.
  • It employs a Deep Sigma-Point Process variant to deterministically propagate uncertainty, enabling robust exploration in high-dimensional control tasks.
  • Empirical results on benchmarks like Walker2D and Humanoid reveal competitive performance and enhanced robustness under dynamic perturbations.

Deep Gaussian Process Proximal Policy Optimization (GPPO) is a scalable, model-free actor-critic reinforcement learning algorithm that employs Deep Gaussian Processes (DGPs) to jointly approximate the policy and value function. Unlike conventional deep neural networks, GPPO offers calibrated uncertainty estimates, facilitating safer and more effective exploration in high-dimensional continuous control environments. It incorporates the Deep Sigma-Point Process (DSPP) variant of DGPs, allowing deterministic propagation of uncertainty via learned quadrature (“sigma”) points and variational inference with inducing inputs. Empirical evaluations demonstrate that GPPO retains or improves upon the benchmark performance of Proximal Policy Optimization (PPO) while providing robust uncertainty-aware exploration strategies (Lende et al., 22 Nov 2025).

1. DGP Actor–Critic Architecture

GPPO replaces the standard neural-network-based actor and critic modules with two Deep Gaussian Processes. Each DGP comprises $L$ layers, where layer $\ell$ consists of $D_\ell$ independent Gaussian processes:

$$f_{\ell,i}(\cdot) \sim \text{GP}\big(m_{\ell,i}(\cdot), k_{\ell,i}(\cdot, \cdot)\big), \quad i = 1, \ldots, D_\ell.$$

For efficient approximation, each GP employs $M$ inducing inputs $Z_{\ell,i}$ and inducing outputs $U_{\ell,i}$, with standard GP priors and layerwise variational approximations:

$$p(U_{\ell,i} \mid Z_{\ell,i}) = \mathcal{N}\big(0, K_{\ell,i}(Z_{\ell,i}, Z_{\ell,i})\big), \qquad q(U_{\ell,i}) = \mathcal{N}(m_{\ell,i}, S_{\ell,i}).$$

A KL-regularized joint variational objective is constructed across layers:

$$\mathcal{L}_{\text{KL}} = \sum_{\ell} \sum_{i} D_{\text{KL}}\big( q(U_{\ell,i}) \,\|\, p(U_{\ell,i} \mid Z_{\ell-1}) \big).$$
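Each per-GP KL term has the standard closed form for Gaussians with a zero-mean prior. The following is a small NumPy sketch of that term (illustrative only, not the authors' implementation):

```python
import numpy as np

def kl_gaussian(m, S, K):
    """KL( N(m, S) || N(0, K) ) for one GP's M inducing outputs.

    m: (M,) variational mean, S: (M, M) variational covariance,
    K: (M, M) prior kernel matrix K(Z, Z). Standard closed form.
    """
    M = len(m)
    K_inv = np.linalg.inv(K)  # explicit inverse for clarity; a Cholesky solve is preferable in practice
    trace_term = np.trace(K_inv @ S)
    quad_term = m @ K_inv @ m
    logdet_term = np.linalg.slogdet(K)[1] - np.linalg.slogdet(S)[1]
    return 0.5 * (trace_term + quad_term - M + logdet_term)
```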

The algorithm employs a squared-exponential (RBF) kernel for each GP, with layer- and index-specific length-scales and output scales.
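For reference, the squared-exponential kernel with layer- and index-specific hyperparameters takes the standard form (the paper's exact parameterization, e.g. ARD length-scales, may differ):

$$k_{\ell,i}(x, x') = \sigma_{\ell,i}^2 \exp\!\left( -\frac{\lVert x - x' \rVert^2}{2\lambda_{\ell,i}^2} \right),$$

with output scale $\sigma_{\ell,i}^2$ and length-scale $\lambda_{\ell,i}$.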

The DSPP variant deterministically propagates $Q$ learned sigma points $(w^{(j)}, x^{(j)})$ through each layer, yielding at the output a mixture approximation:

$$p_{\text{dspp}}(y \mid x) = \sum_{j=1}^{Q} w^{(j)}\, \mathcal{N}\big(y \mid \mu^{(j)}(x), \sigma^{(j)2}(x)\big).$$

Each $w^{(j)}$ and $x^{(j)}$ is learned jointly with the variational parameters and kernel hyperparameters.
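To make the propagation concrete, the following is a minimal sketch of a DSPP-style forward pass: each of the $Q$ sigma-point paths is pushed deterministically through the layers and produces one mixture component at the output. The toy parametric layer is a stand-in for a sparse-GP layer and is purely an assumption for self-containedness, not the authors' architecture; tying one quadrature site to each path matches the $Q$-component mixture above for a single hidden layer, while deeper stacks may combine sites differently.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = 8  # number of learned sigma (quadrature) points

# Learned quadrature sites x^(j) and normalized weights w^(j); here they are
# randomly initialized, whereas in GPPO they are trained with all other parameters.
quad_sites = rng.normal(size=Q)
quad_weights = np.full(Q, 1.0 / Q)

def layer_moments(h, params):
    """Stand-in for a sparse-GP layer: returns a predictive mean and variance.

    A real layer would compute these from an RBF kernel, inducing inputs Z_{l,i}
    and the variational posterior q(U_{l,i}) = N(m_{l,i}, S_{l,i}).
    """
    W, b, log_var = params
    mean = h @ W + b
    var = np.exp(log_var) * np.ones_like(mean)
    return mean, var

def dspp_forward(x, hidden_layers, output_layer):
    """Deterministically propagate Q sigma-point paths and return the output
    mixture p_dspp(y|x) = sum_j w^(j) N(y | mu^(j)(x), sigma^(j)2(x))."""
    paths = [x] * Q
    for params in hidden_layers:
        new_paths = []
        for j, h in enumerate(paths):
            mu, var = layer_moments(h, params)
            # Deterministic "sample": mean shifted by the j-th quadrature site.
            new_paths.append(mu + quad_sites[j] * np.sqrt(var))
        paths = new_paths
    mus, variances = zip(*(layer_moments(h, output_layer) for h in paths))
    return quad_weights, np.stack(mus), np.stack(variances)

# Example: 2-D state, one hidden layer of width 3, scalar output head.
hidden = [(rng.normal(size=(2, 3)), np.zeros(3), -1.0)]
output = (rng.normal(size=(3, 1)), np.zeros(1), -1.0)
w, mu, var = dspp_forward(np.array([0.5, -0.2]), hidden, output)
```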

2. GPPO Objective Formulation

Building on the clipped surrogate objective of PPO, the GPPO objective maximizes the expected advantage while regularizing policy entropy and enforcing Bayesian consistency:

$$\mathcal{L}_{\text{GPPO}}(\phi) = \hat{\mathbb{E}}_t \left[ \min\big( r_t(\phi)\hat{A}_t,\; \operatorname{clip}(r_t(\phi), 1-\epsilon, 1+\epsilon)\hat{A}_t \big) - c_1 \left( \frac{(V_t^{\text{target}} - \mu(s_t))^2}{2\sigma^2(s_t)} + \frac{1}{2}\log \sigma^2(s_t) \right) + c_2\, S[\pi_\phi](s_t) \right] - \beta \sum_{\ell,i} D_{\text{KL}}\big( q(U_{\ell,i}) \,\|\, p(U_{\ell,i} \mid Z_{\ell-1}) \big).$$

Here:

  • $r_t(\phi) = \pi_\phi(a_t \mid s_t) / \pi_{\text{old}}(a_t \mid s_t)$ is the policy likelihood ratio,
  • $\hat{A}_t$ is the advantage estimate (from sampled DGP value functions),
  • $S[\pi_\phi](s_t)$ is the entropy bonus, and
  • The value head log-likelihood utilizes DSPP mixture scoring.

This formulation preserves the stability and trust-region behavior of PPO, while enforcing a proper Bayesian scoring rule for regression and KL regularization for variational posteriors.
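A compact sketch of this objective in PyTorch, written as a loss to minimize (the negated objective), can look as follows. It follows the single-Gaussian value term displayed above; the paper's value head actually uses DSPP mixture scoring, and the tensor names, default coefficients, and the precomputed `kl_sum` are assumptions rather than the authors' API.

```python
import torch

def gppo_loss(new_logp, old_logp, advantages, value_mu, value_var, value_targets,
              entropy, kl_sum, clip_eps=0.2, c1=0.5, c2=0.01, beta=1e-3):
    """Negated GPPO objective: clipped surrogate, Gaussian value NLL,
    entropy bonus, and the KL regularizer over variational posteriors."""
    ratio = torch.exp(new_logp - old_logp)                     # r_t(phi)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    # Gaussian negative log-likelihood of the value targets under (mu(s_t), sigma^2(s_t)).
    value_nll = (value_targets - value_mu) ** 2 / (2.0 * value_var) \
                + 0.5 * torch.log(value_var)
    # Maximizing L_GPPO is equivalent to minimizing its negation.
    return -(surrogate - c1 * value_nll + c2 * entropy).mean() + beta * kl_sum
```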

3. Training Procedure and Workflow

GPPO employs an iterative two-stage algorithm for policy improvement and value estimation:

  1. Initialization:
    • Initialize the DGP parameters $\phi$, including kernel hyperparameters, inducing point locations, quadrature points, and variational parameters.
    • Copy the parameters to $\phi_{\text{old}}$ for rollout sampling.
  2. Rollout Collection:
    • For each time step $t$, sample $a_t \sim \pi(\cdot \mid s_t; \phi_{\text{old}})$ using the GP policy outputs (mixture).
    • Sample $\tilde{V}(s_t) \sim p_{\text{dspp}}(V_{\text{target}} \mid s_t; \phi)$ from the DGP value head.
    • Log the transition $(s_t, a_t, r_{t+1}, \text{done}, \log \pi(a_t \mid s_t; \phi_{\text{old}}), \tilde{V}(s_t))$.
  3. Advantage Computation:
    • Compute the advantage $\hat{A}_t$ via Generalized Advantage Estimation (GAE), leveraging samples from the GP value head (see the sketch at the end of this section).
  4. Optimization:
    • For $K$ epochs over minibatches, maximize $\mathcal{L}_{\text{GPPO}}(\phi)$ via Adam optimization applied to gradient estimates.
  5. Update Reference:
    • After each update cycle, propagate the parameters: $\phi_{\text{old}} \leftarrow \phi$.

This architecture supports parallelized mini-batch training and empirical evaluation on continuous control benchmarks.
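The advantage computation in step 3 is the standard GAE recursion; a brief sketch is shown below, where `values` holds the sampled DGP value predictions $\tilde{V}(s_t)$ logged during the rollout and the discount parameters are assumed defaults rather than the paper's settings.

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout of length T."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - float(dones[t])
        # TD residual: delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        advantages[t] = gae
    returns = advantages + np.asarray(values)  # targets for the value head
    return advantages, returns
```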

4. Uncertainty Quantification and Exploration Dynamics

GPPO’s uncertainty estimates originate from the predictive variance of the DSPP heads. For any input $s$:

$$\mu(s) = \sum_{j} w^{(j)} \mu^{(j)}(s), \qquad \sigma^2(s) = \sum_{j} w^{(j)} \left[ \big(\mu^{(j)}(s) - \mu(s)\big)^2 + \sigma^{(j)2}(s) \right].$$

The policy head’s predictive variance $\sigma^2(a \mid s)$ corresponds to the entropy of $\pi_\phi$, incentivizing exploration in regions of high uncertainty. The value head’s sampled predictions randomize the advantage estimates, analogous to Thompson sampling within value estimation.

This approach yields calibrated uncertainty, propagated through both decision-making and value assessment, enabling safer and more adaptive exploration when environment dynamics are uncertain or nonstationary.
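As a concrete illustration, the mixture moments above and a Thompson-style value draw can be computed as follows (a minimal NumPy sketch, assuming nothing beyond the mixture form itself):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixture_moments(w, mu, var):
    """Mean and variance of the DSPP predictive mixture sum_j w^(j) N(mu^(j), sigma^(j)2)."""
    w, mu, var = map(np.asarray, (w, mu, var))
    mean = np.sum(w * mu)
    total_var = np.sum(w * ((mu - mean) ** 2 + var))  # law of total variance
    return mean, total_var

def sample_value(w, mu, var):
    """Thompson-style draw of V~(s): pick a mixture component, then sample from it."""
    j = rng.choice(len(w), p=np.asarray(w) / np.sum(w))
    return rng.normal(mu[j], np.sqrt(var[j]))
```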

5. Computational Scalability and DSPP Efficiency

Classic GP inference scales cubically in the number of data points, $\mathcal{O}(N^3)$. GPPO employs variational inference with inducing points ($M \ll N$), reducing the complexity of each GP block to $\mathcal{O}(N M^2)$ per update and the total computational cost to $\mathcal{O}(N M^2 \sum_\ell D_\ell)$ for $L$ layers.

Typical parameter settings in the experimental evaluation use $M \in [128, 256]$ inducing points and $Q \approx 8$ sigma points, which suffice for accurate approximation on the benchmark tasks. Memory scales as $\mathcal{O}(N M \sum_{\ell} D_\ell)$.
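To make the scaling concrete, a back-of-the-envelope comparison (with an assumed, illustrative batch size $N$):

```python
# Rough operation counts for one GP block (arbitrary units).
N, M = 4096, 256             # assumed batch size; inducing points within the stated range
exact_cost = N ** 3          # O(N^3) exact GP inference
sparse_cost = N * M ** 2     # O(N M^2) variational inducing-point inference
print(f"exact: {exact_cost:.2e}, sparse: {sparse_cost:.2e}, "
      f"speedup: {exact_cost / sparse_cost:.0f}x")  # ratio is (N/M)^2 = 256x here
```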

Computationally, GPPO incurs approximately $7$–$8\times$ overhead relative to PPO per environment step ($\approx 13$ ms vs. $2$ ms for action inference) and roughly $4\times$ overhead per training update, remaining practical on modern consumer GPUs.

6. Empirical Results and Benchmark Comparisons

GPPO was empirically evaluated on the Gymnasium Walker2D-v5 and Humanoid-v5 benchmarks, using training durations of 10k (Walker2D) and 15k (Humanoid) episodes with three random seeds; interquartile mean (IQM) returns are reported with $95\%$ bootstrap confidence intervals.

Walker2D Results:

  • Final IQM Returns: PPO $742.16$ (CI $[686.7, 1081.7]$), GPPO $2525.06$ (CI $[2155.2, 2904.0]$).
  • Evaluation (mean $\pm$ std over $100$ episodes): PPO $654.8 \pm 117.3$, GPPO $1611.4 \pm 686.1$.

Humanoid Results:

  • Final IQM Returns: PPO $349.64$ (CI $[321.4, 377.5]$), GPPO $248.43$ (CI $[244.1, 252.6]$).
  • Evaluation: PPO $300.4 \pm 78.1$, GPPO $287.4 \pm 20.9$.

GPPO demonstrates improved robustness under dynamics perturbations (e.g., modified gravity settings), outperforming PPO in several perturbed scenarios. The increased training time is offset by superior uncertainty quantification and exploration, especially in environments with complex or variable dynamics.

7. Methodological Significance and Application Scope

GPPO systematically extends PPO with fully Bayesian actor-critic learning via scalable DGPs. The method combines tractable approximation of model uncertainty, calibrated exploration, scalability through variational inducing points, and integration with high-performance RL benchmarks. A plausible implication is the suitability of GPPO for safety-critical control domains, where Bayesian exploration mechanisms are integral. The use of DSPP as a deterministic mixture surrogate for Monte Carlo sampling offers reduced variance and tractable learning dynamics in reinforcement learning pipelines.

Overall, GPPO establishes the feasibility of uncertainty-aware actor-critic reinforcement learning at scale, with competitive or superior empirical results compared to conventional deep neural policy architectures (Lende et al., 22 Nov 2025).
