
Deep GP Proximal Policy Optimization

Updated 29 November 2025
  • The paper introduces GPPO, extending PPO with deep Gaussian processes to jointly approximate policy and value functions for improved uncertainty estimation.
  • It employs a Deep Sigma-Point Process variant to deterministically propagate uncertainty, enabling robust exploration in high-dimensional control tasks.
  • Empirical results on benchmarks like Walker2D and Humanoid reveal competitive performance and enhanced robustness under dynamic perturbations.

Deep Gaussian Process Proximal Policy Optimization (GPPO) is a scalable, model-free actor-critic reinforcement learning algorithm that employs Deep Gaussian Processes (DGPs) to jointly approximate the policy and value function. Unlike conventional deep neural networks, GPPO offers calibrated uncertainty estimates, facilitating safer and more effective exploration in high-dimensional continuous control environments. It incorporates the Deep Sigma-Point Process (DSPP) variant of DGPs, allowing deterministic propagation of uncertainty via learned quadrature (“sigma”) points and variational inference with inducing inputs. Empirical evaluations demonstrate that GPPO retains or improves upon the benchmark performance of Proximal Policy Optimization (PPO) while providing robust uncertainty-aware exploration strategies (Lende et al., 22 Nov 2025).

1. DGP Actor–Critic Architecture

GPPO replaces the standard neural-network-based actor and critic modules with two Deep Gaussian Processes. Each DGP comprises $L$ layers, where layer $\ell$ consists of $D_\ell$ independent Gaussian processes:

$$f_{\ell,i}(\cdot) \sim \text{GP}\big(m_{\ell,i}(\cdot), k_{\ell,i}(\cdot, \cdot)\big), \quad i = 1, \ldots, D_\ell.$$

For efficient approximation, each GP employs $M$ inducing inputs $Z_{\ell,i}$ and inducing outputs $U_{\ell,i}$, with standard GP priors and layerwise variational approximations:

$$p(U_{\ell,i} \mid Z_{\ell,i}) = \mathcal{N}\big(0, K_{\ell,i}(Z_{\ell,i}, Z_{\ell,i})\big), \qquad q(U_{\ell,i}) = \mathcal{N}(m_{\ell,i}, S_{\ell,i}).$$

A KL-regularized joint variational objective is constructed across layers:

$$\mathcal{L}_{\text{KL}} = \sum_{\ell} \sum_{i} D_{\text{KL}}\big( q(U_{\ell,i}) \,\|\, p(U_{\ell,i} \mid Z_{\ell-1}) \big).$$
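Each per-GP KL term has the standard closed form for Gaussians with a zero-mean prior. The following is a small NumPy sketch of that term (illustrative only, not the authors' implementation):

```python
import numpy as np

def kl_gaussian(m, S, K):
    """KL( N(m, S) || N(0, K) ) for one GP's M inducing outputs.

    m: (M,) variational mean, S: (M, M) variational covariance,
    K: (M, M) prior kernel matrix K(Z, Z). Standard closed form.
    """
    M = len(m)
    K_inv = np.linalg.inv(K)  # explicit inverse for clarity; a Cholesky solve is preferable in practice
    trace_term = np.trace(K_inv @ S)
    quad_term = m @ K_inv @ m
    logdet_term = np.linalg.slogdet(K)[1] - np.linalg.slogdet(S)[1]
    return 0.5 * (trace_term + quad_term - M + logdet_term)
```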

The algorithm employs a squared-exponential (RBF) kernel for each GP, with layer- and index-specific length-scales and output scales.
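For reference, the squared-exponential kernel with layer- and index-specific hyperparameters takes the standard form (the paper's exact parameterization, e.g. ARD length-scales, may differ):

$$k_{\ell,i}(x, x') = \sigma_{\ell,i}^2 \exp\!\left( -\frac{\lVert x - x' \rVert^2}{2\lambda_{\ell,i}^2} \right),$$

with output scale $\sigma_{\ell,i}^2$ and length-scale $\lambda_{\ell,i}$.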

The DSPP variant deterministically propagates $Q$ learned sigma points $(w^{(j)}, x^{(j)})$ through each layer, yielding at the output a mixture approximation:

$$p_{\text{dspp}}(y \mid x) = \sum_{j=1}^{Q} w^{(j)}\, \mathcal{N}\big(y \mid \mu^{(j)}(x), \sigma^{(j)2}(x)\big).$$

Each $w^{(j)}$ and $x^{(j)}$ is learned jointly with the variational parameters and kernel hyperparameters.
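To make the propagation concrete, the following is a minimal sketch of a DSPP-style forward pass: each of the $Q$ sigma-point paths is pushed deterministically through the layers and produces one mixture component at the output. The toy parametric layer is a stand-in for a sparse-GP layer and is purely an assumption for self-containedness, not the authors' architecture; tying one quadrature site to each path matches the $Q$-component mixture above for a single hidden layer, while deeper stacks may combine sites differently.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = 8  # number of learned sigma (quadrature) points

# Learned quadrature sites x^(j) and normalized weights w^(j); here they are
# randomly initialized, whereas in GPPO they are trained with all other parameters.
quad_sites = rng.normal(size=Q)
quad_weights = np.full(Q, 1.0 / Q)

def layer_moments(h, params):
    """Stand-in for a sparse-GP layer: returns a predictive mean and variance.

    A real layer would compute these from an RBF kernel, inducing inputs Z_{l,i}
    and the variational posterior q(U_{l,i}) = N(m_{l,i}, S_{l,i}).
    """
    W, b, log_var = params
    mean = h @ W + b
    var = np.exp(log_var) * np.ones_like(mean)
    return mean, var

def dspp_forward(x, hidden_layers, output_layer):
    """Deterministically propagate Q sigma-point paths and return the output
    mixture p_dspp(y|x) = sum_j w^(j) N(y | mu^(j)(x), sigma^(j)2(x))."""
    paths = [x] * Q
    for params in hidden_layers:
        new_paths = []
        for j, h in enumerate(paths):
            mu, var = layer_moments(h, params)
            # Deterministic "sample": mean shifted by the j-th quadrature site.
            new_paths.append(mu + quad_sites[j] * np.sqrt(var))
        paths = new_paths
    mus, variances = zip(*(layer_moments(h, output_layer) for h in paths))
    return quad_weights, np.stack(mus), np.stack(variances)

# Example: 2-D state, one hidden layer of width 3, scalar output head.
hidden = [(rng.normal(size=(2, 3)), np.zeros(3), -1.0)]
output = (rng.normal(size=(3, 1)), np.zeros(1), -1.0)
w, mu, var = dspp_forward(np.array([0.5, -0.2]), hidden, output)
```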

2. GPPO Objective Formulation

Building on the clipped surrogate objective of PPO, the GPPO objective maximizes the expected advantage while regularizing policy entropy and enforcing Bayesian consistency:

$$\mathcal{L}_{\text{GPPO}}(\phi) = \hat{\mathbb{E}}_t \left[ \min\big( r_t(\phi)\hat{A}_t,\; \operatorname{clip}(r_t(\phi), 1-\epsilon, 1+\epsilon)\hat{A}_t \big) - c_1 \left( \frac{(V_t^{\text{target}} - \mu(s_t))^2}{2\sigma^2(s_t)} + \frac{1}{2}\log \sigma^2(s_t) \right) + c_2\, S[\pi_\phi](s_t) \right] - \beta \sum_{\ell,i} D_{\text{KL}}\big( q(U_{\ell,i}) \,\|\, p(U_{\ell,i} \mid Z_{\ell-1}) \big).$$

Here:

  • $r_t(\phi) = \pi_\phi(a_t \mid s_t) / \pi_{\text{old}}(a_t \mid s_t)$ is the policy likelihood ratio,
  • $\hat{A}_t$ is the advantage estimate (from sampled DGP value functions),
  • $S[\pi_\phi](s_t)$ is the entropy bonus, and
  • The value head log-likelihood utilizes DSPP mixture scoring.

This formulation preserves the stability and trust-region behavior of PPO, while enforcing a proper Bayesian scoring rule for regression and KL regularization for variational posteriors.
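A compact sketch of this objective in PyTorch, written as a loss to minimize (the negated objective), can look as follows. It follows the single-Gaussian value term displayed above; the paper's value head actually uses DSPP mixture scoring, and the tensor names, default coefficients, and the precomputed `kl_sum` are assumptions rather than the authors' API.

```python
import torch

def gppo_loss(new_logp, old_logp, advantages, value_mu, value_var, value_targets,
              entropy, kl_sum, clip_eps=0.2, c1=0.5, c2=0.01, beta=1e-3):
    """Negated GPPO objective: clipped surrogate, Gaussian value NLL,
    entropy bonus, and the KL regularizer over variational posteriors."""
    ratio = torch.exp(new_logp - old_logp)                     # r_t(phi)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    # Gaussian negative log-likelihood of the value targets under (mu(s_t), sigma^2(s_t)).
    value_nll = (value_targets - value_mu) ** 2 / (2.0 * value_var) \
                + 0.5 * torch.log(value_var)
    # Maximizing L_GPPO is equivalent to minimizing its negation.
    return -(surrogate - c1 * value_nll + c2 * entropy).mean() + beta * kl_sum
```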

3. Training Procedure and Workflow

GPPO employs an iterative two-stage algorithm for policy improvement and value estimation:

  1. Initialization:
    • Initialize the DGP parameters $\phi$, including kernel hyperparameters, inducing point locations, quadrature points, and variational parameters.
    • Copy the parameters to $\phi_{\text{old}}$ for rollout sampling.
  2. Rollout Collection:
    • For each time step $t$, sample $a_t \sim \pi(\cdot \mid s_t; \phi_{\text{old}})$ using the GP policy outputs (mixture).
    • Sample $\tilde{V}(s_t) \sim p_{\text{dspp}}(V_{\text{target}} \mid s_t; \phi)$ from the DGP value head.
    • Log the transition $(s_t, a_t, r_{t+1}, \text{done}, \log \pi(a_t \mid s_t; \phi_{\text{old}}), \tilde{V}(s_t))$.
  3. Advantage Computation:
    • Compute the advantage $\hat{A}_t$ via Generalized Advantage Estimation (GAE), leveraging samples from the GP value head (see the sketch at the end of this section).
  4. Optimization:
    • For $K$ epochs over minibatches, maximize $\mathcal{L}_{\text{GPPO}}(\phi)$ via Adam optimization applied to gradient estimates.
  5. Update Reference:
    • After each update cycle, propagate the parameters: $\phi_{\text{old}} \leftarrow \phi$.

This architecture supports parallelized mini-batch training and empirical evaluation on continuous control benchmarks.
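The advantage computation in step 3 is the standard GAE recursion; a brief sketch is shown below, where `values` holds the sampled DGP value predictions $\tilde{V}(s_t)$ logged during the rollout and the discount parameters are assumed defaults rather than the paper's settings.

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout of length T."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - float(dones[t])
        # TD residual: delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        advantages[t] = gae
    returns = advantages + np.asarray(values)  # targets for the value head
    return advantages, returns
```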

4. Uncertainty Quantification and Exploration Dynamics

GPPO’s uncertainty estimates originate from the predictive variance of the DSPP heads. For any input $s$:

$$\mu(s) = \sum_{j} w^{(j)} \mu^{(j)}(s), \qquad \sigma^2(s) = \sum_{j} w^{(j)} \left[ \big(\mu^{(j)}(s) - \mu(s)\big)^2 + \sigma^{(j)2}(s) \right].$$

The policy head’s predictive variance $\sigma^2(a \mid s)$ corresponds to the entropy of $\pi_\phi$, incentivizing exploration in regions of high uncertainty. The value head’s sampled predictions randomize the advantage estimates, analogous to Thompson sampling within value estimation.

This approach yields calibrated uncertainty, propagated through both decision-making and value assessment, enabling safer and more adaptive exploration when environment dynamics are uncertain or nonstationary.
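As a concrete illustration, the mixture moments above and a Thompson-style value draw can be computed as follows (a minimal NumPy sketch, assuming nothing beyond the mixture form itself):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixture_moments(w, mu, var):
    """Mean and variance of the DSPP predictive mixture sum_j w^(j) N(mu^(j), sigma^(j)2)."""
    w, mu, var = map(np.asarray, (w, mu, var))
    mean = np.sum(w * mu)
    total_var = np.sum(w * ((mu - mean) ** 2 + var))  # law of total variance
    return mean, total_var

def sample_value(w, mu, var):
    """Thompson-style draw of V~(s): pick a mixture component, then sample from it."""
    j = rng.choice(len(w), p=np.asarray(w) / np.sum(w))
    return rng.normal(mu[j], np.sqrt(var[j]))
```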

5. Computational Scalability and DSPP Efficiency

Classic GP inference scales cubically in the number of data points, $\mathcal{O}(N^3)$. GPPO employs variational inference with inducing points ($M \ll N$), reducing the complexity of each GP block to $\mathcal{O}(N M^2)$ per update and the total computational cost to $\mathcal{O}(N M^2 \sum_\ell D_\ell)$ for $L$ layers.

Typical parameter settings in the experimental evaluation use $M \in [128, 256]$ inducing points and $Q \approx 8$ sigma points, which suffice for accurate approximation on the benchmark tasks. Memory scales as $\mathcal{O}(N M \sum_{\ell} D_\ell)$.
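To make the scaling concrete, a back-of-the-envelope comparison (with an assumed, illustrative batch size $N$):

```python
# Rough operation counts for one GP block (arbitrary units).
N, M = 4096, 256             # assumed batch size; inducing points within the stated range
exact_cost = N ** 3          # O(N^3) exact GP inference
sparse_cost = N * M ** 2     # O(N M^2) variational inducing-point inference
print(f"exact: {exact_cost:.2e}, sparse: {sparse_cost:.2e}, "
      f"speedup: {exact_cost / sparse_cost:.0f}x")  # ratio is (N/M)^2 = 256x here
```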

Computationally, GPPO incurs approximately $7$–$8\times$ overhead relative to PPO per environment step ($\approx 13$ ms vs. $2$ ms for action inference) and roughly $4\times$ overhead per training update, remaining practical on modern consumer GPUs.

6. Empirical Results and Benchmark Comparisons

GPPO was empirically evaluated on the Gymnasium Walker2D-v5 and Humanoid-v5 benchmarks, using training durations of 10k (Walker2D) and 15k (Humanoid) episodes with three random seeds; interquartile mean (IQM) returns are reported with $95\%$ bootstrap confidence intervals.

Walker2D Results:

  • Final IQM Returns: PPO $742.16$ (CI $[686.7, 1081.7]$), GPPO $2525.06$ (CI $[2155.2, 2904.0]$).
  • Evaluation (mean $\pm$ std over $100$ episodes): PPO $654.8 \pm 117.3$, GPPO $1611.4 \pm 686.1$.

Humanoid Results:

  • Final IQM Returns: PPO $349.64$ (CI $[321.4, 377.5]$), GPPO $248.43$ (CI $[244.1, 252.6]$).
  • Evaluation: PPO $300.4 \pm 78.1$, GPPO $287.4 \pm 20.9$.

GPPO demonstrates improved robustness under dynamics perturbations (e.g., modified gravity settings), outperforming PPO in several perturbed scenarios. The increased training time is offset by superior uncertainty quantification and exploration, especially in environments with complex or variable dynamics.

7. Methodological Significance and Application Scope

GPPO systematically extends PPO with fully Bayesian actor-critic learning via scalable DGPs. The method combines tractable approximation of model uncertainty, calibrated exploration, scalability through variational inducing points, and integration with high-performance RL benchmarks. A plausible implication is the suitability of GPPO for safety-critical control domains, where Bayesian exploration mechanisms are integral. The use of DSPP as a deterministic mixture surrogate for Monte Carlo sampling offers reduced variance and tractable learning dynamics in reinforcement learning pipelines.

Overall, GPPO establishes the feasibility of uncertainty-aware actor-critic reinforcement learning at scale, with competitive or superior empirical results compared to conventional deep neural policy architectures (Lende et al., 22 Nov 2025).
