
Relative Pearson Divergence PPO-RPE

Updated 30 June 2026
  • Relative Pearson Divergence (PPO-RPE) is a method that leverages relative density ratio estimation to achieve robust and symmetric policy regularization in reinforcement learning.
  • It replaces traditional asymmetric density ratio penalties with an α-relative Pearson divergence, offering smoother behavior even with small or non-overlapping distribution supports.
  • The technique enhances nonparametric estimation, adaptive thresholding, and stable policy updates, yielding improved performance in complex reinforcement learning tasks.

Relative Pearson Divergence (PPO-RPE) is a class of methods in reinforcement learning and statistical distribution comparison that leverages the relative Pearson f-divergence for robust, stable, and symmetric proximal regularization. The central concept is the replacement of standard density or policy ratio-based penalties with a divergence objective rooted in the relative density ratio between probability distributions, which exhibits smoother and more controlled behavior, especially under small or non-overlapping supports. Widely adopted under the rubric "PPO-RPE", this approach has demonstrable benefits in nonparametric estimation, regularized policy optimization, and adaptive thresholding in policy updates (Yamada et al., 2011, Kobayashi, 2020, Kobayashi, 2022).

1. Formulation of the Relative Pearson Divergence

Given two densities $p(x)$ (target) and $q(x)$ (reference) on $\mathbb{R}^d$, the $\alpha$-relative density ratio is defined as

$$r_\alpha(x) = \frac{p(x)}{q_\alpha(x)}, \qquad q_\alpha(x) = \alpha p(x) + (1-\alpha) q(x)$$

where $\alpha \in [0,1]$. For statistical comparison and policy analysis, $\alpha$ interpolates between full reliance on $q$ ($\alpha=0$) and $p$ ($\alpha=1$). The relative ratio $r_\alpha(x)$ is always bounded above by $1/\alpha$ for $\alpha > 0$ and is strictly smoother than the ordinary density ratio $p(x)/q(x)$, greatly improving numerical stability when $q(x)$ is small or vanishing.

The corresponding $\alpha$-relative Pearson divergence is

$$\mathrm{PE}_\alpha(p \,\|\, q) = \frac{1}{2}\, \mathbb{E}_{q_\alpha}\!\left[ \left( r_\alpha(x) - 1 \right)^2 \right]$$

This formulation can be seamlessly transferred to policy distributions in reinforcement learning, where $p$ may correspond to the updated policy and $q$ to the baseline policy (Yamada et al., 2011, Kobayashi, 2020, Kobayashi, 2022).
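As a small numerical illustration of the definitions above, the following sketch evaluates the relative ratio and the relative Pearson divergence for discrete distributions with partly non-overlapping supports. The function names `relative_ratio` and `relative_pearson` and the example probability vectors are assumptions chosen for illustration; they do not come from the cited works.

```python
import numpy as np

def relative_ratio(p, q, alpha):
    """alpha-relative ratio r_alpha = p / (alpha * p + (1 - alpha) * q)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return p / (alpha * p + (1.0 - alpha) * q)

def relative_pearson(p, q, alpha):
    """PE_alpha = 0.5 * E_{q_alpha}[(r_alpha - 1)^2] for discrete p and q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    q_alpha = alpha * p + (1.0 - alpha) * q
    mask = q_alpha > 0  # points with zero mixture mass contribute nothing
    r = p[mask] / q_alpha[mask]
    return 0.5 * np.sum(q_alpha[mask] * (r - 1.0) ** 2)

# Hypothetical example: q has no mass where p does (first entry), so the
# ordinary ratio p/q is infinite there, while r_alpha stays bounded by 1/alpha.
p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])

print(relative_ratio(p, q, alpha=0.5))    # [2. 1. 0.]
print(relative_pearson(p, q, alpha=0.5))  # 0.25
```

With $\alpha = 0.5$ the largest value of the relative ratio equals the bound $1/\alpha = 2$, even though the ordinary ratio diverges at the first point.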

2. Direct Relative Density-Ratio Estimation

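Rather than decomposing density ratio estimation into separate estimates for $p$ and $q$, PPO-RPE and related procedures utilize direct fitting of $r_\alpha$ from samples, often via regularized least-squares in a reproducing kernel Hilbert space (RKHS).

Given two sample sets $\{x_i\}_{i=1}^{n} \sim p$, $\{x'_j\}_{j=1}^{n'} \sim q$, and a kernel function $K(x, x')$, the empirical risk is

$$\hat{J}(\theta) = \frac{\alpha}{2n} \sum_{i=1}^{n} g(x_i;\theta)^2 + \frac{1-\alpha}{2n'} \sum_{j=1}^{n'} g(x'_j;\theta)^2 - \frac{1}{n} \sum_{i=1}^{n} g(x_i;\theta)$$

where $g(x;\theta) = \sum_{\ell=1}^{n} \theta_\ell K(x, x_\ell)$ parametrizes the relative density ratio. The optimization

$$\hat{\theta} = \arg\min_{\theta} \left[ \hat{J}(\theta) + \frac{\lambda}{2}\, \theta^\top \theta \right]$$

has a closed-form solution in RKHS. The estimator $\hat{r}_\alpha(x) = g(x;\hat{\theta})$ can be evaluated out-of-sample.

Empirical plug-in estimates for $\mathrm{PE}_\alpha$ include

  • $\widehat{\mathrm{PE}}_\alpha = -\frac{\alpha}{2n} \sum_{i=1}^{n} \hat{r}_\alpha(x_i)^2 - \frac{1-\alpha}{2n'} \sum_{j=1}^{n'} \hat{r}_\alpha(x'_j)^2 + \frac{1}{n} \sum_{i=1}^{n} \hat{r}_\alpha(x_i) - \frac{1}{2}$
  • $\widetilde{\mathrm{PE}}_\alpha = \frac{1}{2n} \sum_{i=1}^{n} \hat{r}_\alpha(x_i) - \frac{1}{2}$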

This approach, known as RuLSIF in the literature, attains the optimal nonparametric convergence rate, and in the parametric regime the asymptotic variance of the divergence estimator is independent of model complexity (Yamada et al., 2011).
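The closed-form fit can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a Gaussian kernel centred on the samples from $p$ and fixed values for the kernel width `sigma` and the regularizer `lam` (in practice these would be chosen by cross-validation). The function names `gaussian_kernel`, `rulsif_fit`, and `pe_estimates` are hypothetical and are not taken from any published implementation.

```python
import numpy as np

def gaussian_kernel(x, centers, sigma):
    """Gaussian kernel matrix between rows of x and rows of centers."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def rulsif_fit(x_p, x_q, alpha=0.5, sigma=1.0, lam=0.1):
    """Regularized least-squares fit of the alpha-relative ratio p / q_alpha."""
    centers = x_p                                  # assumption: centres at p-samples
    phi_p = gaussian_kernel(x_p, centers, sigma)   # shape (n, b)
    phi_q = gaussian_kernel(x_q, centers, sigma)   # shape (n', b)
    H = (alpha * phi_p.T @ phi_p / len(x_p)
         + (1.0 - alpha) * phi_q.T @ phi_q / len(x_q))
    h = phi_p.mean(axis=0)
    # Minimizer of 0.5 * theta^T H theta - h^T theta + 0.5 * lam * theta^T theta
    theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return lambda x: gaussian_kernel(x, centers, sigma) @ theta

def pe_estimates(r_hat, x_p, x_q, alpha):
    """The two plug-in estimates of the relative Pearson divergence."""
    rp, rq = r_hat(x_p), r_hat(x_q)
    pe_hat = (-alpha * np.mean(rp ** 2) / 2.0
              - (1.0 - alpha) * np.mean(rq ** 2) / 2.0
              + np.mean(rp) - 0.5)
    pe_tilde = 0.5 * np.mean(rp) - 0.5
    return pe_hat, pe_tilde

# Hypothetical usage: two unit-variance Gaussians with shifted means.
rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=(100, 1))
x_q = rng.normal(1.0, 1.0, size=(100, 1))
r_hat = rulsif_fit(x_p, x_q, alpha=0.5)
print(pe_estimates(r_hat, x_p, x_q, alpha=0.5))
```

The linear solve is well posed because the matrix `H` is positive semidefinite and the ridge term `lam` makes the system strictly positive definite.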

3. Application to Policy Regularization: PPO-RPE

In Proximal Policy Optimization (PPO), the standard practice regularizes the updated policy $\pi_\theta$ relative to a baseline $\pi_{\mathrm{old}}$ by constraining the density ratio $\rho = \pi_\theta(a \mid s) / \pi_{\mathrm{old}}(a \mid s)$ within a symmetric interval $[1-\epsilon, 1+\epsilon]$. However, $\rho$ is inherently asymmetric (it ranges over $(0, \infty)$ around $1$), leading to unbalanced regularization and unclear minimization targets (Kobayashi, 2020, Kobayashi, 2022).

PPO-RPE introduces the RPE divergence in policy update objectives:

  • The relative ratio for policies is $r_\alpha = \frac{\pi_\theta}{\alpha \pi_\theta + (1-\alpha)\pi_{\mathrm{old}}} = \frac{\rho}{\alpha\rho + 1 - \alpha}$, with $\alpha \in [0,1]$.
  • The RPE divergence between policies is

$$D_{\mathrm{RPE}}(\pi_\theta \,\|\, \pi_{\mathrm{old}}) = \frac{1}{2}\, \mathbb{E}_{a \sim \alpha\pi_\theta + (1-\alpha)\pi_{\mathrm{old}}}\!\left[ \left( r_\alpha - 1 \right)^2 \right]$$

  • The surrogate objective includes a penalty

$$C \, D_{\mathrm{RPE}}(\pi_\theta \,\|\, \pi_{\mathrm{old}}), \qquad C \ge 0$$

yielding a regularized loss

$$\mathcal{L}(\theta) = -\mathbb{E}_{\pi_{\mathrm{old}}}\!\left[ \rho A \right] + C \, D_{\mathrm{RPE}}(\pi_\theta \,\|\, \pi_{\mathrm{old}})$$

where $A$ is the advantage. This framework enables explicit control over update size, preserves symmetry in clipping thresholds with respect to $r_\alpha$, and yields a mathematically principled minimization target (Kobayashi, 2020).
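To make the penalty concrete, the divergence can be rewritten as an expectation under the baseline policy. Writing $\rho$ for the ordinary ratio of updated to baseline policy probability, the definition of the relative Pearson divergence in Section 1 gives the per-sample term $(1-\alpha)^2(\rho-1)^2 / \big(2(\alpha\rho + 1 - \alpha)\big)$. The sketch below uses that derived form; it is an illustration under this assumption and may differ in constants or arrangement from the expression used in the cited papers. The names `rpe_penalty` and `regularized_loss` and the coefficient argument `coef` are hypothetical.

```python
import numpy as np

def rpe_penalty(rho, alpha=0.5):
    """Per-sample RPE term under the baseline policy.

    Derived from 0.5 * E_{q_alpha}[(r_alpha - 1)^2] with
    r_alpha = rho / (alpha * rho + 1 - alpha).
    """
    rho = np.asarray(rho, dtype=float)
    return (1.0 - alpha) ** 2 * (rho - 1.0) ** 2 / (2.0 * (alpha * rho + 1.0 - alpha))

def regularized_loss(rho, adv, coef, alpha=0.5):
    """Negative surrogate return plus the weighted RPE penalty (sample mean)."""
    rho = np.asarray(rho, dtype=float)
    adv = np.asarray(adv, dtype=float)
    return np.mean(-rho * adv + coef * rpe_penalty(rho, alpha))

# Consistency check on a hypothetical two-action example:
# the baseline-expectation form matches the definition under q_alpha.
pi_new = np.array([0.2, 0.8])
pi_old = np.array([0.5, 0.5])
alpha = 0.5
rho = pi_new / pi_old
via_baseline = np.sum(pi_old * rpe_penalty(rho, alpha))
q_alpha = alpha * pi_new + (1.0 - alpha) * pi_old
via_definition = 0.5 * np.sum(q_alpha * (pi_new / q_alpha - 1.0) ** 2)
print(np.isclose(via_baseline, via_definition))  # True
```

The penalty is zero at $\rho = 1$ and grows on both sides, so it pulls the updated policy back toward the baseline regardless of the sign of the advantage.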

4. Symmetry, Threshold Design, and Adaptive Tuning

A critical property is the symmetry of the relative ratio:

  • For $\alpha = 1/2$, the relative ratio $r_\alpha = 2\rho/(1+\rho)$, written in terms of the ordinary policy ratio $\rho$, is symmetric about $1$; $r_\alpha(\rho) - 1 = -\left( r_\alpha(1/\rho) - 1 \right)$.
  • This symmetry allows clean, balanced regularization of excursions above and below $r_\alpha = 1$, independent of the reference policy.
  • The maximum possible deviation is $|r_\alpha - 1| \le 1$; thus, the error scale is task- and algorithm-agnostic (Kobayashi, 2022).

Thresholds are set symmetrically in the $r_\alpha$ domain, $r_\alpha \in [1-\epsilon, 1+\epsilon]$, while the raw ratio threshold,

$$\rho \in \left[ \frac{(1-\alpha)(1-\epsilon)}{1-\alpha(1-\epsilon)},\; \frac{(1-\alpha)(1+\epsilon)}{1-\alpha(1+\epsilon)} \right], \qquad \alpha(1+\epsilon) < 1$$

is asymmetric in $\rho$ but consistent with the geometry of density changes. Adaptive thresholds may be computed from a statistic that tracks the observed maximal deviation in $r_\alpha$, with smoothing hyperparameters providing numeric control (Kobayashi, 2022).
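The mapping between a symmetric threshold on the relative ratio and the implied bounds on the ordinary ratio follows from inverting $r_\alpha = \rho / (\alpha\rho + 1 - \alpha)$, which gives $\rho = (1-\alpha) r_\alpha / (1 - \alpha r_\alpha)$. The sketch below applies this inversion; the function name `ratio_bounds` and the value `eps = 0.2` are assumptions for illustration.

```python
def ratio_bounds(eps, alpha=0.5):
    """Map r_alpha in [1 - eps, 1 + eps] to bounds on the ordinary ratio rho."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    if alpha * (1.0 + eps) >= 1.0:
        raise ValueError("need alpha * (1 + eps) < 1 for a finite upper bound")

    def inverse(r):
        # Inverse of r = rho / (alpha * rho + 1 - alpha)
        return (1.0 - alpha) * r / (1.0 - alpha * r)

    return inverse(1.0 - eps), inverse(1.0 + eps)

lo, hi = ratio_bounds(eps=0.2, alpha=0.5)
print(lo, hi)   # approximately 0.667 and 1.5
print(lo * hi)  # approximately 1.0
```

For $\alpha = 1/2$ the two bounds are reciprocals of each other, so the interval is asymmetric in $\rho$ but symmetric in $\log \rho$.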

5. PPO-RPE Algorithmic Procedure

The iterative PPO-RPE algorithm proceeds as follows (Kobayashi, 2020, Kobayashi, 2022):

  • Collect trajectories under the baseline policy $\pi_{\mathrm{old}}$.
  • Compute advantage estimates $\hat{A}_t$ for each sample.
  • For each sample $(s_t, a_t)$:
    • Calculate $\rho_t$ and $r_\alpha$.
    • Set the penalty coefficient $C$ such that the regularized gradient vanishes at the threshold $r_\alpha = 1 \pm \epsilon$.
    • Modify the surrogate advantage to incorporate RPE penalty.
    • Apply gradient descent using the adjusted advantage.
  • Adaptively update the baseline policy and the threshold $\epsilon$.
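One way to read the coefficient-setting step is as a stationarity condition. Assume a per-sample objective of the form $\rho A - C\, d_\alpha(\rho)$, where $\rho$ is the ordinary policy ratio, $A$ the advantage, and $d_\alpha(\rho) = (1-\alpha)^2(\rho-1)^2 / \big(2(\alpha\rho + 1 - \alpha)\big)$ the per-sample RPE term derived from the Section 1 definition. Requiring the derivative in $\rho$ to vanish at the threshold gives $C = A / d_\alpha'(\rho_{\mathrm{thr}})$. The sketch below implements this reading; it is a derivation from the definitions in this article, not a transcription of the rule in the cited papers, and all function names are hypothetical.

```python
import numpy as np

def penalty_grad(rho, alpha=0.5):
    """Derivative of the per-sample RPE term d_alpha with respect to rho."""
    den = alpha * rho + 1.0 - alpha
    return 0.5 * (1.0 - alpha) ** 2 * (rho - 1.0) * (alpha * rho + 2.0 - alpha) / den ** 2

def threshold_ratio(adv, eps, alpha=0.5):
    """Ordinary-ratio threshold: upper bound if adv >= 0, lower bound otherwise."""
    r_thr = np.where(adv >= 0, 1.0 + eps, 1.0 - eps)
    return (1.0 - alpha) * r_thr / (1.0 - alpha * r_thr)

def penalty_coefficient(adv, eps, alpha=0.5):
    """Coefficient C for which d/d rho [rho * adv - C * d_alpha(rho)] = 0 at the threshold."""
    rho_thr = threshold_ratio(adv, eps, alpha)
    return adv / penalty_grad(rho_thr, alpha)

# Hypothetical advantages: one positive, one negative.
adv = np.array([1.0, -2.0])
coef = penalty_coefficient(adv, eps=0.2)
rho_thr = threshold_ratio(adv, eps=0.2)
print(coef)                                  # both entries are positive
print(adv - coef * penalty_grad(rho_thr))    # approximately [0, 0]
```

Because the per-sample term is convex in $\rho$, the penalized objective is concave, so the stationary point at the threshold is its maximizer: pushing the ratio past the threshold only lowers the objective.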

Empirical loss curves and score statistics indicate that PPO-RPE matches or outperforms standard PPO, especially in tasks where standard PPO's symmetric clipping leads to instability or premature convergence (Kobayashi, 2020, Kobayashi, 2022).

6. Theoretical Guarantees and Empirical Results

PPO-RPE provides several theoretical and practical benefits over standard PPO:

  • The update is always bounded at the symmetric thresholds for $r_\alpha$; gradients vanish there, giving explicit "proximal" control.
  • The divergence minimization has well-defined analytic properties: in nonparametric domains, the sup-norm and rate of convergence are controlled, and in parametric regimes, the asymptotic variance does not depend on the number of model parameters (Yamada et al., 2011).
  • The symmetric structure ensures balanced exploration and stability.

Empirical evaluation across control and robotic locomotion tasks has demonstrated:

  • PPO-RPE achieves superior stability in both low- and high-dimensional settings.
  • Adaptive thresholding further enhances learning on difficult, unstable tasks, exceeding the success rates of fixed-threshold or baseline PPO algorithms.
  • No task-specific tuning of $\epsilon$ is required for PPO-RPE-A (adaptive threshold), and the learned thresholds maintain policy divergence within safe bands (Kobayashi, 2022).

7. Applications and Extensions

Relative Pearson divergence methods underpin a range of statistical and learning machinery:

  • Two-sample homogeneity testing via permutation tests, with lower type I and type II errors in medium $\alpha$ regimes.
  • Inlier-based outlier detection using $\hat{r}_\alpha$ as a statistic, providing improvements in area under curve (AUC) measures in high-dimensional settings.
  • Transfer learning under covariate shift, where the use of relative instead of raw importance weights yields stabilized estimates and improved regression or classification performance.

Within reinforcement learning, the RPE divergence's unique properties facilitate more robust and explainable policy regularization, unifying f-divergence control with algorithm-adaptive constraints (Yamada et al., 2011, Kobayashi, 2020, Kobayashi, 2022).
