Relative Pearson Divergence PPO-RPE

Updated 30 June 2026

Relative Pearson Divergence (PPO-RPE) is a method that leverages relative density ratio estimation to achieve robust and symmetric policy regularization in reinforcement learning.
It replaces traditional asymmetric density ratio penalties with an α-relative Pearson divergence, offering smoother behavior even with small or non-overlapping distribution supports.
The technique enhances nonparametric estimation, adaptive thresholding, and stable policy updates, yielding improved performance in complex reinforcement learning tasks.

Relative Pearson Divergence (PPO-RPE) is a class of methods in reinforcement learning and statistical distribution comparison that leverages the relative Pearson f-divergence for robust, stable, and symmetric proximal regularization. The central concept is the replacement of standard density or policy ratio-based penalties with a divergence objective rooted in the relative density ratio between probability distributions, which exhibits smoother and more controlled behavior, especially under small or non-overlapping supports. Widely adopted under the rubric "PPO-RPE", this approach has demonstrable benefits in nonparametric estimation, regularized policy optimization, and adaptive thresholding in policy updates (Yamada et al., 2011, Kobayashi, 2020, Kobayashi, 2022).

1. Formulation of the Relative Pearson Divergence

Given two densities $p(x)$ (target) and $q(x)$ (reference) on $\mathbb{R}^d$ , the $\alpha$ -relative density ratio is defined as

$r_\alpha(x) = \frac{p(x)}{q_\alpha(x)}, \qquad q_\alpha(x) = \alpha p(x) + (1-\alpha) q(x)$

where $\alpha \in [0,1]$. For statistical comparison and policy analysis, $\alpha$ interpolates between full reliance on $q$ ($\alpha = 0$) and $p$ ($\alpha = 1$). The relative ratio $r_\alpha(x)$ is always bounded above by $1/\alpha$ for $\alpha > 0$ and is strictly smoother than the ordinary density ratio $p(x)/q(x)$, greatly improving numerical stability when $q(x)$ is small or vanishing.
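The boundedness of the relative ratio can be checked numerically. The following is a minimal sketch, assuming density values given on a small grid; the function name `relative_ratio` and the example numbers are illustrative and not taken from the cited papers.

```python
import numpy as np

def relative_ratio(p, q, alpha):
    """alpha-relative density ratio p / (alpha*p + (1-alpha)*q), evaluated pointwise."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mix = alpha * p + (1.0 - alpha) * q
    # Where both densities are zero the ratio is undefined; return 0 there by convention.
    return np.divide(p, mix, out=np.zeros_like(mix), where=mix > 0)

# Illustrative density values at three points; q nearly vanishes at the last one.
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.6, 0.4, 1e-12])

plain = p / q                      # ordinary ratio: explodes where q is tiny
rel = relative_ratio(p, q, 0.5)    # relative ratio: stays below 1/alpha = 2
```

With $\alpha = 0$ the function reduces to the ordinary ratio, so the same code covers both cases.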

The corresponding $\alpha$-relative Pearson divergence is

$\mathrm{PE}_\alpha(p \,\|\, q) = \frac{1}{2}\, \mathbb{E}_{x \sim q_\alpha}\left[ \left( r_\alpha(x) - 1 \right)^2 \right]$

This formulation carries over directly to policy distributions in reinforcement learning, where $p$ may correspond to the updated policy and $q$ to the baseline policy (Yamada et al., 2011, Kobayashi, 2020, Kobayashi, 2022).
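For discrete distributions the divergence can be evaluated exactly as a sum. The sketch below assumes probability vectors over a shared finite set; `relative_pearson` is an illustrative name. Because $\mathbb{E}_{q_\alpha}[r_\alpha] = 1$ and $\mathbb{E}_{q_\alpha}[r_\alpha^2] = \mathbb{E}_p[r_\alpha] \le 1/\alpha$, the divergence cannot exceed $(1/\alpha - 1)/2$, even when the supports do not overlap.

```python
import numpy as np

def relative_pearson(p, q, alpha):
    """PE_alpha(p || q) = 0.5 * sum_x q_alpha(x) * (r_alpha(x) - 1)^2 for discrete p, q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mix = alpha * p + (1.0 - alpha) * q
    # Points where the mixture has zero mass contribute nothing to the sum.
    r = np.divide(p, mix, out=np.zeros_like(mix), where=mix > 0)
    return 0.5 * np.sum(mix * (r - 1.0) ** 2)

p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])       # support disjoint from p

same = relative_pearson(p, p, 0.5)       # identical distributions give 0
disjoint = relative_pearson(p, q, 0.5)   # 0.5, the upper bound (1/alpha - 1)/2
```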

2. Direct Relative Density-Ratio Estimation

Rather than decomposing density ratio estimation into separate estimates for $p(x)$ and $q(x)$, PPO-RPE and related procedures fit $r_\alpha(x)$ directly from samples, often via regularized least squares in a reproducing kernel Hilbert space (RKHS).

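Given two sample sets $\{x_i\}_{i=1}^{n} \sim p$ and $\{x'_j\}_{j=1}^{n'} \sim q$, and a kernel function $K(x, x')$, the ratio is modeled as $g(x; \theta) = \sum_{\ell=1}^{n} \theta_\ell K(x, x_\ell)$ and the empirical risk is

$\hat{J}(\theta) = \frac{\alpha}{2n} \sum_{i=1}^{n} g(x_i; \theta)^2 + \frac{1-\alpha}{2n'} \sum_{j=1}^{n'} g(x'_j; \theta)^2 - \frac{1}{n} \sum_{i=1}^{n} g(x_i; \theta)$

where $\theta$ parametrizes the relative density ratio. The optimization

$\hat{\theta} = \arg\min_{\theta} \left[ \hat{J}(\theta) + \frac{\lambda}{2} \|\theta\|^2 \right]$

has a closed-form solution in RKHS. The estimator $\hat{r}_\alpha(x) = g(x; \hat{\theta})$ can be evaluated out-of-sample.

Empirical plug-in estimates for $\mathrm{PE}_\alpha$ include

$\widehat{\mathrm{PE}}_\alpha = -\frac{\alpha}{2n} \sum_{i=1}^{n} \hat{r}_\alpha(x_i)^2 - \frac{1-\alpha}{2n'} \sum_{j=1}^{n'} \hat{r}_\alpha(x'_j)^2 + \frac{1}{n} \sum_{i=1}^{n} \hat{r}_\alpha(x_i) - \frac{1}{2}$
$\widetilde{\mathrm{PE}}_\alpha = \frac{1}{2n} \sum_{i=1}^{n} \hat{r}_\alpha(x_i) - \frac{1}{2}$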
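The direct estimation procedure can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes a Gaussian kernel centered on the samples from $p$, an $\ell_2$ penalty on the coefficients, and fixed values for the kernel width `sigma` and regularization strength `lam` (in practice these would be chosen by cross-validation). All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(x, centers, sigma):
    """Gaussian kernel matrix K[i, l] = exp(-||x_i - c_l||^2 / (2 sigma^2))."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_relative_ratio(xp, xq, alpha, sigma=1.0, lam=0.1):
    """Fit r_alpha(x) ~ sum_l theta_l K(x, c_l) by regularized least squares."""
    centers = xp                                    # assumption: centers are the p-samples
    Kp = gaussian_kernel(xp, centers, sigma)        # shape (n, n)
    Kq = gaussian_kernel(xq, centers, sigma)        # shape (n', n)
    H = alpha * Kp.T @ Kp / len(xp) + (1.0 - alpha) * Kq.T @ Kq / len(xq)
    h = Kp.mean(axis=0)
    # Minimizer of 0.5 * theta' H theta - h' theta + 0.5 * lam * ||theta||^2.
    theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return lambda x: gaussian_kernel(x, centers, sigma) @ theta

def pe_estimates(r_hat, xp, xq, alpha):
    """Two plug-in estimates of PE_alpha from a fitted ratio function."""
    rp, rq = r_hat(xp), r_hat(xq)
    pe_hat = (-alpha * np.mean(rp ** 2) / 2.0
              - (1.0 - alpha) * np.mean(rq ** 2) / 2.0
              + np.mean(rp) - 0.5)
    pe_tilde = np.mean(rp) / 2.0 - 0.5
    return pe_hat, pe_tilde

rng = np.random.default_rng(0)
xp = rng.normal(0.0, 1.0, size=(200, 1))    # samples from p
xq = rng.normal(0.5, 1.0, size=(200, 1))    # samples from q
r_hat = fit_relative_ratio(xp, xq, alpha=0.5)
pe_hat, pe_tilde = pe_estimates(r_hat, xp, xq, alpha=0.5)
```

The fitted `r_hat` is an ordinary function of `x`, so it can be evaluated at points outside both sample sets.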

This approach, known in the literature as RuLSIF (relative unconstrained least-squares importance fitting), attains minimax-optimal nonparametric convergence rates, and in the parametric regime its asymptotic variance does not depend on the number of model parameters (Yamada et al., 2011).

3. Application to Policy Regularization: PPO-RPE

In Proximal Policy Optimization (PPO), the standard practice regularizes the updated policy $\pi$ relative to a baseline $b$ by constraining the density ratio $\rho = \pi / b$ within a symmetric interval $[1-\epsilon, 1+\epsilon]$. However, $\rho$ is inherently asymmetric, since it ranges over $(0, \infty)$ around $1$, leading to unbalanced regularization and unclear minimization targets (Kobayashi, 2020, Kobayashi, 2022).
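One way to see the imbalance is to compare the two clipping bounds on the log-ratio scale, where a change and its reverse have equal magnitude. The snippet below is a small illustration with an assumed threshold value; it is not drawn from the cited papers.

```python
import math

eps = 0.2                          # a common PPO clipping value, used here for illustration
lower, upper = 1.0 - eps, 1.0 + eps

# Equal distances from 1 on the raw scale are unequal on the log-ratio scale.
log_lower = math.log(lower)
log_upper = math.log(upper)

# Reciprocal of the upper bound: the same-sized change in the opposite direction.
mirrored_lower = 1.0 / upper
```

Here `mirrored_lower` is larger than `lower`, so the interval permits a bigger proportional decrease in probability than increase.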

PPO-RPE introduces the RPE divergence in policy update objectives:

The relative ratio for policies is $r_\alpha = \frac{\pi}{\alpha \pi + (1-\alpha) b}$, with $\alpha \in [0,1]$, where $\pi$ is the updated policy and $b$ the baseline policy.
The RPE divergence between policies is

$\mathrm{PE}_\alpha(\pi \,\|\, b) = \frac{1}{2}\, \mathbb{E}_{a \sim \alpha \pi + (1-\alpha) b}\left[ \left( r_\alpha - 1 \right)^2 \right]$

The surrogate objective includes a penalty

$C \, \mathrm{PE}_\alpha(\pi \,\|\, b), \qquad C \ge 0$

yielding a regularized loss

$\mathcal{L}(\pi) = -\mathbb{E}_{a \sim b}\left[ \rho A \right] + C \, \mathrm{PE}_\alpha(\pi \,\|\, b)$

where $\rho = \pi / b$ is the density ratio and $A$ the advantage estimate.

This framework enables explicit control over update size, preserves symmetry in clipping thresholds with respect to $r_\alpha$, and yields a mathematically principled minimization target (Kobayashi, 2020).
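A sample-based version of such a regularized loss can be sketched as follows. This is an illustration under assumptions, not the exact objective of the cited papers: it takes the loss to be the negative importance-weighted advantage plus a coefficient times the relative Pearson divergence, with the expectation under the mixture rewritten as an expectation under the baseline. The names `rpe_regularized_loss`, `rho`, `adv`, and `coef` are hypothetical.

```python
import numpy as np

def rpe_regularized_loss(rho, adv, alpha=0.5, coef=1.0):
    """Monte Carlo estimate of -E_b[rho * A] + coef * PE_alpha(pi || b).

    rho: raw policy ratios pi(a|s) / b(a|s) for actions sampled from the baseline b.
    adv: advantage estimates for the same samples.
    """
    rho = np.asarray(rho, dtype=float)
    adv = np.asarray(adv, dtype=float)
    mix = alpha * rho + (1.0 - alpha)          # mixture density divided by b
    r_rel = rho / mix                          # relative ratio pi / (alpha*pi + (1-alpha)*b)
    # Weighting by `mix` turns the average over b-samples into an expectation under the mixture.
    penalty = 0.5 * mix * (r_rel - 1.0) ** 2
    return np.mean(-rho * adv + coef * penalty)

# With rho = 1 everywhere the penalty vanishes and the loss is minus the mean advantage.
loss_at_baseline = rpe_regularized_loss([1.0, 1.0], [1.0, -3.0])
```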

4. Symmetry, Threshold Design, and Adaptive Tuning

A critical property is the symmetry of the relative ratio:

For $\alpha = 1/2$, $r_\alpha = 2\rho / (\rho + 1)$, with $\rho = \pi / b$ the raw policy ratio, is symmetric about $1$; $r_\alpha \in [0, 2]$, and swapping $\pi$ and $b$ maps $r_\alpha - 1$ to its negative.
This symmetry allows clean, balanced regularization of excursions above and below $1$, independent of the reference policy.
The maximum possible deviation is $|r_\alpha - 1| \le 1$; thus, the error scale is task- and algorithm-agnostic (Kobayashi, 2022).
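The symmetry can be verified numerically. The sketch below assumes $\alpha = 1/2$ and writes the relative ratio as a function of the raw ratio $\rho = \pi / b$; the function name is illustrative.

```python
import numpy as np

def relative_policy_ratio(rho, alpha=0.5):
    """Relative ratio pi / (alpha*pi + (1-alpha)*b) expressed through rho = pi / b."""
    rho = np.asarray(rho, dtype=float)
    return rho / (alpha * rho + (1.0 - alpha))

rho = np.array([0.1, 0.5, 2.0, 10.0])
up = relative_policy_ratio(rho) - 1.0            # deviation when pi / b = rho
down = relative_policy_ratio(1.0 / rho) - 1.0    # deviation with pi and b swapped
```

The two deviations are exact negatives of each other, and both stay inside $(-1, 1)$ for any finite positive $\rho$.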

Thresholds are set in the symmetric domain for $r_\alpha$, as $r_\alpha = 1 \pm \epsilon$, while the raw ratio threshold,

$\rho_\pm = \frac{(1-\alpha)(1 \pm \epsilon)}{1 - \alpha (1 \pm \epsilon)}$

obtained by solving $r_\alpha = 1 \pm \epsilon$ for $\rho = \pi / b$, is asymmetric in $\rho$ but consistent with the geometry of density changes; for $\alpha = 1/2$ it reduces to $(1 \pm \epsilon)/(1 \mp \epsilon)$. Adaptive thresholds may be computed from a statistic that tracks the observed maximal deviation in $r_\alpha$, with smoothing hyperparameters providing numeric control (Kobayashi, 2022).
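The mapping from a symmetric threshold on the relative ratio back to the raw ratio follows from inverting $r_\alpha = \rho / (\alpha \rho + 1 - \alpha)$. The sketch below is derived from that definition rather than copied from the cited papers, and the function name is hypothetical. It requires $\alpha (1 + \epsilon) < 1$ so that the upper threshold is finite.

```python
def raw_ratio_thresholds(eps, alpha=0.5):
    """Map the symmetric thresholds r_alpha = 1 -/+ eps back to rho = pi / b."""
    def invert(r):
        # Solve r = rho / (alpha*rho + 1 - alpha) for rho; valid while alpha * r < 1.
        return (1.0 - alpha) * r / (1.0 - alpha * r)
    return invert(1.0 - eps), invert(1.0 + eps)

rho_lo, rho_hi = raw_ratio_thresholds(0.2)   # 2/3 and 3/2 for alpha = 0.5
```

For $\alpha = 1/2$ the two raw thresholds are reciprocals of each other, which is the sense in which they are balanced despite being numerically asymmetric around $1$.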

5. PPO-RPE Algorithmic Procedure

The iterative PPO-RPE algorithm proceeds as follows (Kobayashi, 2020, Kobayashi, 2022):

Collect trajectories under the baseline policy $b$.
Compute advantage estimates $A$ for each sample.
For each sampled state-action pair $(s, a)$:
- Calculate the raw ratio $\rho = \pi(a \mid s) / b(a \mid s)$ and the relative ratio $r_\alpha$.
- Set the penalty coefficient $C$ such that the regularized gradient vanishes at the threshold $r_\alpha = 1 \pm \epsilon$.
- Modify the surrogate advantage to incorporate the RPE penalty.
- Apply gradient descent using the adjusted advantage.
Adaptively update the baseline policy and the threshold $\epsilon$.
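The coefficient-setting step can be illustrated with a worked sketch. It assumes a per-sample loss of the form $-\rho A + C \phi(\rho)$ with $\phi(\rho) = \frac{1}{2} (1-\alpha)^2 (\rho - 1)^2 / (1 + \alpha (\rho - 1))$, which is the relative Pearson penalty rewritten for samples drawn from the baseline. Setting the derivative with respect to $\rho$ to zero at the raw threshold gives $C = A / \phi'(\rho^\ast)$. This derivation follows from the definitions in this article and may differ in detail from the cited papers; all names are hypothetical.

```python
import numpy as np

def penalty(rho, alpha):
    """Per-sample RPE penalty 0.5 * (1-alpha)^2 * (rho-1)^2 / (1 + alpha*(rho-1))."""
    u = rho - 1.0
    return 0.5 * (1.0 - alpha) ** 2 * u ** 2 / (1.0 + alpha * u)

def penalty_grad(rho, alpha):
    """Derivative of `penalty` with respect to rho."""
    u = rho - 1.0
    return 0.5 * (1.0 - alpha) ** 2 * u * (2.0 + alpha * u) / (1.0 + alpha * u) ** 2

def penalty_coefficient(adv, rho_lo, rho_hi, alpha=0.5):
    """Coefficient C that zeroes d/d rho [-rho*A + C*penalty(rho)] at the threshold.

    Positive advantages push rho up, so the upper threshold applies;
    negative advantages push rho down, so the lower threshold applies.
    """
    adv = np.asarray(adv, dtype=float)
    rho_star = np.where(adv >= 0, rho_hi, rho_lo)
    return adv / penalty_grad(rho_star, alpha)

def per_sample_loss(rho, adv, coef, alpha=0.5):
    """Penalized surrogate for one sample (lower is better)."""
    return -rho * adv + coef * penalty(rho, alpha)

adv = np.array([1.0, -2.0])
coef = penalty_coefficient(adv, rho_lo=2.0 / 3.0, rho_hi=1.5)
```

Under these assumptions the coefficient is positive for either sign of the advantage, so the penalty always opposes movement past the threshold.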

Empirical loss curves and score statistics indicate that PPO-RPE matches or outperforms standard PPO, especially in tasks where standard PPO's symmetric clipping leads to instability or premature convergence (Kobayashi, 2020, Kobayashi, 2022).

6. Theoretical Guarantees and Empirical Results

PPO-RPE provides several theoretical and practical benefits over standard PPO:

The update is always bounded at the symmetric thresholds for $r_\alpha$; gradients vanish there, giving explicit "proximal" control.
The divergence minimization has well-defined analytic properties: in nonparametric domains, the sup-norm and rate of convergence are controlled, and in parametric regimes, the asymptotic variance does not depend on the number of model parameters (Yamada et al., 2011).
The symmetric structure ensures balanced exploration and stability.

Empirical evaluation across control and robotic locomotion tasks has demonstrated:

PPO-RPE achieves superior stability in both low- and high-dimensional settings.
Adaptive thresholding further enhances learning on difficult, unstable tasks, exceeding the success rates of fixed-threshold or baseline PPO algorithms.
No task-specific tuning of $\epsilon$ is required for PPO-RPE-A (adaptive threshold), and the learned thresholds maintain policy divergence within safe bands (Kobayashi, 2022).

7. Applications and Extensions

Relative Pearson divergence methods underpin a range of statistical and learning machinery:

Two-sample homogeneity testing via permutation tests, with lower type I and type II errors in medium $\alpha$ regimes.
Inlier-based outlier detection using $\hat{r}_\alpha(x)$ as a statistic, providing improvements in area under curve (AUC) measures in high-dimensional settings.
Transfer learning under covariate shift, where the use of relative instead of raw importance weights yields stabilized estimates and improved regression or classification performance.
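The stabilizing effect of relative importance weights under covariate shift can be illustrated with known densities. The sketch below assumes Gaussian training and test input densities chosen purely for illustration, and uses the Kish effective sample size as a rough measure of how evenly the weights are spread; none of these choices come from the cited papers.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def effective_sample_size(w):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    return w.sum() ** 2 / (w ** 2).sum()

rng = np.random.default_rng(0)
x_train = rng.normal(0.0, 1.0, size=1000)      # training inputs drawn from q
p = normal_pdf(x_train, 1.0, 1.0)              # test input density p, assumed known here
q = normal_pdf(x_train, 0.0, 1.0)              # training input density q

alpha = 0.5
w_plain = p / q                                # ordinary importance weights, unbounded
w_rel = p / (alpha * p + (1.0 - alpha) * q)    # relative weights, at most 1/alpha

ess_plain = effective_sample_size(w_plain)
ess_rel = effective_sample_size(w_rel)
```

In practice the densities are unknown and the relative weights would come from a fitted estimator rather than from closed-form densities.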

Within reinforcement learning, the RPE divergence's unique properties facilitate more robust and explainable policy regularization, unifying f-divergence control with algorithm-adaptive constraints (Yamada et al., 2011, Kobayashi, 2020, Kobayashi, 2022).