Relative Pearson Divergence

Updated 16 June 2026
  • Relative Pearson Divergence is a parametric f-divergence that uses mixture-based regularization to yield bounded density ratios and robust distribution comparison.
  • It offers key mathematical properties such as boundedness, asymmetry, and smooth estimator behavior, which enhance numerical stability and prevent overfitting.
  • Applications in change-point detection, policy optimization in deep RL, and hypothesis testing demonstrate its practical advantages over standard divergences.

Relative Pearson Divergence (RPE) is a parametric family of f-divergences for measuring the difference between two probability distributions, characterized by a mixture-based regularization of the classic Pearson’s χ²-divergence. It is increasingly used in robust distribution comparison, machine learning, change-point detection, and deep reinforcement learning due to its favorable mathematical structure, numerical stability, and statistical properties. RPE divergence allows for direct density-ratio estimation, ensuring boundedness and smoother optimization landscapes relative to ordinary (non-relative) divergences.

1. Formal Definition and Construction

Given two probability densities $p(x)$ and $q(x)$ on $\mathcal{X}$ and a parameter $\alpha\in[0,1)$ (or $\beta\in[0,1]$), define the mixture density:

$$q_\alpha(x) = \alpha\,p(x) + (1-\alpha)\,q(x)$$

The α-relative density ratio is

$$r_\alpha(x) = \frac{p(x)}{q_\alpha(x)} = \frac{p(x)}{\alpha\,p(x)+(1-\alpha)\,q(x)}$$

The α-Relative Pearson Divergence is then the Pearson divergence of $p$ with respect to the mixture:

$$\mathrm{RPE}_\alpha(p\parallel q) = \frac12 \int ( r_\alpha(x) - 1 )^2\, q_\alpha(x) \, dx$$

This can be interpreted as measuring the “distance” between $p$ and $q$ not directly, but by interpolating in the space of densities for enhanced robustness and stability (Yamada et al., 2011, Liu et al., 2012, Kobayashi, 2020).
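As a minimal sketch of the definition, the snippet below evaluates the divergence for two small discrete distributions, replacing the integral with a sum. The function name `relative_pearson` and the example probability vectors are illustrative assumptions, and the code assumes the mixture is strictly positive on every outcome.

```python
import numpy as np

def relative_pearson(p, q, alpha):
    """alpha-relative Pearson divergence between two discrete distributions.

    Assumes alpha * p + (1 - alpha) * q > 0 on every outcome.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    q_alpha = alpha * p + (1.0 - alpha) * q   # mixture density q_alpha
    r_alpha = p / q_alpha                     # relative density ratio r_alpha
    # Discrete analogue of (1/2) * integral of (r_alpha - 1)^2 * q_alpha
    return 0.5 * np.sum((r_alpha - 1.0) ** 2 * q_alpha)

# Hypothetical example distributions on three outcomes
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

for alpha in (0.0, 0.1, 0.5):
    print(f"alpha={alpha}: RPE = {relative_pearson(p, q, alpha):.6f}")
```

Setting `alpha=0.0` gives the ordinary Pearson divergence of the same pair, so the loop shows how the value shrinks as more of $p$ is mixed into the denominator.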

2. Relationship to Standard Pearson Divergence

The standard Pearson (χ²) divergence is given by:

$$\mathrm{PE}(p\parallel q) = \frac12 \int \left( \frac{p(x)}{q(x)} - 1 \right)^2 q(x) \, dx$$

RPE replaces the denominator $q(x)$ with the mixture $q_\alpha(x)$. For $\alpha = 0$, RPE reduces to the standard Pearson divergence. For $\alpha > 0$, the denominator gains positivity wherever $p(x)$ is supported, ensuring the relative density ratio $r_\alpha(x)$ is uniformly bounded above by $1/\alpha$. In contrast, in the standard case, the density ratio $p(x)/q(x)$ can diverge as $q(x)$ approaches zero (Yamada et al., 2011, Kobayashi, 2020). This property is critical for numerical stability and smoothness in estimation and learning.
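The bound on the relative ratio follows in one line from the definition. The worked step below writes the raw ratio as $r(x) = p(x)/q(x)$, a symbol introduced here for convenience, and assumes $0 < \alpha < 1$ with $p(x) > 0$ and $q(x) > 0$. The second identity rewrites the divergence so that its dependence on the mixture denominator is explicit.

```latex
r_\alpha(x)
  = \frac{p(x)}{\alpha\,p(x) + (1-\alpha)\,q(x)}
  = \frac{r(x)}{\alpha\,r(x) + 1 - \alpha}
  = \frac{1}{\alpha + (1-\alpha)/r(x)}
  \le \frac{1}{\alpha}

% Since r_\alpha(x) - 1 = (1-\alpha)\,\bigl(p(x) - q(x)\bigr) / q_\alpha(x):
\mathrm{RPE}_\alpha(p \parallel q)
  = \frac{(1-\alpha)^2}{2} \int \frac{\bigl(p(x) - q(x)\bigr)^2}{q_\alpha(x)} \, dx
```

The second form also makes clear that the divergence vanishes as $\alpha \to 1$, which is consistent with restricting $\alpha$ to $[0,1)$.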

3. Key Mathematical Properties

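  • Boundedness: $r_\alpha(x) \le 1/\alpha$ for all $x$ when $\alpha > 0$.
  • Nonnegativity: $\mathrm{RPE}_\alpha(p\parallel q) \ge 0$ and $\mathrm{RPE}_\alpha(p\parallel q) = 0$ iff $p = q$.
  • Asymmetry: $\mathrm{RPE}_\alpha(p\parallel q) \ne \mathrm{RPE}_\alpha(q\parallel p)$ in general; the two coincide when $\alpha = 1/2$, where the mixture is symmetric in $p$ and $q$, and trivially when $p = q$.
  • Monotonicity of r: The mapping from the raw ratio $r(x) = p(x)/q(x)$ to the relative ratio $r_\alpha = r/(\alpha r + 1 - \alpha)$ is smooth, strictly increasing, and compresses the large values that arise for small $q(x)$, which moderates outlier impact.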
  • Smooth estimator: Estimation of RPE is more stable and less prone to overfitting than estimation of the ordinary Pearson divergence, with asymptotic variance that can be independent of model complexity under parametric models (Yamada et al., 2011).
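The properties above can be checked numerically on a pair of discrete distributions. This is a sketch under stated assumptions: the helper `rpe`, the Dirichlet-sampled distributions, and the value $\alpha = 0.2$ are all illustrative choices, and the integral is again replaced by a sum.

```python
import numpy as np

def rpe(p, q, alpha):
    """Discrete alpha-relative Pearson divergence (mixture assumed strictly positive)."""
    q_alpha = alpha * p + (1.0 - alpha) * q
    return 0.5 * np.sum((p / q_alpha - 1.0) ** 2 * q_alpha)

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(6))   # hypothetical distribution on 6 outcomes
q = rng.dirichlet(np.ones(6))
alpha = 0.2

# Boundedness: the relative ratio never exceeds 1 / alpha
r_alpha = p / (alpha * p + (1.0 - alpha) * q)
print("max relative ratio:", r_alpha.max(), "bound:", 1.0 / alpha)

# Nonnegativity, and (numerically) zero divergence when both arguments agree
print("RPE(p||q):", rpe(p, q, alpha), " RPE(p||p):", rpe(p, p, alpha))

# Asymmetry in general, with equality in the special case alpha = 1/2
print("RPE(q||p):", rpe(q, p, alpha))
print("symmetric at alpha=0.5:", np.isclose(rpe(p, q, 0.5), rpe(q, p, 0.5)))

# Monotone, compressive map from raw ratio r to r / (alpha * r + 1 - alpha)
raw = np.array([0.1, 1.0, 10.0, 1000.0])
rel = raw / (alpha * raw + 1.0 - alpha)
print("raw ratios:", raw, "-> relative ratios:", rel)
```

Even a raw ratio of 1000 maps to a relative ratio just under the ceiling $1/\alpha = 5$, which is the compression effect that limits the influence of outliers.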

4. Estimation Algorithms: RuLSIF

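The RuLSIF (Relative unconstrained Least-Squares Importance Fitting) method provides a direct estimation strategy for the relative density ratio $r_\alpha(x)$ without explicit density estimation. A kernel expansion model

$$g(x;\theta) = \sum_{\ell=1}^{b} \theta_\ell\, K(x, c_\ell)$$

with kernel centres $c_\ell$ (typically drawn from the $p$-samples) is fitted by minimizing an empirical squared-error criterion:

$$\hat\theta = \arg\min_\theta \left[ \frac{\alpha}{2n} \sum_{i=1}^{n} g(x_i;\theta)^2 + \frac{1-\alpha}{2n'} \sum_{j=1}^{n'} g(x'_j;\theta)^2 - \frac{1}{n} \sum_{i=1}^{n} g(x_i;\theta) + \frac{\lambda}{2}\,\theta^\top\theta \right]$$

with samples $\{x_i\}_{i=1}^{n} \sim p$, $\{x'_j\}_{j=1}^{n'} \sim q$, and regularization parameter $\lambda \ge 0$. The closed-form minimizer is

$$\hat\theta = (\hat H + \lambda I_b)^{-1}\, \hat h$$

where $\hat H$ and $\hat h$ are the kernel-based empirical moment matrix and vector (Yamada et al., 2011, Liu et al., 2012):

$$\hat H_{\ell\ell'} = \frac{\alpha}{n} \sum_{i=1}^{n} K(x_i, c_\ell)\,K(x_i, c_{\ell'}) + \frac{1-\alpha}{n'} \sum_{j=1}^{n'} K(x'_j, c_\ell)\,K(x'_j, c_{\ell'}), \qquad \hat h_\ell = \frac{1}{n} \sum_{i=1}^{n} K(x_i, c_\ell)$$

The final divergence estimate is obtained via a plug-in:

$$\widehat{\mathrm{RPE}}_\alpha = -\frac{\alpha}{2n} \sum_{i=1}^{n} g(x_i;\hat\theta)^2 - \frac{1-\alpha}{2n'} \sum_{j=1}^{n'} g(x'_j;\hat\theta)^2 + \frac{1}{n} \sum_{i=1}^{n} g(x_i;\hat\theta) - \frac12$$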
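The estimator described above can be sketched in a few lines of NumPy. This is an illustrative implementation rather than the authors' reference code: the function names, the Gaussian kernel, the default values of `alpha`, `sigma`, `lam`, and `n_basis`, and the choice of kernel centres as a random subset of the $p$-samples are all assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(x, centers, sigma):
    """Pairwise Gaussian kernel values; returns an array of shape (len(x), len(centers))."""
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def rulsif(x_p, x_q, alpha=0.1, sigma=1.0, lam=0.1, n_basis=100, seed=0):
    """Fit the relative density ratio and return (theta, centers, rpe_estimate).

    x_p, x_q: arrays of shape (n, d) and (n', d) sampled from p and q.
    """
    rng = np.random.default_rng(seed)
    b = min(n_basis, len(x_p))
    # Assumption: kernel centres are a random subset of the p-samples
    centers = x_p[rng.choice(len(x_p), size=b, replace=False)]

    phi_p = gaussian_kernel(x_p, centers, sigma)   # shape (n, b)
    phi_q = gaussian_kernel(x_q, centers, sigma)   # shape (n', b)

    # Empirical moment matrix and vector, weighted by the mixing parameter
    H = (alpha * phi_p.T @ phi_p / len(x_p)
         + (1.0 - alpha) * phi_q.T @ phi_q / len(x_q))
    h = phi_p.mean(axis=0)

    # Closed-form regularized least-squares solution
    theta = np.linalg.solve(H + lam * np.eye(b), h)

    # Plug-in divergence estimate from the fitted ratio values
    g_p = phi_p @ theta
    g_q = phi_q @ theta
    rpe = (-alpha * np.mean(g_p ** 2) / 2.0
           - (1.0 - alpha) * np.mean(g_q ** 2) / 2.0
           + np.mean(g_p) - 0.5)
    return theta, centers, rpe

# Hypothetical data: two unit-variance Gaussians with shifted means
rng = np.random.default_rng(1)
x_p = rng.normal(0.0, 1.0, size=(300, 1))
x_q = rng.normal(1.0, 1.0, size=(300, 1))
theta, centers, rpe_hat = rulsif(x_p, x_q)
print("estimated RPE:", rpe_hat)
```

Because the fitted ratio is shrunk by the regularizer, the plug-in value can fall below zero on finite samples even though the population divergence is nonnegative; with this closed-form solution it cannot fall below $-1/2$.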

5. Theoretical Properties and Statistical Guarantees

Empirically and theoretically, the RPE and its estimators outperform their ordinary (non-relative) counterparts in both parametric and nonparametric contexts:

  • Nonparametric convergence rates: RPE estimators achieve standard $O_p(n^{-1/2})$ rates, with tighter constants for larger $\alpha$. The error terms diminish due to the boundedness of $r_\alpha$.
  • Asymptotic variance: For parametric models, the variance of the RPE estimator does not depend on model complexity, due to smoothing in the relative ratio. This prevents overfitting even as model complexity increases (Yamada et al., 2011).
  • Consistency: Under standard RKHS assumptions, RuLSIF-based estimators of the RPE are statistically consistent.
  • Numerical stability: The boundedness and mixture denominator mitigate explosion of density ratios, yielding robust estimation even in high-dimensional and heavy-tailed regimes.

6. Algorithmic Details and Practical Implementation

A canonical RPE estimation and application workflow:

  1. Parameter and kernel selection: Choices for $\alpha$, kernel bandwidth $\sigma$, and regularization $\lambda$ are typically determined by cross-validation, minimizing the empirical squared error.
  2. Matrix computation: Kernel matrices and empirical moment vectors are assembled using both $p$- and $q$-samples, weighted by the mixing parameter.
  3. System solution: The estimator reduces to a linear solve, requiring $O(b^3)$ for $b$ basis functions, often selectable by basis subsampling for scalability.
  4. Divergence calculation: Once $\hat\theta$ is computed, both $\hat r_\alpha(x)$ and the final RPE can be efficiently estimated.
  5. Symmetrization: For many applications (such as change-point detection), the symmetrized divergence

$$\mathrm{RPE}_\alpha(p\parallel q) + \mathrm{RPE}_\alpha(q\parallel p)$$

is used to ensure detection of distributional changes in either direction (Liu et al., 2012).
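The symmetrization step can be illustrated with a sliding-window score on a synthetic series. This is a simplified sketch, not the procedure of Liu et al. (2012): the helper names, the fixed kernel width and regularization (no cross-validation), the window width and stride, and the use of every window sample as a kernel centre are all assumptions chosen to keep the example short.

```python
import numpy as np

def rulsif_rpe(x_p, x_q, alpha=0.1, sigma=1.0, lam=0.1):
    """Plug-in estimate of RPE_alpha(p||q); every p-sample serves as a kernel centre."""
    def kernel(x):
        sq_dist = ((x[:, None, :] - x_p[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq_dist / (2.0 * sigma ** 2))

    phi_p, phi_q = kernel(x_p), kernel(x_q)
    H = (alpha * phi_p.T @ phi_p / len(x_p)
         + (1.0 - alpha) * phi_q.T @ phi_q / len(x_q))
    h = phi_p.mean(axis=0)
    theta = np.linalg.solve(H + lam * np.eye(len(x_p)), h)
    g_p, g_q = phi_p @ theta, phi_q @ theta
    return (-alpha * np.mean(g_p ** 2) / 2.0
            - (1.0 - alpha) * np.mean(g_q ** 2) / 2.0
            + np.mean(g_p) - 0.5)

def symmetric_score(window_a, window_b, **kwargs):
    """Symmetrized divergence: estimate in both directions and add."""
    return (rulsif_rpe(window_a, window_b, **kwargs)
            + rulsif_rpe(window_b, window_a, **kwargs))

# Hypothetical series: the mean jumps from 0 to 3 at index 200
rng = np.random.default_rng(0)
series = np.concatenate([rng.normal(0.0, 1.0, 200),
                         rng.normal(3.0, 1.0, 200)])[:, None]

width, stride = 50, 25
positions = list(range(width, len(series) - width + 1, stride))
scores = [symmetric_score(series[t - width:t], series[t:t + width])
          for t in positions]

for t, s in zip(positions, scores):
    print(f"t={t:3d}  score={s:.3f}")
```

Each score compares the `width` samples before position `t` with the `width` samples after it, so a pronounced peak in the printed scores indicates a candidate change point. Summing both directions means a change registers whether mass appears in the later window or disappears from it.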

7. Applications and Empirical Performance

RPE divergence is widely applied in:

  • Change-point detection: RuLSIF-based RPE estimators enable robust, direct drift-detection in time series by comparing rolling windows. Empirical results show improved accuracy and numerical robustness on real-world datasets, such as speech signals and social media streams (Liu et al., 2012).
  • Policy optimization in RL: In PPO-RPE (Kobayashi, 2020), RPE regularization replaces conventional clipping regularizers. Here, RPE constrains new policy iterates to remain close to the baseline, enforces explicit minimization of the divergence measure, achieves automatic scaling of penalty strength according to advantage magnitude, and introduces principled, threshold-based regularization:
    • The relative ratio domain is bounded and symmetric, allowing balanced regularization of policy updates.
    • Asymmetry in the raw likelihood-ratio threshold arises from the mapping between the relative and ordinary domains, matching the intrinsic asymmetry of policy update landscapes.
    • Empirically, PPO-RPE matches or outperforms standard PPO and rollback variants on low- and high-dimensional benchmarks, providing better early-stage performance and greater stability by avoiding catastrophic policy shifts.
  • Hypothesis testing and two-sample tasks: In permutation-based homogeneity testing, symmetrized RPE yields Type-I error control and superior Type-II error performance over the plain Pearson divergence due to greater smoothness and boundedness (Yamada et al., 2011).
  • Outlier detection and transfer learning: RPE-based scoring improves robustness and stability of outlier detection and covariate shift reweighting (e.g., RuLSIF scoring for AUC maximization, and importance weighting for regression or classification tasks).
