Rényi Divergence Gradient: Theory & Applications
- Rényi divergence gradients are explicit derivative formulas for the Rényi divergence between probability distributions, expressed through tilted and reweighted expectations.
- They underpin optimization techniques across classical statistics, variational inference, quantum information, and control through structured gradient flows and geometric insights.
- The Rényi divergence generalizes conventional divergences such as the KL divergence (recovered in the limit of order one), and its gradients offer robust convergence properties in applications from deep learning to risk-sensitive policy updates.
Rényi divergence gradients underpin a family of information-theoretic optimization methods used in classical statistics, information geometry, quantum information theory, variational inference, machine learning, and control. The explicit forms, computational characteristics, and geometric structures of these gradients are central to theoretical analysis and algorithmic development across these fields.
1. Definition and General Formulas
The Rényi divergence of order $\alpha \in (0,1)\cup(1,\infty)$ between probability measures $P$ and $Q$ on a common domain, with densities $p$ and $q$, is
$$D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1} \log \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx.$$
In the limit $\alpha \to 1$, this reduces to the Kullback-Leibler divergence. The gradient of $D_\alpha(P \,\|\, Q_\theta)$ with respect to the parameters $\theta$ of a parametric distribution $Q_\theta$ takes the generic form
$$\nabla_\theta D_\alpha(P \,\|\, Q_\theta) = -\,\mathbb{E}_{\tilde r_{\alpha,\theta}}\!\left[\nabla_\theta \log q_\theta(X)\right],$$
where $\tilde r_{\alpha,\theta}$ is the "tilted" distribution proportional to $p^{\alpha} q_\theta^{1-\alpha}$, i.e.,
$$\tilde r_{\alpha,\theta}(x) = \frac{p(x)^{\alpha}\, q_\theta(x)^{1-\alpha}}{\int p(y)^{\alpha}\, q_\theta(y)^{1-\alpha}\, dy}.$$
An equivalent representation brings the expectation under $q_\theta$ with non-linear weights $w_{\alpha,\theta}(x) \propto \big(p(x)/q_\theta(x)\big)^{\alpha}$, so that
$$\nabla_\theta D_\alpha(P \,\|\, Q_\theta) = -\,\frac{\mathbb{E}_{q_\theta}\!\left[w_{\alpha,\theta}(X)\, \nabla_\theta \log q_\theta(X)\right]}{\mathbb{E}_{q_\theta}\!\left[w_{\alpha,\theta}(X)\right]}.$$
These formulas provide the foundation for Rényi-based optimization in both continuous and discrete domains (Ito et al., 4 Nov 2024).
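A minimal numerical sketch (not taken from the cited papers) can confirm the tilted-expectation identity for a softmax-parameterized discrete $q_\theta$; the helper names (`renyi_div`, `grad_via_tilted`) and the finite-difference check are illustrative choices.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def renyi_div(p, q, alpha):
    """D_alpha(p || q) for discrete distributions given as 1-D arrays."""
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

def grad_via_tilted(p, theta, alpha):
    """Gradient of D_alpha(p || q_theta) via the tilted distribution r ∝ p^a q^(1-a)."""
    q = softmax(theta)
    r = p**alpha * q**(1 - alpha)
    r /= r.sum()
    # For a softmax model, grad_theta log q_theta(x) = e_x - q, so E_r[...] = r - q.
    return -(r - q)

rng = np.random.default_rng(0)
p, theta, alpha = rng.dirichlet(np.ones(5)), rng.normal(size=5), 0.7

# Compare against central finite differences of the divergence itself.
numeric = np.array([
    (renyi_div(p, softmax(theta + h), alpha) - renyi_div(p, softmax(theta - h), alpha)) / 2e-6
    for h in 1e-6 * np.eye(5)
])
print(np.allclose(grad_via_tilted(p, theta, alpha), numeric, atol=1e-5))  # True
```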
2. Discrete and Geometric Interpretations
For discrete probability vectors $p, q \in \Delta^{n-1}$, the gradient of $D_\alpha(p \,\|\, q)$ with respect to $q$ is
$$\frac{\partial}{\partial q_i} D_\alpha(p \,\|\, q) = -\,\frac{\tilde r_i}{q_i},$$
where $\tilde r_i = p_i^{\alpha} q_i^{1-\alpha} \big/ \sum_j p_j^{\alpha} q_j^{1-\alpha}$ is the tilted probability vector (Wong, 2017). In vector notation, this is
$$\nabla_q D_\alpha(p \,\|\, q) = -\,\tilde r \oslash q,$$
with $\oslash$ denoting elementwise division.
This view aligns with the geometry of statistical manifolds: Rényi divergence gradients are associated with canonical divergences on dually flat or constant curvature manifolds, connecting to optimal transport and Bregman geometry (Wong, 2017). Flows induced by these gradients follow geodesics defined by the underlying Riemannian metric.
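As a quick sanity check on the discrete gradient formula above (again an illustrative sketch, not code from Wong, 2017), one can compare $-\tilde r \oslash q$ with finite differences of $D_\alpha$ treated as a function of the unconstrained vector $q$:

```python
import numpy as np

def renyi_div(p, q, alpha):
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.4, 0.3, 0.2, 0.1])
alpha = 2.0

r = p**alpha * q**(1 - alpha)
r /= r.sum()                      # tilted vector, r_i ∝ p_i^a q_i^(1-a)
analytic = -r / q                 # ∇_q D_alpha(p || q) = -r ⊘ q (unconstrained gradient)

numeric = np.array([
    (renyi_div(p, q + h, alpha) - renyi_div(p, q - h, alpha)) / 2e-6
    for h in 1e-6 * np.eye(4)
])
print(np.allclose(analytic, numeric, rtol=1e-4))  # True
```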
3. Fokker-Planck Equations and Gradient-Flow Structure
On $\mathbb{R}^n$, for the Fokker-Planck PDE $\partial_t \rho = \nabla\!\cdot\!(\rho \nabla V) + \Delta \rho$ with drift $-\nabla V$ and strictly convex potential $V$ (stationary density $\rho_\infty \propto e^{-V}$), the time derivative (gradient flow) of the Rényi divergence along the solution has the form
$$\frac{d}{dt} D_\alpha(\rho_t \,\|\, \rho_\infty) = -\,\alpha\, I_\alpha(\rho_t \,\|\, \rho_\infty), \qquad I_\alpha(\rho \,\|\, \rho_\infty) = \int \rho_\alpha \left|\nabla \ln \frac{\rho}{\rho_\infty}\right|^2 dx,$$
where $h = \rho/\rho_\infty$, $F_\alpha = \int \rho_\infty h^{\alpha}\, dx$, and $\rho_\alpha = \rho_\infty h^{\alpha}/F_\alpha$ is the "escort" distribution; $I_\alpha$ is the relative $\alpha$-Fisher information. This framework enables exponential decay rates for the Rényi divergence, extending log-Sobolev inequalities for the KL divergence ($\alpha \to 1$) to arbitrary orders $\alpha$ (Cao et al., 2018).
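For a concrete, deliberately simple illustration, consider the 1-D Ornstein-Uhlenbeck case $V(x) = x^2/2$, where the Fokker-Planck solution remains Gaussian with known mean and variance. The grid-based integration below is an assumption of this sketch, not the analysis of Cao et al. (2018):

```python
import numpy as np

# Ornstein-Uhlenbeck Fokker-Planck: V(x) = x^2 / 2, stationary density rho_inf = N(0, 1).
# Starting from N(m0, s0^2): m_t = m0 * exp(-t), s_t^2 = 1 + (s0^2 - 1) * exp(-2t).
x = np.linspace(-12, 12, 20001)
dx = x[1] - x[0]
rho_inf = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def renyi(rho, alpha):
    """D_alpha(rho || rho_inf) by Riemann-sum integration on the grid."""
    return np.log(np.sum(rho**alpha * rho_inf**(1 - alpha)) * dx) / (alpha - 1)

def rho_t(t, m0=3.0, s0=0.5):
    m, s2 = m0 * np.exp(-t), 1 + (s0**2 - 1) * np.exp(-2 * t)
    return np.exp(-(x - m)**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

alpha = 1.5
vals = [renyi(rho_t(t), alpha) for t in (0.5, 1.0, 1.5, 2.0, 2.5)]
print(np.round(vals, 4))                    # monotone decay along the flow
print(np.round(np.diff(np.log(vals)), 3))   # roughly constant steps, i.e. near-exponential decay
```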
4. Quantum Generalizations: Sandwiched Rényi Gradients
For quantum states (faithful density operators) $\rho$ and $\sigma$ on a finite-dimensional Hilbert space $\mathcal{H}$, the "sandwiched" Rényi $\alpha$-divergence is
$$\widetilde D_\alpha(\rho \,\|\, \sigma) = \frac{1}{\alpha - 1} \log \operatorname{Tr}\!\left[\left(\sigma^{\frac{1-\alpha}{2\alpha}}\, \rho\, \sigma^{\frac{1-\alpha}{2\alpha}}\right)^{\!\alpha}\right].$$
The gradient with respect to $\rho$, in the Fréchet (trace-pairing) sense, is
$$\nabla_\rho \widetilde D_\alpha(\rho \,\|\, \sigma) = \frac{\alpha}{(\alpha - 1)\operatorname{Tr}\!\left[A^{\alpha}\right]}\; \sigma^{\frac{1-\alpha}{2\alpha}}\, A^{\alpha - 1}\, \sigma^{\frac{1-\alpha}{2\alpha}},$$
where $A = \sigma^{\frac{1-\alpha}{2\alpha}}\, \rho\, \sigma^{\frac{1-\alpha}{2\alpha}}$ (Takahashi et al., 2016). For $\alpha \to 1$, this recovers the quantum relative entropy gradient $\log\rho - \log\sigma + I$. Analogous results hold for gradient flows of the sandwiched Rényi divergence under GNS-detailed-balance Lindblad semigroups, where the Lindblad equation is proven to be the gradient flow of the divergence with respect to a non-commutative Otto-Wasserstein-like metric (Cao et al., 2018).
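A small sketch (illustrative only, using dense matrices and eigendecompositions; `funm_herm` is a hypothetical helper, not a library function) evaluates the sandwiched divergence and checks that it approaches the quantum relative entropy as $\alpha \to 1$:

```python
import numpy as np

def funm_herm(A, f):
    """Apply a scalar function to a Hermitian positive-definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.conj().T

def random_state(d, rng):
    X = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = X @ X.conj().T
    return rho / np.trace(rho).real

def sandwiched_renyi(rho, sigma, alpha):
    """Sandwiched Renyi alpha-divergence D~_alpha(rho || sigma)."""
    s = funm_herm(sigma, lambda w: w ** ((1 - alpha) / (2 * alpha)))
    A = s @ rho @ s
    return np.log(np.trace(funm_herm(A, lambda w: w**alpha)).real) / (alpha - 1)

rng = np.random.default_rng(2)
rho, sigma = random_state(3, rng), random_state(3, rng)

# alpha -> 1 recovers the quantum relative entropy Tr[rho (log rho - log sigma)].
qre = np.trace(rho @ (funm_herm(rho, np.log) - funm_herm(sigma, np.log))).real
print(sandwiched_renyi(rho, sigma, 1.001), qre)  # the two values are close
```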
5. Applications in Variational Inference and Learning
In exponential families $q_\lambda(x) = h(x)\exp\!\big(\lambda^{\top} T(x) - A(\lambda)\big)$, the gradient of the Rényi divergence with respect to the natural parameter $\lambda$ is
$$\nabla_\lambda D_\alpha(p \,\|\, q_\lambda) = \mathbb{E}_{q_\lambda}[T(X)] - \mathbb{E}_{g_{\alpha,\lambda}}[T(X)],$$
where $\mathbb{E}_{q_\lambda}[T(X)] = \nabla A(\lambda)$ is the expectation of the sufficient statistics under $q_\lambda$, and $g_{\alpha,\lambda} \propto p^{\alpha} q_\lambda^{1-\alpha}$ is the $\alpha$-geometric average density (Guilmeau et al., 2022). This is used in Rényi-divergence minimization via Bregman proximal gradient algorithms, whose updates interpolate between standard moment matching and moment matching against the $\alpha$-geometric average. This relaxed update enables robust inference and provable convergence rates.
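An elementary check of this moment-difference form (a sketch under the convention $\nabla_\lambda D_\alpha(p \,\|\, q_\lambda) = \mathbb{E}_{q_\lambda}[T] - \mathbb{E}_{g_{\alpha,\lambda}}[T]$ used above, for a Bernoulli family; not code from Guilmeau et al.):

```python
import numpy as np

# Bernoulli exponential family: q_lam(x) = exp(lam*x - A(lam)), x in {0, 1},
# with T(x) = x and A(lam) = log(1 + exp(lam)).
def q(lam):
    return np.array([1.0, np.exp(lam)]) / (1 + np.exp(lam))  # [q(0), q(1)]

def renyi(p, lam, alpha):
    return np.log(np.sum(p**alpha * q(lam)**(1 - alpha))) / (alpha - 1)

p = np.array([0.3, 0.7])          # target distribution
lam, alpha = -0.4, 0.5

g = p**alpha * q(lam)**(1 - alpha)
g /= g.sum()                      # alpha-geometric average of p and q_lam

grad_formula = q(lam)[1] - g[1]   # E_q[T] - E_g[T], with T(x) = x
grad_numeric = (renyi(p, lam + 1e-6, alpha) - renyi(p, lam - 1e-6, alpha)) / 2e-6
print(grad_formula, grad_numeric) # the two values agree closely
```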
In deep learning, e.g., Deep Mutual Learning (DML) with Rényi divergence (Huang et al., 2022), the gradient of the peer-regularization term with respect to the parameters $\theta_1$ of one network is
$$\nabla_{\theta_1} D_\alpha\!\big(p_{\theta_2} \,\|\, p_{\theta_1}\big) = -\,\mathbb{E}_{\tilde r_\alpha}\!\left[\nabla_{\theta_1} \log p_{\theta_1}(y \mid x)\right],$$
with $\tilde r_\alpha \propto p_{\theta_2}^{\alpha}\, p_{\theta_1}^{1-\alpha}$ the tilted "peer-regularization" distribution over labels. This structure generalizes cross-entropy and is tunable via $\alpha$, with the limiting behavior reducing to KL-based mutual learning as $\alpha \to 1$.
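In the classification setting, the tilted distribution is just a reweighting of the two networks' softmax outputs. The toy sketch below (hypothetical logits, NumPy only, not the training code of Huang et al.) shows the resulting logit gradient and its KL/DML limit as $\alpha \to 1$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z1, z2 = np.array([2.0, 0.5, -1.0]), np.array([1.0, 1.5, -0.5])   # hypothetical logits
p1, p2 = softmax(z1), softmax(z2)   # "student" and "peer" predictive distributions

def renyi_dml_grad(alpha):
    """Gradient of D_alpha(p2 || p1) w.r.t. the student logits z1: equals p1 - r_alpha."""
    r = p2**alpha * p1**(1 - alpha)
    r /= r.sum()                    # tilted peer-regularization distribution
    return p1 - r

print(renyi_dml_grad(0.5))
print(renyi_dml_grad(0.999), p1 - p2)  # alpha -> 1 recovers the KL/DML gradient p1 - p2
```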
6. Quantum Machine Learning and Barren Plateau Avoidance
The maximal Rényi divergence of order two between a data state $\rho$ and a parametrized model state $\sigma_\theta$,
$$\widehat D_2(\rho \,\|\, \sigma_\theta) = \log \operatorname{Tr}\!\left[\rho\, \sigma_\theta^{-1} \rho\right],$$
exhibits gradients of the form
$$\nabla_\theta \widehat D_2(\rho \,\|\, \sigma_\theta) = -\,\frac{\operatorname{Tr}\!\left[\rho\, \sigma_\theta^{-1} (\nabla_\theta \sigma_\theta)\, \sigma_\theta^{-1} \rho\right]}{\operatorname{Tr}\!\left[\rho\, \sigma_\theta^{-1} \rho\right]},$$
which are not uniformly bounded because of the $\sigma_\theta^{-1}$ factors.
This unboundedness circumvents gradient vanishing ("barren plateau") phenomena prevalent with bounded, linear cost functions in QNN training (Kieferova et al., 2021).
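Using the order-two expression reconstructed above, a toy qubit example (an illustrative parametrization, not the quantum Boltzmann machine setting of Kieferova et al.) shows the gradient growing without bound as the model state approaches the boundary of the state space:

```python
import numpy as np

def d2_max(rho, sigma):
    """Maximal Renyi-2 divergence log Tr[rho sigma^{-1} rho], as written above."""
    return np.log(np.trace(rho @ np.linalg.inv(sigma) @ rho).real)

# Data state: the pure qubit state |+><+|.
plus = np.array([1.0, 1.0]) / np.sqrt(2)
rho = np.outer(plus, plus)

def sigma(theta):
    # Model: a diagonal mixed state, nearly pure as theta -> 0.
    return np.diag([1 - theta, theta])

# As theta -> 0 the magnitude of the finite-difference gradient grows like 1/theta,
# because sigma(theta)^{-1} acquires a diverging eigenvalue.
for theta in (0.3, 0.1, 0.01, 0.001):
    eps = 1e-7
    g = (d2_max(rho, sigma(theta + eps)) - d2_max(rho, sigma(theta - eps))) / (2 * eps)
    print(theta, d2_max(rho, sigma(theta)), g)
```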
7. Control and Reinforcement Learning
In risk-sensitive control viewed as inference, Rényi divergence gradients drive new classes of policy-gradient methods. For variational objectives parametrized by a risk-sensitivity parameter tied to the Rényi order $\alpha$ (with $\alpha \to 1$ recovering the risk-neutral, KL-based objective), the policy gradients take the weighted score-function form
$$\nabla_\theta \mathcal{J}_\alpha(\theta) \propto \mathbb{E}_{\tau \sim \pi_\theta}\!\left[w_\alpha(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\right],$$
with $w_\alpha$ the normalized tilted weights of Section 1 applied to trajectory distributions; for policy update steps, the weights interpolate between mass-covering (small $\alpha$) and zero-forcing (large $\alpha$) behaviors. Actor-critic updates involve similarly explicit Rényi-gradient terms, interpolating between MaxEnt SAC and its risk-sensitive generalizations (Ito et al., 4 Nov 2024).
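The following toy sketch mirrors the tilted/weighted structure of Section 1 in a one-step decision problem: the "optimal" distribution is taken to be a reward-tilted categorical target (an illustrative construction, not the trajectory-level objective of Ito et al.), and a self-normalized sampled estimator reproduces the exact Rényi gradient of the policy logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy one-step decision problem: 3 actions, fixed rewards, categorical policy pi_theta.
# Target: fixed reward-tilted distribution p(a) ∝ exp(R(a) / eta) (illustrative choice).
rewards = np.array([1.0, 0.2, -0.5])
eta = 0.5
p = np.exp(rewards / eta); p /= p.sum()

theta = np.array([0.1, -0.2, 0.3])
pi = softmax(theta)
alpha = 1.5

# Exact gradient of D_alpha(p || pi_theta) w.r.t. logits: pi - r, with r ∝ p^a pi^(1-a).
r = p**alpha * pi**(1 - alpha); r /= r.sum()
exact = pi - r

# Self-normalized sampled estimator: actions from pi_theta, weights w ∝ (p/pi)^alpha,
# score function grad_theta log pi_theta(a) = e_a - pi.
rng = np.random.default_rng(3)
a = rng.choice(3, size=200_000, p=pi)
w = (p[a] / pi[a]) ** alpha
scores = np.eye(3)[a] - pi
estimate = -(w[:, None] * scores).sum(0) / w.sum()

print(exact, estimate)   # the Monte Carlo estimate approaches the exact gradient
```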
References
- "Exponential decay of Rényi divergence under Fokker-Planck equations" (Cao et al., 2018)
- "Information geometry of sandwiched Rényi -divergence" (Takahashi et al., 2016)
- "Logarithmic divergences from optimal transport and Rényi geometry" (Wong, 2017)
- "Rényi Divergence Deep Mutual Learning" (Huang et al., 2022)
- "Quantum Generative Training Using Rényi Divergences" (Kieferova et al., 2021)
- "Gradient flow structure and exponential decay of the sandwiched Rényi divergence for primitive Lindblad equations with GNS-detailed balance" (Cao et al., 2018)
- "Regularized Rényi divergence minimization through Bregman proximal gradient algorithms" (Guilmeau et al., 2022)
- "Risk-sensitive control as inference with Rényi divergence" (Ito et al., 4 Nov 2024)