Rényi Divergence Gradient: Theory & Applications

Updated 1 December 2025
  • Rényi divergence gradients are explicit derivative formulas that quantify differences between probability distributions through tilted and weighted expectations.
  • They underpin optimization techniques across classical statistics, variational inference, quantum information, and control through structured gradient flows and geometric insights.
  • They generalize conventional divergences such as the KL divergence, offering robust convergence properties in applications from deep learning to risk-sensitive policy updates.

Rényi divergence gradients underpin a family of information-theoretic optimization methods used in classical statistics, information geometry, quantum information theory, variational inference, machine learning, and control. The explicit forms, computational characteristics, and geometric structures of these gradients are central to theoretical analysis and algorithmic development across these fields.

1. Definition and General Formulas

The Rényi divergence of order $\alpha \neq 1$ between probability measures $p$ and $q$ on a common domain is

$$D_\alpha(p \| q) = \frac{1}{\alpha - 1} \log \int \left( \frac{p(x)}{q(x)} \right)^\alpha q(x)\, dx$$

In the limit $\alpha \to 1$, this reduces to the Kullback-Leibler divergence. The gradient of $D_\alpha(p_\theta\|q)$ with respect to the parameters $\theta$ of a parametric distribution $p_\theta(x)$ takes the generic form

$$\nabla_\theta D_\alpha(p_\theta\|q) = \frac{\alpha}{\alpha-1}\, \mathbb{E}_{r_\theta}\left[\nabla_\theta \log p_\theta(x)\right]$$

where $r_\theta(x)$ is the "tilted" distribution proportional to $p_\theta(x)^\alpha q(x)^{1-\alpha}$, i.e.,

$$r_\theta(x) = \frac{p_\theta(x)^\alpha q(x)^{1-\alpha}}{\int p_\theta(x')^\alpha q(x')^{1-\alpha}\, dx'}$$

An equivalent representation expresses the gradient as an expectation under $p_\theta$ with nonlinear weights

$$w_\theta(x) = \frac{p_\theta(x)^{\alpha-1}\, q(x)^{1-\alpha}}{\mathbb{E}_{p_\theta}\left[p_\theta(x')^{\alpha-1}\, q(x')^{1-\alpha}\right]}$$

so that

$$\nabla_\theta D_\alpha(p_\theta\|q) = \frac{\alpha}{\alpha-1}\, \mathbb{E}_{p_\theta}\left[w_\theta(x)\, \nabla_\theta \log p_\theta(x)\right]$$

These formulas provide the foundation for Rényi-based optimization in both continuous and discrete domains (Ito et al., 4 Nov 2024).
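
As a concrete illustration, the sketch below estimates this gradient by Monte Carlo with self-normalized tilted weights. The model is an assumed toy example, not one from the cited papers: $p_\theta = \mathcal{N}(\theta, 1)$ and $q = \mathcal{N}(0, 1)$, for which $D_\alpha(p_\theta\|q) = \alpha\theta^2/2$, so the exact gradient $\alpha\theta$ serves as a reference.

```python
import numpy as np

def renyi_grad_mc(theta, alpha, n=200_000, seed=None):
    """Monte Carlo estimate of grad_theta D_alpha(p_theta || q) using
    self-normalized tilted weights, for p_theta = N(theta,1), q = N(0,1)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(theta, 1.0, size=n)             # samples from p_theta
    log_ratio = 0.5 * x**2 - 0.5 * (x - theta)**2  # log p_theta(x) - log q(x)
    log_w = (alpha - 1.0) * log_ratio              # w ∝ (p_theta/q)^(alpha-1)
    w = np.exp(log_w - log_w.max())                # stabilize before normalizing
    w /= w.sum()                                   # self-normalized weights
    score = x - theta                              # grad_theta log p_theta(x)
    return alpha / (alpha - 1.0) * np.sum(w * score)

theta, alpha = 1.5, 0.7
print(renyi_grad_mc(theta, alpha, seed=0), "vs exact", alpha * theta)
```

The self-normalization replaces the intractable normalizer of $w_\theta$ with its empirical counterpart, at the cost of a small bias that vanishes as the sample size grows.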

2. Discrete and Geometric Interpretations

For discrete probability vectors $p, q \in \Delta_d$, the gradient with respect to $p$ is

$$\left[\nabla_p D_\alpha(p \| q)\right]_k = \frac{\alpha}{\alpha-1}\, \frac{p_k^{\alpha-1} q_k^{1-\alpha}}{Z}$$

where $Z = \sum_i p_i^\alpha q_i^{1-\alpha}$ (Wong, 2017). In vector notation, this is

$$\nabla_p D_\alpha(p\|q) = \frac{\alpha}{\alpha-1}\, \frac{p^{\odot(\alpha-1)} \odot q^{\odot(1-\alpha)}}{Z}$$

This view aligns with the geometry of statistical manifolds: Rényi divergence gradients are associated with canonical divergences on dually flat or constant curvature manifolds, connecting to optimal transport and Bregman geometry (Wong, 2017). Flows induced by these gradients follow geodesics defined by the underlying Riemannian metric.
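
A minimal numerical check of the discrete formula (the simplex points and coordinate are arbitrary illustrative choices); the finite difference is taken in the ambient coordinates, matching the unconstrained gradient above.

```python
import numpy as np

def renyi_div(p, q, alpha):
    return np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0)

def renyi_grad(p, q, alpha):
    Z = np.sum(p**alpha * q**(1.0 - alpha))
    return alpha / (alpha - 1.0) * p**(alpha - 1.0) * q**(1.0 - alpha) / Z

rng = np.random.default_rng(0)
p, q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
alpha, eps, k = 0.5, 1e-7, 2

pe = p.copy()
pe[k] += eps  # unconstrained perturbation of coordinate k
fd = (renyi_div(pe, q, alpha) - renyi_div(p, q, alpha)) / eps
print(renyi_grad(p, q, alpha)[k], "vs finite difference", fd)
```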

3. Fokker-Planck Equations and Gradient-Flow Structure

On $\mathbb{R}^d$, for the Fokker-Planck equation with drift $\nabla V(x)$ and strictly convex potential $V$, the time derivative (gradient flow) of the Rényi divergence $D_\alpha(p_t\|p_\infty)$ along the solution $p_t$ has the form

$$\frac{d}{dt} D_\alpha(p_t\|p_\infty) = -\frac{1}{\alpha} \int |\nabla \Phi_t(x)|^2\, d\mu_t(x) = -I_\alpha(p_t \| p_\infty)$$

where $\Phi_t = \alpha \phi_t + (\alpha-1) D_\alpha(p_t\|p_\infty)$, $\phi_t = -\log(p_t/p_\infty)$, and $\mu_t$ is the "escort" distribution; $I_\alpha$ is the relative $\alpha$-Fisher information. This framework yields exponential decay rates for the Rényi divergence, extending log-Sobolev inequalities from $\alpha = 1$ to arbitrary $\alpha > 0$ (Cao et al., 2018).
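
Concretely, if a Rényi analogue of the log-Sobolev inequality holds, say $I_\alpha(p\|p_\infty) \ge 2\lambda\, D_\alpha(p\|p_\infty)$ for some $\lambda > 0$ (the constant and its dependence on $\alpha$ vary across formulations), Grönwall's lemma turns the dissipation identity above into exponential decay:

$$\frac{d}{dt} D_\alpha(p_t\|p_\infty) \le -2\lambda\, D_\alpha(p_t\|p_\infty) \quad\Longrightarrow\quad D_\alpha(p_t\|p_\infty) \le e^{-2\lambda t}\, D_\alpha(p_0\|p_\infty).$$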

4. Quantum Generalizations: Sandwiched Rényi Gradients

For quantum states (faithful density operators) $\rho$ and $\sigma$ on a finite-dimensional Hilbert space $\mathcal{H}$, the "sandwiched" Rényi $\alpha$-divergence is

$$D_\alpha(\rho\|\sigma) = \frac{1}{\alpha-1}\log\operatorname{Tr}\left[\left(\sigma^{\frac{1-\alpha}{2\alpha}}\, \rho\, \sigma^{\frac{1-\alpha}{2\alpha}}\right)^{\alpha}\right]$$

The gradient with respect to $\rho$ is

$$\nabla_\rho D_\alpha(\rho\|\sigma) = \frac{\alpha}{(\alpha-1)\operatorname{Tr}(A^\alpha)}\, \sigma^\beta A^{\alpha-1} \sigma^\beta$$

where $A = \sigma^\beta \rho\, \sigma^\beta$ and $\beta = (1-\alpha)/(2\alpha)$ (Takahashi et al., 2016). For $\alpha \to 1$, this recovers the quantum relative entropy gradient $\log\rho - \log\sigma$. Analogous results hold for gradient flows of the sandwiched Rényi divergence under GNS-detailed-balance Lindblad semigroups, where the Lindblad equation is proven to be the gradient flow of the divergence with respect to a non-commutative Otto-Wasserstein-like metric (Cao et al., 2018).
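
The following numerical sketch implements the divergence and its gradient for random full-rank density matrices (an assumed setup, not taken from Takahashi et al.), checking the gradient against a finite difference along an arbitrary Hermitian direction, without restricting to trace-preserving perturbations.

```python
import numpy as np

def mpow(H, s):
    """Fractional power of a Hermitian positive-definite matrix."""
    vals, vecs = np.linalg.eigh(H)
    return (vecs * vals**s) @ vecs.conj().T

def sandwiched_renyi(rho, sigma, alpha):
    beta = (1.0 - alpha) / (2.0 * alpha)
    sb = mpow(sigma, beta)
    A = sb @ rho @ sb
    return np.log(np.trace(mpow(A, alpha)).real) / (alpha - 1.0)

def sandwiched_grad(rho, sigma, alpha):
    beta = (1.0 - alpha) / (2.0 * alpha)
    sb = mpow(sigma, beta)
    A = sb @ rho @ sb
    scale = alpha / ((alpha - 1.0) * np.trace(mpow(A, alpha)).real)
    return scale * sb @ mpow(A, alpha - 1.0) @ sb

rng = np.random.default_rng(1)
def rand_state(d):
    M = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    S = M @ M.conj().T
    return S / np.trace(S).real

rho, sigma, alpha = rand_state(3), rand_state(3), 1.5
H = rng.normal(size=(3, 3)); H = (H + H.T) / 2      # Hermitian direction
eps = 1e-6
fd = (sandwiched_renyi(rho + eps * H, sigma, alpha)
      - sandwiched_renyi(rho, sigma, alpha)) / eps
print(np.trace(sandwiched_grad(rho, sigma, alpha) @ H).real, "vs", fd)
```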

5. Applications in Variational Inference and Learning

In exponential families, the gradient of the Rényi divergence with respect to the natural parameter $\theta$ is

$$\nabla_\theta D_\alpha(q_\theta\|\pi) = q_\theta(\Gamma) - \pi^{(\alpha)}_\theta(\Gamma)$$

where $q_\theta(\Gamma)$ is the expectation of the sufficient statistics under $q_\theta$, and $\pi^{(\alpha)}_\theta$ is the $\alpha$-geometric average density (Guilmeau et al., 2022). This identity is used in Rényi-divergence minimization via Bregman proximal gradient algorithms, which interpolate between standard moment matching and "geometric" averages. The relaxed update enables robust inference and provable convergence rates.
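
To see the moment-difference structure concretely, take unit-variance Gaussians as a one-parameter exponential family (an illustrative choice, not the setting of Guilmeau et al.): natural parameter $\theta$ equal to the mean, $\Gamma(x) = x$, and log-partition $A(\theta) = \theta^2/2$. Under this raw parametrization the Rényi gradient is proportional to the moment difference, with factor $\alpha/(1-\alpha)$ for $\alpha \in (0,1)$; rescaled objectives, as in Bregman proximal schemes, absorb such constants.

```python
import numpy as np

# q_theta = N(theta, 1), pi = N(eta, 1); closed-form Renyi divergence
def renyi_gauss(theta, eta, alpha):
    return alpha * (theta - eta) ** 2 / 2.0

theta, eta, alpha, eps = 0.3, -1.2, 0.4, 1e-6

# The alpha-geometric average has natural parameter alpha*theta + (1-alpha)*eta,
# which for this family is also its mean parameter (expected sufficient statistic).
moment_diff = theta - (alpha * theta + (1.0 - alpha) * eta)

fd = (renyi_gauss(theta + eps, eta, alpha) - renyi_gauss(theta, eta, alpha)) / eps
print(fd, "vs", alpha / (1.0 - alpha) * moment_diff)
```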

In deep learning, e.g., in Deep Mutual Learning (DML) with Rényi divergence (Huang et al., 2022), the gradient with respect to the network parameters $\theta_k$ is

$$\nabla_{\theta_k} D_\alpha(P_j\|Q_k) = -\sum_\mu R_{j\to k}(\mu)\, \nabla_{\theta_k}\log Q_k(\mu)$$

with $R_{j\to k}(\mu)$ the tilted "peer-regularization" distribution. This structure generalizes cross-entropy and is tunable via $\alpha$, reducing to KL-based mutual learning in the limit $\alpha \to 1$.
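
A small sketch of this gradient for a single student head, assuming a softmax parametrization of $Q_k$ (an assumption not specified at this level in the source): then $\nabla_z \log Q_k(\mu) = e_\mu - Q_k$ in the logits $z$, and the sum over classes collapses to $Q_k - R_{j\to k}$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def renyi_discrete(P, Q, alpha):
    return np.log(np.sum(P**alpha * Q**(1.0 - alpha))) / (alpha - 1.0)

rng = np.random.default_rng(2)
z = rng.normal(size=6)                 # student logits (synthetic)
P = rng.dirichlet(np.ones(6))          # peer's predictive distribution
alpha = 0.8

Q = softmax(z)
R = P**alpha * Q**(1.0 - alpha)
R /= R.sum()                           # tilted peer-regularization distribution
grad_z = Q - R                         # = -sum_mu R[mu] * (e_mu - Q)

k, eps = 0, 1e-6
ze = z.copy(); ze[k] += eps
fd = (renyi_discrete(P, softmax(ze), alpha) - renyi_discrete(P, Q, alpha)) / eps
print(grad_z[k], "vs finite difference", fd)
```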

6. Quantum Machine Learning and Barren Plateau Avoidance

The maximal Rényi divergence of order two for quantum states,

$$\widetilde{D}_2(\rho\|\sigma) = \log \operatorname{Tr}\left[\rho^2 \sigma^{-1}\right]$$

has the gradient

$$\frac{\partial}{\partial\theta}\, \widetilde{D}_2(\rho\|\sigma(\theta)) = -\frac{\operatorname{Tr}\left[\rho^2 \sigma^{-1} (\partial_\theta \sigma)\, \sigma^{-1}\right]}{\operatorname{Tr}\left[\rho^2 \sigma^{-1}\right]}$$

The unboundedness of this objective (through $\sigma^{-1}$) circumvents the gradient-vanishing ("barren plateau") phenomena prevalent with bounded, linear cost functions in QNN training (Kieferova et al., 2021).
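
A quick numerical check of this expression, with a hypothetical one-parameter model $\sigma(\theta)$ obtained by linearly interpolating two random density matrices (none of these objects come from the cited paper):

```python
import numpy as np

def d2_max(rho, sigma):
    return np.log(np.trace(rho @ rho @ np.linalg.inv(sigma)).real)

def d2_max_grad(rho, sigma, dsigma):
    si = np.linalg.inv(sigma)
    num = np.trace(rho @ rho @ si @ dsigma @ si).real
    den = np.trace(rho @ rho @ si).real
    return -num / den

rng = np.random.default_rng(3)
def rand_state(d):
    M = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    S = M @ M.conj().T
    return S / np.trace(S).real

rho, sigma0, sigma1 = rand_state(3), rand_state(3), rand_state(3)
sigma = lambda t: (1.0 - t) * sigma0 + t * sigma1   # hypothetical model
dsigma = sigma1 - sigma0

t, eps = 0.4, 1e-6
fd = (d2_max(rho, sigma(t + eps)) - d2_max(rho, sigma(t))) / eps
print(d2_max_grad(rho, sigma(t), dsigma), "vs finite difference", fd)
```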

7. Control and Reinforcement Learning

In risk-sensitive control viewed as inference, Rényi divergence gradients drive new classes of policy-gradient methods. For variational objectives parametrized by a risk-sensitivity parameter $\eta$ (with $\alpha = 1+\eta$), the policy gradient is

$$\nabla_\theta \mathcal{L}(\pi_\theta) = \frac{1+\eta}{\eta}\, \mathbb{E}_{r_\theta}\left[\nabla_\theta\log p^\pi_\theta(\tau)\right]$$

with $r_\theta(\tau) \propto p^\pi_\theta(\tau)^{1+\eta}\, p(\tau, O)^{-\eta}$; in policy update steps, the weights $w_\theta(x)$ interpolate between mass-covering ($\eta < 0$) and zero-forcing ($\eta > 0$) behaviors. Actor-critic updates involve similarly explicit Rényi-gradient terms, interpolating between MaxEnt SAC and its risk-sensitive generalizations (Ito et al., 4 Nov 2024).
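
The sketch below implements this gradient for a deliberately minimal toy problem: one-step "trajectories" reduce to a single action $a \sim \pi_\theta = \mathcal{N}(\theta, 1)$, and the optimality density is assumed to be $p(\tau, O) \propto \exp(R(a))$ with $R(a) = -(a-2)^2/2$; none of these modeling choices come from the source. Self-normalizing the tilted weights makes the unknown normalizer of $p(\tau, O)$ irrelevant.

```python
import numpy as np

def risk_sensitive_pg(theta, eta, n=200_000, seed=None):
    """Self-normalized estimate of the Renyi policy gradient (alpha = 1 + eta)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(theta, 1.0, size=n)        # actions from pi_theta
    log_p = -0.5 * (a - theta) ** 2           # log pi_theta, up to a constant
    log_g = -0.5 * (a - 2.0) ** 2             # log p(tau, O), up to a constant
    log_w = eta * (log_p - log_g)             # r_theta ∝ p^(1+eta) g^(-eta)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    score = a - theta                         # grad_theta log pi_theta
    return (1.0 + eta) / eta * np.sum(w * score)

for eta in (-0.5, 0.5):                       # mass-covering vs zero-forcing
    print(eta, risk_sensitive_pg(1.0, eta, seed=0))
```

In this Gaussian toy the expression has the exact value $(1+\eta)(\theta - 2)$, so the two runs land near $-0.5$ and $-1.5$, differing only through the tilt $\eta$.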


References

  • "Exponential decay of Rényi divergence under Fokker-Planck equations" (Cao et al., 2018)
  • "Information geometry of sandwiched Rényi αα-divergence" (Takahashi et al., 2016)
  • "Logarithmic divergences from optimal transport and Rényi geometry" (Wong, 2017)
  • "Rényi Divergence Deep Mutual Learning" (Huang et al., 2022)
  • "Quantum Generative Training Using Rényi Divergences" (Kieferova et al., 2021)
  • "Gradient flow structure and exponential decay of the sandwiched Rényi divergence for primitive Lindblad equations with GNS-detailed balance" (Cao et al., 2018)
  • "Regularized Rényi divergence minimization through Bregman proximal gradient algorithms" (Guilmeau et al., 2022)
  • "Risk-sensitive control as inference with Rényi divergence" (Ito et al., 4 Nov 2024)
