Rényi Divergence Gradient: Theory & Applications

Updated 1 December 2025
  • Rényi divergence gradients are explicit derivative formulas that quantify differences between probability distributions through tilted and weighted expectations.
  • They underpin optimization techniques across classical statistics, variational inference, quantum information, and control through structured gradient flows and geometric insights.
  • They generalize conventional divergences such as the KL divergence, offering robust convergence properties in applications from deep learning to risk-sensitive policy updates.

Rényi divergence gradients underpin a family of information-theoretic optimization methods used in classical statistics, information geometry, quantum information theory, variational inference, machine learning, and control. The explicit forms, computational characteristics, and geometric structures of these gradients are central to theoretical analysis and algorithmic development across these fields.

1. Definition and General Formulas

The Rényi divergence of order $\alpha \neq 1$ between probability measures $p$ and $q$ on a common domain is

$$D_\alpha(p \| q) = \frac{1}{\alpha - 1} \log \int \left( \frac{p(x)}{q(x)} \right)^\alpha q(x)\, dx$$

In the limit $\alpha \to 1$, this reduces to the Kullback-Leibler divergence. The gradient of $D_\alpha(p_\theta\|q)$ with respect to the parameters $\theta$ of a parametric distribution $p_\theta(x)$ takes the generic form

$$\nabla_\theta D_\alpha(p_\theta\|q) = \frac{\alpha}{\alpha-1}\, \mathbb{E}_{r_\theta}\left[\nabla_\theta \log p_\theta(x)\right]$$

where $r_\theta(x)$ is the "tilted" distribution proportional to $p_\theta(x)^\alpha q(x)^{1-\alpha}$, i.e.,

$$r_\theta(x) = \frac{p_\theta(x)^\alpha q(x)^{1-\alpha}}{\int p_\theta(x')^\alpha q(x')^{1-\alpha}\, dx'}$$

An equivalent representation expresses the gradient as an expectation under $p_\theta$ with nonlinear weights

$$w_\theta(x) = \frac{p_\theta(x)^{\alpha-1}\, q(x)^{1-\alpha}}{\mathbb{E}_{p_\theta}\left[p_\theta(x')^{\alpha-1}\, q(x')^{1-\alpha}\right]}$$

so that

$$\nabla_\theta D_\alpha(p_\theta\|q) = \frac{\alpha}{\alpha-1}\, \mathbb{E}_{p_\theta}\left[w_\theta(x)\, \nabla_\theta \log p_\theta(x)\right]$$

These formulas provide the foundation for Rényi-based optimization in both continuous and discrete domains (Ito et al., 4 Nov 2024).
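
As a concrete illustration, the sketch below estimates this gradient by Monte Carlo with self-normalized tilted weights. The model is an assumed toy example, not one from the cited papers: $p_\theta = \mathcal{N}(\theta, 1)$ and $q = \mathcal{N}(0, 1)$, for which $D_\alpha(p_\theta\|q) = \alpha\theta^2/2$, so the exact gradient $\alpha\theta$ serves as a reference.

```python
import numpy as np

def renyi_grad_mc(theta, alpha, n=200_000, seed=None):
    """Monte Carlo estimate of grad_theta D_alpha(p_theta || q) using
    self-normalized tilted weights, for p_theta = N(theta,1), q = N(0,1)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(theta, 1.0, size=n)             # samples from p_theta
    log_ratio = 0.5 * x**2 - 0.5 * (x - theta)**2  # log p_theta(x) - log q(x)
    log_w = (alpha - 1.0) * log_ratio              # w ∝ (p_theta/q)^(alpha-1)
    w = np.exp(log_w - log_w.max())                # stabilize before normalizing
    w /= w.sum()                                   # self-normalized weights
    score = x - theta                              # grad_theta log p_theta(x)
    return alpha / (alpha - 1.0) * np.sum(w * score)

theta, alpha = 1.5, 0.7
print(renyi_grad_mc(theta, alpha, seed=0), "vs exact", alpha * theta)
```

The self-normalization replaces the intractable normalizer of $w_\theta$ with its empirical counterpart, at the cost of a small bias that vanishes as the sample size grows.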

2. Discrete and Geometric Interpretations

For discrete probability vectors $p, q \in \Delta_d$, the gradient with respect to $p$ is

$$\left[\nabla_p D_\alpha(p \| q)\right]_k = \frac{\alpha}{\alpha-1}\, \frac{p_k^{\alpha-1} q_k^{1-\alpha}}{Z}$$

where $Z = \sum_i p_i^\alpha q_i^{1-\alpha}$ (Wong, 2017). In vector notation, this is

$$\nabla_p D_\alpha(p\|q) = \frac{\alpha}{\alpha-1}\, \frac{p^{\odot(\alpha-1)} \odot q^{\odot(1-\alpha)}}{Z}$$

This view aligns with the geometry of statistical manifolds: Rényi divergence gradients are associated with canonical divergences on dually flat or constant curvature manifolds, connecting to optimal transport and Bregman geometry (Wong, 2017). Flows induced by these gradients follow geodesics defined by the underlying Riemannian metric.
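
A minimal numerical check of the discrete formula (the simplex points and coordinate are arbitrary illustrative choices); the finite difference is taken in the ambient coordinates, matching the unconstrained gradient above.

```python
import numpy as np

def renyi_div(p, q, alpha):
    return np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0)

def renyi_grad(p, q, alpha):
    Z = np.sum(p**alpha * q**(1.0 - alpha))
    return alpha / (alpha - 1.0) * p**(alpha - 1.0) * q**(1.0 - alpha) / Z

rng = np.random.default_rng(0)
p, q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
alpha, eps, k = 0.5, 1e-7, 2

pe = p.copy()
pe[k] += eps  # unconstrained perturbation of coordinate k
fd = (renyi_div(pe, q, alpha) - renyi_div(p, q, alpha)) / eps
print(renyi_grad(p, q, alpha)[k], "vs finite difference", fd)
```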

3. Fokker-Planck Equations and Gradient-Flow Structure

On $\mathbb{R}^d$, for the Fokker-Planck equation with drift $\nabla V(x)$ and strictly convex potential $V$, the time derivative (gradient flow) of the Rényi divergence $D_\alpha(p_t\|p_\infty)$ along the solution $p_t$ has the form

$$\frac{d}{dt} D_\alpha(p_t\|p_\infty) = -\frac{1}{\alpha} \int |\nabla \Phi_t(x)|^2\, d\mu_t(x) = -I_\alpha(p_t \| p_\infty)$$

where $\Phi_t = \alpha \phi_t + (\alpha-1) D_\alpha(p_t\|p_\infty)$, $\phi_t = -\log(p_t/p_\infty)$, and $\mu_t$ is the "escort" distribution; $I_\alpha$ is the relative $\alpha$-Fisher information. This framework yields exponential decay rates for the Rényi divergence, extending log-Sobolev inequalities from $\alpha = 1$ to arbitrary $\alpha > 0$ (Cao et al., 2018).
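
Concretely, if a Rényi analogue of the log-Sobolev inequality holds, say $I_\alpha(p\|p_\infty) \ge 2\lambda\, D_\alpha(p\|p_\infty)$ for some $\lambda > 0$ (the constant and its dependence on $\alpha$ vary across formulations), Grönwall's lemma turns the dissipation identity above into exponential decay:

$$\frac{d}{dt} D_\alpha(p_t\|p_\infty) \le -2\lambda\, D_\alpha(p_t\|p_\infty) \quad\Longrightarrow\quad D_\alpha(p_t\|p_\infty) \le e^{-2\lambda t}\, D_\alpha(p_0\|p_\infty).$$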

4. Quantum Generalizations: Sandwiched Rényi Gradients

For quantum states (faithful density operators) $\rho$ and $\sigma$ on a finite-dimensional Hilbert space $\mathcal{H}$, the "sandwiched" Rényi $\alpha$-divergence is

$$D_\alpha(\rho\|\sigma) = \frac{1}{\alpha-1}\log\operatorname{Tr}\left[\left(\sigma^{\frac{1-\alpha}{2\alpha}}\, \rho\, \sigma^{\frac{1-\alpha}{2\alpha}}\right)^{\alpha}\right]$$

The gradient with respect to $\rho$ is

$$\nabla_\rho D_\alpha(\rho\|\sigma) = \frac{\alpha}{(\alpha-1)\operatorname{Tr}(A^\alpha)}\, \sigma^\beta A^{\alpha-1} \sigma^\beta$$

where $A = \sigma^\beta \rho\, \sigma^\beta$ and $\beta = (1-\alpha)/(2\alpha)$ (Takahashi et al., 2016). For $\alpha \to 1$, this recovers the quantum relative entropy gradient $\log\rho - \log\sigma$. Analogous results hold for gradient flows of the sandwiched Rényi divergence under GNS-detailed-balance Lindblad semigroups, where the Lindblad equation is proven to be the gradient flow of the divergence with respect to a non-commutative Otto-Wasserstein-like metric (Cao et al., 2018).
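
The following numerical sketch implements the divergence and its gradient for random full-rank density matrices (an assumed setup, not taken from Takahashi et al.), checking the gradient against a finite difference along an arbitrary Hermitian direction, without restricting to trace-preserving perturbations.

```python
import numpy as np

def mpow(H, s):
    """Fractional power of a Hermitian positive-definite matrix."""
    vals, vecs = np.linalg.eigh(H)
    return (vecs * vals**s) @ vecs.conj().T

def sandwiched_renyi(rho, sigma, alpha):
    beta = (1.0 - alpha) / (2.0 * alpha)
    sb = mpow(sigma, beta)
    A = sb @ rho @ sb
    return np.log(np.trace(mpow(A, alpha)).real) / (alpha - 1.0)

def sandwiched_grad(rho, sigma, alpha):
    beta = (1.0 - alpha) / (2.0 * alpha)
    sb = mpow(sigma, beta)
    A = sb @ rho @ sb
    scale = alpha / ((alpha - 1.0) * np.trace(mpow(A, alpha)).real)
    return scale * sb @ mpow(A, alpha - 1.0) @ sb

rng = np.random.default_rng(1)
def rand_state(d):
    M = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    S = M @ M.conj().T
    return S / np.trace(S).real

rho, sigma, alpha = rand_state(3), rand_state(3), 1.5
H = rng.normal(size=(3, 3)); H = (H + H.T) / 2      # Hermitian direction
eps = 1e-6
fd = (sandwiched_renyi(rho + eps * H, sigma, alpha)
      - sandwiched_renyi(rho, sigma, alpha)) / eps
print(np.trace(sandwiched_grad(rho, sigma, alpha) @ H).real, "vs", fd)
```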

5. Applications in Variational Inference and Learning

In exponential families, the gradient of the Rényi divergence with respect to the natural parameter $\theta$ is

$$\nabla_\theta D_\alpha(q_\theta\|\pi) = q_\theta(\Gamma) - \pi^{(\alpha)}_\theta(\Gamma)$$

where $q_\theta(\Gamma)$ is the expectation of the sufficient statistics under $q_\theta$, and $\pi^{(\alpha)}_\theta$ is the $\alpha$-geometric average density (Guilmeau et al., 2022). This identity is used in Rényi-divergence minimization via Bregman proximal gradient algorithms, which interpolate between standard moment matching and "geometric" averages. The relaxed update enables robust inference and provable convergence rates.
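
To see the moment-difference structure concretely, take unit-variance Gaussians as a one-parameter exponential family (an illustrative choice, not the setting of Guilmeau et al.): natural parameter $\theta$ equal to the mean, $\Gamma(x) = x$, and log-partition $A(\theta) = \theta^2/2$. Under this raw parametrization the Rényi gradient is proportional to the moment difference, with factor $\alpha/(1-\alpha)$ for $\alpha \in (0,1)$; rescaled objectives, as in Bregman proximal schemes, absorb such constants.

```python
import numpy as np

# q_theta = N(theta, 1), pi = N(eta, 1); closed-form Renyi divergence
def renyi_gauss(theta, eta, alpha):
    return alpha * (theta - eta) ** 2 / 2.0

theta, eta, alpha, eps = 0.3, -1.2, 0.4, 1e-6

# The alpha-geometric average has natural parameter alpha*theta + (1-alpha)*eta,
# which for this family is also its mean parameter (expected sufficient statistic).
moment_diff = theta - (alpha * theta + (1.0 - alpha) * eta)

fd = (renyi_gauss(theta + eps, eta, alpha) - renyi_gauss(theta, eta, alpha)) / eps
print(fd, "vs", alpha / (1.0 - alpha) * moment_diff)
```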

In deep learning, e.g., in Deep Mutual Learning (DML) with Rényi divergence (Huang et al., 2022), the gradient with respect to the network parameters $\theta_k$ is

$$\nabla_{\theta_k} D_\alpha(P_j\|Q_k) = -\sum_\mu R_{j\to k}(\mu)\, \nabla_{\theta_k}\log Q_k(\mu)$$

with $R_{j\to k}(\mu)$ the tilted "peer-regularization" distribution. This structure generalizes cross-entropy and is tunable via $\alpha$, reducing to KL-based mutual learning in the limit $\alpha \to 1$.
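
A small sketch of this gradient for a single student head, assuming a softmax parametrization of $Q_k$ (an assumption not specified at this level in the source): then $\nabla_z \log Q_k(\mu) = e_\mu - Q_k$ in the logits $z$, and the sum over classes collapses to $Q_k - R_{j\to k}$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def renyi_discrete(P, Q, alpha):
    return np.log(np.sum(P**alpha * Q**(1.0 - alpha))) / (alpha - 1.0)

rng = np.random.default_rng(2)
z = rng.normal(size=6)                 # student logits (synthetic)
P = rng.dirichlet(np.ones(6))          # peer's predictive distribution
alpha = 0.8

Q = softmax(z)
R = P**alpha * Q**(1.0 - alpha)
R /= R.sum()                           # tilted peer-regularization distribution
grad_z = Q - R                         # = -sum_mu R[mu] * (e_mu - Q)

k, eps = 0, 1e-6
ze = z.copy(); ze[k] += eps
fd = (renyi_discrete(P, softmax(ze), alpha) - renyi_discrete(P, Q, alpha)) / eps
print(grad_z[k], "vs finite difference", fd)
```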

6. Quantum Machine Learning and Barren Plateau Avoidance

The maximal Rényi divergence of order two for quantum states,

$$\widetilde{D}_2(\rho\|\sigma) = \log \operatorname{Tr}\left[\rho^2 \sigma^{-1}\right]$$

has the gradient

$$\frac{\partial}{\partial\theta}\, \widetilde{D}_2(\rho\|\sigma(\theta)) = -\frac{\operatorname{Tr}\left[\rho^2 \sigma^{-1} (\partial_\theta \sigma)\, \sigma^{-1}\right]}{\operatorname{Tr}\left[\rho^2 \sigma^{-1}\right]}$$

The unboundedness of this objective (through $\sigma^{-1}$) circumvents the gradient-vanishing ("barren plateau") phenomena prevalent with bounded, linear cost functions in QNN training (Kieferova et al., 2021).
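
A quick numerical check of this expression, with a hypothetical one-parameter model $\sigma(\theta)$ obtained by linearly interpolating two random density matrices (none of these objects come from the cited paper):

```python
import numpy as np

def d2_max(rho, sigma):
    return np.log(np.trace(rho @ rho @ np.linalg.inv(sigma)).real)

def d2_max_grad(rho, sigma, dsigma):
    si = np.linalg.inv(sigma)
    num = np.trace(rho @ rho @ si @ dsigma @ si).real
    den = np.trace(rho @ rho @ si).real
    return -num / den

rng = np.random.default_rng(3)
def rand_state(d):
    M = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    S = M @ M.conj().T
    return S / np.trace(S).real

rho, sigma0, sigma1 = rand_state(3), rand_state(3), rand_state(3)
sigma = lambda t: (1.0 - t) * sigma0 + t * sigma1   # hypothetical model
dsigma = sigma1 - sigma0

t, eps = 0.4, 1e-6
fd = (d2_max(rho, sigma(t + eps)) - d2_max(rho, sigma(t))) / eps
print(d2_max_grad(rho, sigma(t), dsigma), "vs finite difference", fd)
```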

7. Control and Reinforcement Learning

In risk-sensitive control viewed as inference, Rényi divergence gradients drive new classes of policy-gradient methods. For variational objectives parametrized by a risk-sensitivity parameter $\eta$ (with $\alpha = 1+\eta$), the policy gradient is

$$\nabla_\theta \mathcal{L}(\pi_\theta) = \frac{1+\eta}{\eta}\, \mathbb{E}_{r_\theta}\left[\nabla_\theta\log p^\pi_\theta(\tau)\right]$$

with $r_\theta(\tau) \propto p^\pi_\theta(\tau)^{1+\eta}\, p(\tau, O)^{-\eta}$; in policy update steps, the weights $w_\theta(x)$ interpolate between mass-covering ($\eta < 0$) and zero-forcing ($\eta > 0$) behaviors. Actor-critic updates involve similarly explicit Rényi-gradient terms, interpolating between MaxEnt SAC and its risk-sensitive generalizations (Ito et al., 4 Nov 2024).
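
The sketch below implements this gradient for a deliberately minimal toy problem: one-step "trajectories" reduce to a single action $a \sim \pi_\theta = \mathcal{N}(\theta, 1)$, and the optimality density is assumed to be $p(\tau, O) \propto \exp(R(a))$ with $R(a) = -(a-2)^2/2$; none of these modeling choices come from the source. Self-normalizing the tilted weights makes the unknown normalizer of $p(\tau, O)$ irrelevant.

```python
import numpy as np

def risk_sensitive_pg(theta, eta, n=200_000, seed=None):
    """Self-normalized estimate of the Renyi policy gradient (alpha = 1 + eta)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(theta, 1.0, size=n)        # actions from pi_theta
    log_p = -0.5 * (a - theta) ** 2           # log pi_theta, up to a constant
    log_g = -0.5 * (a - 2.0) ** 2             # log p(tau, O), up to a constant
    log_w = eta * (log_p - log_g)             # r_theta ∝ p^(1+eta) g^(-eta)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    score = a - theta                         # grad_theta log pi_theta
    return (1.0 + eta) / eta * np.sum(w * score)

for eta in (-0.5, 0.5):                       # mass-covering vs zero-forcing
    print(eta, risk_sensitive_pg(1.0, eta, seed=0))
```

In this Gaussian toy the expression has the exact value $(1+\eta)(\theta - 2)$, so the two runs land near $-0.5$ and $-1.5$, differing only through the tilt $\eta$.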


References

  • "Exponential decay of Rényi divergence under Fokker-Planck equations" (Cao et al., 2018)
  • "Information geometry of sandwiched Rényi αα-divergence" (Takahashi et al., 2016)
  • "Logarithmic divergences from optimal transport and Rényi geometry" (Wong, 2017)
  • "Rényi Divergence Deep Mutual Learning" (Huang et al., 2022)
  • "Quantum Generative Training Using Rényi Divergences" (Kieferova et al., 2021)
  • "Gradient flow structure and exponential decay of the sandwiched Rényi divergence for primitive Lindblad equations with GNS-detailed balance" (Cao et al., 2018)
  • "Regularized Rényi divergence minimization through Bregman proximal gradient algorithms" (Guilmeau et al., 2022)
  • "Risk-sensitive control as inference with Rényi divergence" (Ito et al., 4 Nov 2024)
