- The paper presents Rényi divergence as a one-parameter family of divergence measures between probability distributions that generalizes KL divergence, with the parameter α specifying the order.
- It demonstrates key properties including convexity, continuity, and the data processing inequality that underpin applications in channel capacity and hypothesis testing.
- The study connects Rényi divergence with related metrics like Hellinger distance and Chernoff information, offering insights for advancements in statistical learning and communications.
An Essay on: Rényi Divergence and Kullback-Leibler Divergence
Introduction
In information theory, the concepts of Shannon entropy and Kullback-Leibler (KL) divergence are foundational. The paper by Tim van Erven and Peter Harremoës extends our understanding by presenting a detailed analysis of Rényi divergence (RD), which generalizes KL divergence through a parameter α that specifies the order of the divergence. Rényi divergence, also linked to Rényi entropy, finds applicability in a wide range of theoretical and practical settings. This summary encapsulates the significant properties, results, and implications from the paper.
Definitions and Basic Properties
For orders α ∈ (0,1) ∪ (1,∞), the Rényi divergence between two probability distributions P and Q is defined as:
$$D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \ln \int p^{\alpha} q^{1-\alpha} \, d\mu,$$
where p and q are the densities of P and Q with respect to a common reference measure μ. The order α = 1 is defined by continuity: as α → 1, Rényi divergence converges to the KL divergence. The authors extend this definition to continuous spaces and illustrate that RD can be defined consistently via discretizations. For specific values of α, RD links to other divergence measures, such as the squared Hellinger distance (α = 1/2) and the χ²-divergence (α = 2).
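To make the definition concrete, here is a minimal Python sketch (not from the paper) that computes Rényi divergence for finite distributions and checks the special cases just mentioned. It assumes strictly positive q, works in nats, and the helper name renyi_divergence and the test distributions are purely illustrative.

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """Rényi divergence D_alpha(P||Q) for finite distributions (natural log, nats)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.isclose(alpha, 1.0):
        # Order 1 is the KL divergence, the limit of the general formula as alpha -> 1.
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

# As alpha -> 1, the divergence approaches KL(P||Q).
print(renyi_divergence(p, q, 0.999), renyi_divergence(p, q, 1.0))

# alpha = 1/2 relates to the squared Hellinger distance H^2 = sum (sqrt(p) - sqrt(q))^2.
H2 = np.sum((np.sqrt(p) - np.sqrt(q))**2)
print(renyi_divergence(p, q, 0.5), -2 * np.log(1 - H2 / 2))

# alpha = 2 relates to the chi-squared divergence chi2 = sum (p - q)^2 / q.
chi2 = np.sum((p - q)**2 / q)
print(renyi_divergence(p, q, 2.0), np.log(1 + chi2))
```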
Key Theoretical Results
The paper explores diverse attributes of RD and KL divergence:
- Convexity and Continuity: Rényi divergence possesses several convexity properties. For orders α ∈ [0,1] it is jointly convex in its pair of arguments, and for all orders α ∈ [0,∞] it is convex in its second argument. In addition, for α ∈ (0,1) it is uniformly continuous with respect to the total variation topology.
- Minimax Redundancy and Channel Capacity: Extending previous work, the authors show that for finite state spaces, channel capacity Cα equals the minimax redundancy Rα for any α∈[0,∞]. This establishes Rényi divergence's pivotal role in information theory, where channel capacity is regarded as a critical measure of the maximum achievable rate of reliable communication.
- Pythagorean Inequality: A generalized Pythagorean inequality for RD is derived, showing that RD can be decomposed into two non-negative terms, akin to KL divergence's decomposability, given α-convex sets of distributions.
- Data Processing Inequality: A central result is the data processing inequality, which states that applying the same (possibly stochastic) transformation to both distributions cannot increase RD, for any α ∈ [0,∞]. This is crucial for many applications, including hypothesis testing and learning theory (illustrated numerically after this list).
- Chernoff Information: An insightful result is the connection between RD and Chernoff information in binary hypothesis testing. Viewed as a function of α, (α−1)Dα(P∥Q) is the cumulant generating function of the random variable ln(p/q) under Q, and the Chernoff information equals the supremum of (1−α)Dα(P∥Q) over α ∈ (0,1) (also checked numerically after this list). This connection bridges KL divergence, RD, and hypothesis testing.
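As a concrete illustration of the data processing inequality, the following sketch pushes two input distributions through the same randomly drawn channel, represented as a column-stochastic matrix, and verifies that the divergence between the outputs never exceeds the divergence between the inputs. It reuses the renyi_divergence helper from the earlier sketch; the channel dimensions and random seed are arbitrary choices, not anything prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A channel from a 4-symbol input to a 3-symbol output:
# column i holds the conditional distribution of the output given input i.
K = rng.random((3, 4))
K /= K.sum(axis=0, keepdims=True)  # make each column a probability vector

p = rng.dirichlet(np.ones(4))
q = rng.dirichlet(np.ones(4))

for alpha in [0.5, 1.0, 2.0, 5.0]:
    before = renyi_divergence(p, q, alpha)         # divergence between the inputs
    after = renyi_divergence(K @ p, K @ q, alpha)  # divergence between the channel outputs
    assert after <= before + 1e-12
    print(f"alpha={alpha}: {after:.4f} <= {before:.4f}")
```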
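The cumulant-generating-function identity and the Chernoff-information characterization from the Chernoff Information item can also be verified on a toy example. The sketch below again reuses renyi_divergence and approximates the supremum over α by a simple grid search; the grid-search shortcut is an illustrative device, not the paper's derivation.

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.3, 0.5])

# (alpha - 1) * D_alpha(P||Q) equals ln E_Q[(p/q)^alpha],
# i.e. the cumulant generating function of ln(p/q) under Q, evaluated at alpha.
for alpha in [0.3, 0.7, 2.0]:
    cgf = np.log(np.sum(q * (p / q) ** alpha))
    assert np.isclose((alpha - 1) * renyi_divergence(p, q, alpha), cgf)

# Chernoff information: the supremum of (1 - alpha) * D_alpha(P||Q) over alpha in (0, 1).
grid = np.linspace(0.01, 0.99, 99)
chernoff_via_renyi = max((1 - a) * renyi_divergence(p, q, a) for a in grid)
chernoff_direct = -min(np.log(np.sum(p**a * q**(1 - a))) for a in grid)
print(chernoff_via_renyi, chernoff_direct)  # the two grid approximations coincide
```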
Extensions and Special Cases
The authors take a meticulous approach to extending RD to negative orders α ∈ [−∞, 0). They show that while negative orders can be defined, they often have properties opposite to those of positive orders. Skew symmetry is one such linking property: Dα(P∥Q) = (α/(1−α)) D1−α(Q∥P), which relates the order α to the order 1−α with the arguments of the divergence interchanged.
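A quick numerical sanity check of the skew-symmetry relation for orders in (0,1), once more reusing the renyi_divergence helper from the first sketch:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.4, 0.5])

# Skew symmetry: D_alpha(P||Q) = (alpha / (1 - alpha)) * D_{1-alpha}(Q||P) for 0 < alpha < 1.
for alpha in [0.2, 0.5, 0.8]:
    lhs = renyi_divergence(p, q, alpha)
    rhs = alpha / (1 - alpha) * renyi_divergence(q, p, 1 - alpha)
    assert np.isclose(lhs, rhs)
```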
Practical Implications and Future Directions
The implications of this research are extensive in both theoretical and applied domains. Practical applications include improving algorithms in statistical learning, enhancing methods in hypothesis testing, and fine-tuning models in predictive analytics. The relation of RD to total variation distance (as presented in Gilardoni's extension of Pinsker's inequality) further highlights its utility across numerous fields involving statistical divergence measures.
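As a rough illustration of that relation, the sketch below empirically probes a Pinsker-type bound of the form Dα(P∥Q) ≥ (α/2)V², where V = Σ|p − q| is the total variation; the constant α/2 and the range α ∈ (0,1] are assumptions of this sketch, to be read against the paper's statement of Gilardoni's result. It reuses renyi_divergence from the first sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Empirically probe a Pinsker-type bound: D_alpha(P||Q) >= (alpha / 2) * V(P, Q)^2
# for alpha in (0, 1], with V the total variation sum |p_i - q_i| (assumed form of the bound).
for _ in range(1000):
    p, q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
    V = np.sum(np.abs(p - q))
    for alpha in [0.25, 0.5, 0.75, 1.0]:
        assert renyi_divergence(p, q, alpha) >= alpha / 2 * V**2 - 1e-12
```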
Conclusion
This comprehensive treatment of Rényi divergence by van Erven and Harremoës enriches the theoretical landscape by establishing new properties and extending classical results like those related to KL divergence. Their contribution is poised to stimulate further research into applications of RD across a broad spectrum of disciplines, including machine learning, communications, and information theory.
References
- Refer to the bibliography in the original paper for the detailed list of references and further reading on various subtopics discussed.