Information-Theoretic f-Divergences
- Information-theoretic f-divergences are defined via convex functions and provide a unified way to quantify dissimilarity between probability distributions; most of them are not metrics in the strict sense.
- They encompass classical divergences such as KL, total variation, Hellinger, and extend to settings like quantum operations and Markov chains with sharp inequality bounds.
- Recent advances include transport and mixed f-divergences with variational formulations that enhance applications in privacy, generative modeling, and statistical inference.
An f-divergence is a central object in information theory, statistics, geometry, and quantum information, providing a unifying formalism for quantifying dissimilarity between probability distributions or operators. Originating from convex analysis, classical f-divergences include the Kullback–Leibler divergence, total variation, Hellinger, χ², Rényi, and Jensen–Shannon divergences. The framework extends to Markov chains, quantum operations, log-concave functions, and convex bodies, and underlies recent advances in statistical learning, privacy, and variational inference.
1. General Theory of f-Divergences
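Let $P$ and $Q$ be probability measures on a measurable space, with $P \ll Q$. For a convex function $f : (0, \infty) \to \mathbb{R}$ with $f(1) = 0$, the f-divergence is
$$D_f(P \,\|\, Q) = \int f\!\left(\frac{dP}{dQ}\right) dQ,$$
or, on a finite set, $D_f(P \,\|\, Q) = \sum_x q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)$. The conjugate function $\tilde f(t) = t\, f(1/t)$ gives $D_{\tilde f}(P \,\|\, Q) = D_f(Q \,\|\, P)$. Classical f-divergences arise as specific choices: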
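- Kullback–Leibler: $f(t) = t \log t$
- Total variation: $f(t) = \tfrac{1}{2}\,|t - 1|$
- Pearson–χ²: $f(t) = (t - 1)^2$
- Hellinger: $f(t) = (\sqrt{t} - 1)^2$
- Rényi $\alpha$-divergences: $f_\alpha(t) = \frac{t^\alpha - 1}{\alpha - 1}$ for $\alpha \in (0,1) \cup (1,\infty)$. This generator yields the Hellinger divergence of order $\alpha$, and the Rényi divergence is its monotone transform $\frac{1}{\alpha - 1} \log\big(1 + (\alpha - 1)\, D_{f_\alpha}(P \,\|\, Q)\big)$.

Normalizations of total variation and Hellinger vary across the cited works; the choices above place total variation in $[0, 1]$. Symmetric f-divergences satisfy $D_f(P \,\|\, Q) = D_f(Q \,\|\, P)$ (Sason, 2015).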
Key properties include nonnegativity, joint convexity in the pair $(P, Q)$ (and hence convexity in each argument), and the data-processing inequality.
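The finite-set definition can be illustrated with a minimal sketch. It assumes strictly positive $q$, the generator normalizations $t \log t$, $\tfrac{1}{2}|t-1|$, $(t-1)^2$ and $(\sqrt{t}-1)^2$, and hypothetical helper names (`f_divergence`, `GENERATORS`) that are not taken from any of the cited works.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P||Q) = sum_x q(x) * f(p(x)/q(x)) on a finite set (assumes q > 0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if np.any(q <= 0):
        raise ValueError("this sketch assumes q(x) > 0 for every x")
    return float(np.sum(q * f(p / q)))

def xlogx(t):
    """t * log(t), with the convention 0 * log(0) = 0."""
    t = np.asarray(t, dtype=float)
    safe = np.where(t > 0, t, 1.0)  # avoid log(0) warnings
    return np.where(t > 0, t * np.log(safe), 0.0)

# Generators under the normalizations assumed in the lead-in.
GENERATORS = {
    "kl": xlogx,
    "tv": lambda t: 0.5 * np.abs(t - 1.0),
    "chi2": lambda t: (t - 1.0) ** 2,
    "hellinger2": lambda t: (np.sqrt(t) - 1.0) ** 2,
}

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])
values = {name: f_divergence(p, q, f) for name, f in GENERATORS.items()}
for name, value in values.items():
    print(f"{name}: {value:.6f}")
```

Passing the generator as a function keeps the divergence itself generic, so the conjugate relation can be checked by supplying `lambda t: t * f(1.0 / t)` and swapping the arguments.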
2. Inequalities and Joint Range between f-Divergences
An extensive body of work has developed sharp inequalities linking different f-divergences, often expressed in terms of the total variation or other symmetric distances.
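- Pinsker’s inequality: $D_{\mathrm{KL}}(P \,\|\, Q) \ge 2\, \mathrm{TV}(P, Q)^2$ (in nats, with $\mathrm{TV}(P, Q) = \tfrac{1}{2} \sum_x |p(x) - q(x)| \in [0, 1]$), with the constant being best possible (Harremoës et al., 2010, Guntuboyina et al., 2013, 0903.1765, Sason et al., 2016). Two-point distributions suffice to show that the constant cannot be improved, and the inequality generalizes to other f-divergences using joint range theory (Harremoës et al., 2010).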
- Functional domination: If $f(t) \le c\, g(t)$ for all $t > 0$, then $D_f(P \,\|\, Q) \le c\, D_g(P \,\|\, Q)$, with tightness achieved via maximizing over two-point supports. The best constant is $c^* = \sup_{t \ne 1} f(t)/g(t)$ (Sason et al., 2016).
- Symmetric divergence bounds: For Jensen–Shannon, Hellinger, Bhattacharyya, and Chernoff distances, sharp lower and upper bounds in terms of total variation are established and attained by extremal two- or three-point distributions (Sason, 2015).
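- Integral relations: KL divergence and χ²-divergence are linked via
$$D_{\mathrm{KL}}(P \,\|\, Q) = \int_0^1 \chi^2\big(P \,\|\, (1 - \lambda) P + \lambda Q\big)\, \frac{d\lambda}{\lambda}$$
(in nats) and related two-sided bounds (Nishiyama et al., 2020).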
These inequalities furnish universal lower bounds and enable tight sandwich relations between information-theoretic measures.
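Two of these relations are easy to check numerically on a finite set. The sketch below assumes natural logarithms, total variation normalized to $[0, 1]$, strictly positive distributions, and the integral identity in the form $D_{\mathrm{KL}}(P \,\|\, Q) = \int_0^1 \chi^2(P \,\|\, (1-\lambda)P + \lambda Q)\, d\lambda/\lambda$; the function names are illustrative only.

```python
import numpy as np

def kl(p, q):
    """KL divergence in nats (assumes p, q > 0)."""
    return float(np.sum(p * np.log(p / q)))

def tv(p, q):
    """Total variation, normalized to lie in [0, 1]."""
    return 0.5 * float(np.sum(np.abs(p - q)))

def kl_via_chi2(p, q, n=4001):
    """Trapezoid-rule approximation of int_0^1 chi2(P || (1-l)P + lQ) dl / l."""
    lam = np.linspace(0.0, 1.0, n)
    mix = (1.0 - lam)[:, None] * p + lam[:, None] * q  # shape (n, k)
    # Since p - mix = lam * (p - q), chi2(P || mix) / lam equals
    # lam * sum_x (p - q)^2 / mix, which stays finite at lam = 0.
    g = lam * np.sum((p - q) ** 2 / mix, axis=1)
    h = lam[1] - lam[0]
    return float(h * (np.sum(g) - 0.5 * (g[0] + g[-1])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])

pinsker_gap = kl(p, q) - 2.0 * tv(p, q) ** 2  # nonnegative by Pinsker
integral_error = abs(kl_via_chi2(p, q) - kl(p, q))
print(f"Pinsker gap: {pinsker_gap:.6f}, integral identity error: {integral_error:.2e}")
```

Rewriting the integrand before discretizing avoids the apparent singularity at $\lambda = 0$, so a plain trapezoid rule is sufficient here.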
3. Geometric, Transport, and Mixed f-Divergence Variants
Beyond classical settings, f-divergence theory extends to broader geometric and analytic contexts.
Transport f-divergences (Li, 22 Apr 2025) measure the difference between one-dimensional densities via optimal transport: $D_{T,f}(p \,\|\, q) = \int f\big(T'(x)\big)\, q(x)\, dx$, where $T$ is the monotone map pushing $q$ to $p$. These objects retain invariance and convexity properties and have dual variational characterizations using the Legendre transform of the associated generator.
Mixed f-divergence generalizes to $n$-tuples of density pairs (or log-concave functions), via geometric means of individual divergences. For log-concave functions $\varphi_1, \dots, \varphi_n$ with convex generators $f_1, \dots, f_n$, the mixed f-divergence integrates the geometric mean of the $n$ individual f-divergence integrands. It satisfies affine invariance, Alexandrov–Fenchel type inequalities, and sharp isoperimetric bounds (Caglar et al., 2014). For convex bodies, the f-divergence of cone-measures on the boundary encodes and extends classical affine surface area invariants and isoperimetric inequalities in geometry (Werner, 2012).
4. Quantum and Markov Chain f-Divergences
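Quantum f-divergences (Hiai et al., 2010, Hiai et al., 2016) generalize classical f-divergences to density operators on a finite-dimensional Hilbert space. The Petz f-divergence of density operators $\rho$ and $\sigma$ (with $\sigma$ invertible) is defined as
$$S_f(\rho \,\|\, \sigma) = \mathrm{Tr}\, \sigma^{1/2}\, f\big(L_\rho R_{\sigma^{-1}}\big)\big(\sigma^{1/2}\big),$$
where $L_\rho$ denotes left multiplication by $\rho$ and $R_{\sigma^{-1}}$ right multiplication by $\sigma^{-1}$. Other variants include maximal (Matsumoto), measured, and sandwiched Rényi divergences. For operator-convex $f$, Petz’s f-divergence is monotone under quantum operations, and equality in this data-processing inequality corresponds to reversibility of the operation on the pair of states.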
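For small matrices, a Petz-type f-divergence can be evaluated from spectral decompositions. The sketch below assumes the form $S_f(\rho \,\|\, \sigma) = \mathrm{Tr}\, \sigma^{1/2} f(L_\rho R_{\sigma^{-1}})(\sigma^{1/2})$ with invertible $\sigma$. Writing $\rho = \sum_a \lambda_a |u_a\rangle\langle u_a|$ and $\sigma = \sum_b \mu_b |v_b\rangle\langle v_b|$, this reduces to $\sum_{a,b} \mu_b\, f(\lambda_a/\mu_b)\, |\langle u_a | v_b \rangle|^2$. The function name and the example states are hypothetical.

```python
import numpy as np

def xlogx(t):
    """t * log(t), with the convention 0 * log(0) = 0."""
    t = np.asarray(t, dtype=float)
    safe = np.where(t > 0, t, 1.0)
    return np.where(t > 0, t * np.log(safe), 0.0)

def petz_f_divergence(rho, sigma, f):
    """sum_{a,b} mu_b * f(lam_a / mu_b) * |<u_a|v_b>|^2 for invertible sigma."""
    lam, U = np.linalg.eigh(rho)    # rho   = U diag(lam) U^dagger
    mu, V = np.linalg.eigh(sigma)   # sigma = V diag(mu)  V^dagger
    if np.min(mu) <= 0:
        raise ValueError("this sketch assumes sigma is invertible")
    lam = np.clip(lam, 0.0, None)   # remove tiny negative round-off
    overlap = np.abs(U.conj().T @ V) ** 2   # overlap[a, b] = |<u_a|v_b>|^2
    ratio = lam[:, None] / mu[None, :]
    return float(np.sum(mu[None, :] * f(ratio) * overlap))

# Two illustrative qubit states (Hermitian, positive definite, unit trace).
rho = np.array([[0.7, 0.2], [0.2, 0.3]])
sigma = np.array([[0.5, -0.1], [-0.1, 0.5]])

relative_entropy = petz_f_divergence(rho, sigma, xlogx)
print(f"Petz divergence with f(t) = t log t: {relative_entropy:.6f}")
```

With $f(t) = t \log t$ this expression coincides with the Umegaki relative entropy $\mathrm{Tr}\, \rho(\log \rho - \log \sigma)$, and for commuting (diagonal) states it reduces to the classical finite-set f-divergence.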
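Markov chain f-divergences (Wang et al., 2023) adapt the classical framework to transition matrices $M$ and $L$ of Markov chains with a reference measure $\pi$ via
$$D_f(M \,\|\, L) = \sum_x \pi(x) \sum_y L(x, y)\, f\!\left(\frac{M(x, y)}{L(x, y)}\right).$$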
Rényi-type, total variation, χ², and Hellinger divergences are all included. Pinsker-type inequalities, Chernoff information, and Pythagorean identities are developed in this setting, with explicit applications to mixing rates, spectral gap bounds, and hypothesis testing of Markov processes.
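As a rough illustration, the sketch below assumes that the Markov chain f-divergence is the $\pi$-weighted average of row-wise f-divergences, $\sum_x \pi(x) \sum_y L(x,y)\, f(M(x,y)/L(x,y))$, with strictly positive $L$. The function name and the example matrices are hypothetical and not taken from the cited paper.

```python
import numpy as np

def markov_f_divergence(M, L, pi, f):
    """pi-weighted f-divergence between transition matrices M and L (assumes L > 0)."""
    M = np.asarray(M, dtype=float)
    L = np.asarray(L, dtype=float)
    pi = np.asarray(pi, dtype=float)
    if np.any(L <= 0):
        raise ValueError("this sketch assumes L(x, y) > 0 for all x, y")
    # Row x contributes pi(x) * sum_y L(x, y) * f(M(x, y) / L(x, y)).
    return float(np.sum(pi[:, None] * L * f(M / L)))

def kl_generator(t):
    """f(t) = t log t for strictly positive t."""
    return t * np.log(t)

M = np.array([[0.9, 0.1], [0.2, 0.8]])   # a "sticky" two-state chain
L = np.array([[0.5, 0.5], [0.5, 0.5]])   # i.i.d. uniform transitions
pi = np.array([0.5, 0.5])                # reference measure

d_kl = markov_f_divergence(M, L, pi, kl_generator)
d_chi2 = markov_f_divergence(M, L, pi, lambda t: (t - 1.0) ** 2)
print(f"KL: {d_kl:.6f}, chi2: {d_chi2:.6f}")
```

Under this assumed form, the divergence vanishes when $M = L$ and inherits nonnegativity from the row-wise classical divergences.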
5. Variational Representations and Computational Schemes
Convex duality underpins the use of f-divergences in statistical learning and inference. A generic variational form is
$$D_f(P \,\|\, Q) = \sup_{T}\, \big\{ \mathbb{E}_P[T(X)] - \mathbb{E}_Q[f^*(T(X))] \big\},$$
where the supremum runs over measurable functions $T$ and $f^*(u) = \sup_{t} \{ u t - f(t) \}$ is the Legendre–Fenchel dual. In generative adversarial learning (f-GAN), this underlies the saddle-point formulation for all f-divergences (Shannon, 2020, Terjék, 2021). Moreau–Yosida regularization introduces an optimization over approximating distributions with Wasserstein penalties, yielding adaptive Lipschitz control for GAN critics (Terjék, 2021).
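The variational form can be checked directly on a finite set, where a critic $T$ is just a vector. The sketch below assumes the representation $D_f(P \,\|\, Q) = \sup_T \{\mathbb{E}_P[T] - \mathbb{E}_Q[f^*(T)]\}$ with the KL generator $f(t) = t \log t$. Its Legendre–Fenchel dual is $f^*(u) = e^{u-1}$, and the maximizing critic is $T^*(x) = f'(p(x)/q(x)) = 1 + \log(p(x)/q(x))$. All names are illustrative.

```python
import numpy as np

def variational_objective(T, p, q, f_star):
    """E_P[T] - E_Q[f*(T)] for a critic T given as a vector on a finite set."""
    return float(np.sum(p * T) - np.sum(q * f_star(T)))

def f_star_kl(u):
    """Legendre-Fenchel dual of f(t) = t log t."""
    return np.exp(u - 1.0)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])

kl_exact = float(np.sum(p * np.log(p / q)))
T_opt = 1.0 + np.log(p / q)   # optimal critic f'(p/q)
kl_dual = variational_objective(T_opt, p, q, f_star_kl)

# Any other critic gives a lower bound on the divergence.
rng = np.random.default_rng(0)
T_random = T_opt + rng.normal(scale=0.5, size=p.shape)
lower_bound = variational_objective(T_random, p, q, f_star_kl)

print(f"exact: {kl_exact:.6f}, dual at optimum: {kl_dual:.6f}, perturbed critic: {lower_bound:.6f}")
```

In f-GAN-style training the vector `T` would be replaced by a parametrized critic network, and the supremum is only approximated, so the objective gives a lower bound on the divergence.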
In variational inference, f-divergence minimization sharpens or generalizes the classical ELBO and sandwich estimators, allowing for surrogate objectives with broader robustness and tail properties (Wan et al., 2020).
Sum-of-squares and spectral (quantum) relaxations yield efficient convex optimization schemes for learning under moment constraints, variational normalization, and operator inference, with complexity guarantees and practical performance on multivariate polynomial and Boolean models (Bach, 2022).
6. Applications: Privacy, Information Geometry, and Statistical Mechanics
f-divergences are key to fine-grained privacy guarantees—such as f-privacy, which implies probabilistic information privacy and differential privacy bounds via explicit δ(η,ε) relations for KL, TV, and χ² divergences (Wang et al., 2023). The natural hierarchy of these divergences, with χ²-privacy being the “strongest,” underlies tradeoffs in privacy-utility optimization.
In inference, minimizing power-law oriented divergences (Tsallis) instead of KL yields power-law posterior forms, exhibits partial Shore–Johnson axiom satisfaction, and models heavy-tailed distributions naturally—at the loss of additive system independence and a simple Pythagorean property except in the Shannon limit (Vachery et al., 2012).
In information geometry, the joint range methodology precisely characterizes attainable pairs of divergence values and underlies tight universal (two-point) extremality results linking different f-divergences (Harremoës et al., 2010).
In convex geometry, f-divergences unify all classical affine surface areas, functionally and for bodies, giving rise to affine invariant valuations and isoperimetric inequalities, with extensions to mixed and Orlicz contexts (Werner, 2012, Caglar et al., 2014).
7. Recent Extensions and Future Directions
Recent work defines broad generalizations: interpolations with integral probability metrics (as in $(f, \Gamma)$-divergences) that subsume both f-divergences and Wasserstein distances by introducing transport function classes (Birrell et al., 2020). This two-stage mass-redistribution/mass-transport picture enables robust learning and statistical estimation beyond absolute continuity, with improved concavity for adversarial optimization in large-scale settings.
Further advances include transport f-divergences in one dimension (Li, 22 Apr 2025), with open problems in extending to higher dimension and matrix-valued counterparts, the development of accurate estimators and variational functionals in stochastic process and random field contexts, and the explicit characterization of equality and reversibility conditions in quantum and Markovian frameworks.
These developments position f-divergence theory at the intersection of convex analysis, geometry, statistical learning, privacy, information theory, and quantum information science, driving both theoretical advances and practical algorithms.