Relative Information Entropy

Updated 28 May 2026

Relative information entropy, also known as the Kullback–Leibler divergence, is a measure of the difference between two probability distributions.
It underpins statistical inference and operational methods by linking information theory with hypothesis testing, coding, and signal processing, driving efficiency in model evaluation.
In physics and quantum mechanics, its extensions capture inhomogeneities and distinguish quantum states while addressing key challenges in cosmology and dynamical systems.

Relative information entropy, also known as the Kullback–Leibler (KL) divergence, is a fundamental functional quantifying the distinguishability between two probability distributions or, more generally, between two quantum states, density fields, or other types of measures. Originating in information theory, it serves as the central measure of information loss, surprise, or inefficiency when a true distribution is replaced by an alternative hypothesis. The functional exhibits deep connections with statistical inference, coding theory, thermodynamics, learning theory, quantum information, and modern mathematical descriptions of complexity, inhomogeneity, and structure formation.

1. Formal Definition and Core Properties

Let $P$ and $Q$ be two probability distributions on a finite or countable set $X$. The relative information entropy is defined as

$D(P\Vert Q) = \sum_{x\in X} P(x) \log \frac{P(x)}{Q(x)}.$

In the case of absolutely continuous distributions with densities $p(x)$ and $q(x)$,

$D(p\Vert q) = \int p(x) \log \frac{p(x)}{q(x)} dx.$

The KL divergence is always non-negative ($D(P\Vert Q)\ge 0$), vanishing if and only if $P=Q$ almost everywhere. It is asymmetric, since in general $D(P\Vert Q)\neq D(Q\Vert P)$, and it is unbounded above, diverging when the support of $P$ is not contained in that of $Q$ (0808.4111, Shore, 2013).
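As a minimal sketch of the discrete definition, the following Python function computes $D(P\Vert Q)$ in nats using the usual conventions $0\log(0/q)=0$ and $p\log(p/0)=\infty$. The function name `kl_divergence` and the example distributions are illustrative assumptions, not taken from the cited works.

```python
import math

def kl_divergence(p, q):
    """D(P || Q) in nats for two discrete distributions given as sequences."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if qi == 0.0:
            return math.inf  # P puts mass where Q has none
        total += pi * math.log(pi / qi)
    return total

# Illustrative distributions on a three-point set.
p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]

d_pq = kl_divergence(p, q)  # finite: the support of P lies inside that of Q
d_qp = kl_divergence(q, p)  # infinite: Q has mass on a point where P has none
d_pp = kl_divergence(p, p)  # zero: identical distributions
```

Here `d_pq` equals $\log 2$ while `d_qp` is infinite, which illustrates both the asymmetry and the divergence under a support mismatch.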

Crucial mathematical features include:

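Non-negativity (Gibbs’ inequality): $D(P\Vert Q)\ge 0$, with $D(P\Vert Q)=0$ iff $P=Q$.
Additivity for product distributions: $D(P_1\otimes P_2\Vert Q_1\otimes Q_2) = D(P_1\Vert Q_1) + D(P_2\Vert Q_2)$.
Chain rule: For joint distributions, $D(P_{XY}\Vert Q_{XY}) = D(P_X\Vert Q_X) + D(P_{Y|X}\Vert Q_{Y|X}\mid P_X)$.
Convexity: $D(P\Vert Q)$ is jointly convex in $(P,Q)$.
Data-processing inequality: $D(TP\Vert TQ)\le D(P\Vert Q)$ for any stochastic map $T$.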
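Three of these properties (additivity on products, joint convexity, and the data-processing inequality) can be checked numerically on small examples. The sketch below is only a sanity check on hypothetical distributions and a hypothetical column-stochastic matrix `T`; it does not prove anything, and all names are illustrative.

```python
import math

def kl(p, q):
    """D(P || Q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def product(a, b):
    """Product distribution on pairs, flattened in row-major order."""
    return [ai * bj for ai in a for bj in b]

def apply_channel(T, p):
    """Apply a stochastic map; T[y][x] = Prob(output y | input x)."""
    return [sum(T[y][x] * p[x] for x in range(len(p))) for y in range(len(T))]

p1, q1 = [0.7, 0.3], [0.4, 0.6]
p2, q2 = [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]
p3, q3 = [0.6, 0.1, 0.3], [0.2, 0.2, 0.6]

# Additivity: the gap should vanish up to rounding error.
additive_gap = kl(product(p1, p2), product(q1, q2)) - (kl(p1, q1) + kl(p2, q2))

# Joint convexity: mix the pairs (p2, q2) and (p3, q3) with weight lam.
lam = 0.3
p_mix = [lam * a + (1 - lam) * b for a, b in zip(p2, p3)]
q_mix = [lam * a + (1 - lam) * b for a, b in zip(q2, q3)]
convex_lhs = kl(p_mix, q_mix)
convex_rhs = lam * kl(p2, q2) + (1 - lam) * kl(p3, q3)

# Data processing: a 3-input, 2-output channel whose columns sum to 1.
T = [[0.9, 0.2, 0.1],
     [0.1, 0.8, 0.9]]
before = kl(p2, q2)
after = kl(apply_channel(T, p2), apply_channel(T, q2))
```

In this example `additive_gap` is zero up to floating-point error, `convex_lhs` does not exceed `convex_rhs`, and `after` does not exceed `before`.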

These properties are axiomatized: the minimal requirements are monotonicity under stochastic maps, additivity, and normalization. Every divergence satisfying these admits representation in terms of underlying entropic functionals (Gour et al., 2020).

2. Statistical and Operational Interpretations

Relative information entropy underpins statistical hypothesis testing, estimation, coding, and learning. In hypothesis testing, $D(P\Vert Q)$ governs the exponential rate at which the probability of error decays (e.g., in likelihood-ratio tests or Chernoff bounds). For model selection, $D(\hat{P}\Vert Q)$ quantifies the loss incurred by using the model $Q$ instead of the empirical distribution $\hat{P}$; it is the basis for information criteria (AIC, BIC, MDL) and for the alternating minimization performed by the EM algorithm (0808.4111).
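The link between relative entropy and likelihood-based model fitting can be illustrated with a Bernoulli model: the average negative log-likelihood equals the entropy of the empirical distribution plus the KL divergence from the empirical distribution to the model, so minimizing one minimizes the other. The data, the grid search, and the function names below are assumptions made for this sketch.

```python
import math

def kl_bernoulli(p_hat, theta):
    """D(Bern(p_hat) || Bern(theta)) in nats, with 0 * log 0 = 0."""
    total = 0.0
    for a, b in ((p_hat, theta), (1.0 - p_hat, 1.0 - theta)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total

def mean_neg_log_likelihood(data, theta):
    """Average negative log-likelihood of binary data under Bern(theta)."""
    return -sum(math.log(theta if x == 1 else 1.0 - theta) for x in data) / len(data)

data = [1, 0, 1, 1, 0, 1, 1, 0]        # hypothetical coin flips
p_hat = sum(data) / len(data)          # empirical frequency of ones
grid = [k / 1000 for k in range(1, 1000)]

# Both criteria select the same parameter on the grid.
theta_kl = min(grid, key=lambda t: kl_bernoulli(p_hat, t))
theta_ml = min(grid, key=lambda t: mean_neg_log_likelihood(data, t))

# Decomposition: NLL(theta) = H(p_hat) + D(p_hat || theta), checked at theta = 0.3.
entropy = -(p_hat * math.log(p_hat) + (1 - p_hat) * math.log(1 - p_hat))
decomposition_gap = mean_neg_log_likelihood(data, 0.3) - (entropy + kl_bernoulli(p_hat, 0.3))
```

Both minimizers coincide with the empirical frequency `p_hat`, and `decomposition_gap` vanishes up to rounding.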

In inference and learning theory, Bayesian updating can be recast as minimum relative entropy (MRE) projection. The MRE principle chooses the distribution within a constraint set $\mathcal{C}$ that is closest to a prior $Q$ in the sense of KL divergence, forming the foundation of the maximum entropy and maximum likelihood methodologies. Applications extend to expert system updating, pattern classification, spectral estimation, and beyond (Shore, 2013).
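For a single linear (mean) constraint, the minimum relative entropy projection has the well-known exponential-tilting form $P^*(x)\propto Q(x)e^{\lambda x}$, with $\lambda$ chosen so that the constraint holds. The following sketch solves for $\lambda$ by bisection on a hypothetical six-sided die with a uniform prior and a target mean of 4.5; the function names, bracket, and iteration count are illustrative assumptions.

```python
import math

def kl(p, q):
    """D(P || Q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def tilt(q, xs, lam):
    """Exponentially tilted distribution proportional to q(x) * exp(lam * x)."""
    w = [qi * math.exp(lam * x) for qi, x in zip(q, xs)]
    z = sum(w)
    return [wi / z for wi in w]

def mean(p, xs):
    return sum(pi * x for pi, x in zip(p, xs))

def mre_projection(q, xs, target, lo=-20.0, hi=20.0, iters=200):
    """Bisection on lam; the tilted mean is increasing in lam."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(tilt(q, xs, mid), xs) < target:
            lo = mid
        else:
            hi = mid
    return tilt(q, xs, 0.5 * (lo + hi))

faces = [1, 2, 3, 4, 5, 6]
prior = [1 / 6] * 6
p_star = mre_projection(prior, faces, target=4.5)

# Another distribution that satisfies the same constraint, for comparison.
other = [0.0, 0.0, 0.0, 0.5, 0.5, 0.0]
d_star = kl(p_star, prior)
d_other = kl(other, prior)
```

The projected distribution `p_star` meets the mean constraint and lies closer to the prior, in the KL sense, than the comparison distribution `other`.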

In signal processing, especially within the Poisson and Gaussian channels, KL divergence arises as the total excess estimation loss due to a model mismatch, explicitly linking information-theoretic and estimation-theoretic quantities through integral identities (Atar et al., 2010).

3. Relative Entropy in Physics and Cosmology

Relative entropy has been extensively adapted to describe inhomogeneity and information growth in dynamical and gravitational systems. In cosmology, it quantifies the deviation of a local inhomogeneous density field $\rho$ from its spatial average $\langle\rho\rangle_{\mathcal{D}}$ over a compact domain $\mathcal{D}$:

$S\{\rho\Vert\langle\rho\rangle_{\mathcal{D}}\} = \int_{\mathcal{D}} \rho \ln \frac{\rho}{\langle\rho\rangle_{\mathcal{D}}}\, d\mu.$

Up to normalization by the total mass in $\mathcal{D}$, this is the KL divergence between the density field and its spatial mean. In structure-forming universes, this functional is generally monotonically increasing in time, reflecting the irreversible generation of inhomogeneities and gravitational entropy. Its growth rate is tied to the non-commutativity of spatial averaging and evolution, and to the backreaction sourced by kinematic inhomogeneities (Morita et al., 2010, Li et al., 2012).
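A discretized version of this functional is easy to evaluate. The sketch below assumes the standard form $S=\int_{\mathcal{D}}\rho\ln(\rho/\langle\rho\rangle_{\mathcal{D}})\,d\mu$ and approximates the integral on a hypothetical grid of equal-volume cells in flat space; the function name and density values are illustrative and ignore the curved-space volume element used in the cosmological literature.

```python
import math

def relative_information_entropy(rho, cell_volume=1.0):
    """Sum over cells of rho * ln(rho / <rho>) * dV, with equal cell volumes."""
    volume = cell_volume * len(rho)
    mass = cell_volume * sum(rho)
    mean_density = mass / volume
    return cell_volume * sum(r * math.log(r / mean_density) for r in rho if r > 0.0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

homogeneous = [1.0, 1.0, 1.0, 1.0]
lumpy = [1.0, 2.0, 0.5, 0.5]           # same total mass, unevenly distributed

s_homogeneous = relative_information_entropy(homogeneous)
s_lumpy = relative_information_entropy(lumpy)

# Relation to KL: normalize the mass per cell and compare with the uniform law.
mass = sum(lumpy)
p = [r / mass for r in lumpy]
uniform = [1.0 / len(lumpy)] * len(lumpy)
s_from_kl = mass * kl(p, uniform)
```

The homogeneous field gives zero, the inhomogeneous one gives a positive value, and `s_from_kl` reproduces `s_lumpy`, showing that the functional is the total mass times the KL divergence of the normalized density from the uniform distribution.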

For comparing spacetimes, a relative volume entropy is defined between the normalized volume densities of two metrics. It gives a lower bound on the number of bits required to specify one geometry relative to another, reflecting descriptive complexity and inhomogeneity (Akerblom et al., 2010).

Non-additive generalizations, such as the Tsallis and Rényi relative entropies, are introduced to account for gravitational coupling between causally connected domains, with the Tsallis divergence parameterizing mutual information induced by long-range interactions and the Rényi divergence isolating additive, independent information content in a domain (Czinner et al., 2016).
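To make the additive versus non-additive distinction concrete, the sketch below uses the standard textbook forms $D_\alpha(P\Vert Q)=\frac{1}{\alpha-1}\log\sum_x P(x)^\alpha Q(x)^{1-\alpha}$ (Rényi) and $D^{T}_{t}(P\Vert Q)=\frac{1}{t-1}\left(\sum_x P(x)^{t}Q(x)^{1-t}-1\right)$ (Tsallis). These may differ in normalization or parameterization from the forms used in the cited cosmological work, and the example distributions are hypothetical.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def _power_sum(p, q, a):
    """Sum of p^a * q^(1 - a); assumes strictly positive entries."""
    return sum(pi ** a * qi ** (1.0 - a) for pi, qi in zip(p, q))

def renyi(p, q, alpha):
    return math.log(_power_sum(p, q, alpha)) / (alpha - 1.0)

def tsallis(p, q, t):
    return (_power_sum(p, q, t) - 1.0) / (t - 1.0)

def product(a, b):
    return [ai * bj for ai in a for bj in b]

p1, q1 = [0.7, 0.3], [0.4, 0.6]
p2, q2 = [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]

# Both families approach the KL divergence as the parameter tends to 1.
near_one = 1.0 + 1e-6
renyi_limit_gap = renyi(p1, q1, near_one) - kl(p1, q1)
tsallis_limit_gap = tsallis(p1, q1, near_one) - kl(p1, q1)

# On product distributions Renyi is additive, Tsallis is pseudo-additive.
t = 2.0
p12, q12 = product(p1, p2), product(q1, q2)
renyi_gap = renyi(p12, q12, t) - (renyi(p1, q1, t) + renyi(p2, q2, t))
d1, d2 = tsallis(p1, q1, t), tsallis(p2, q2, t)
tsallis_cross_term = tsallis(p12, q12, t) - (d1 + d2)   # equals (t - 1) * d1 * d2
```

The Tsallis cross term $(t-1)D_1D_2$ is the non-additive coupling between subsystems, while the Rényi divergence of the product splits exactly into the sum of its parts.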

4. Quantum and Generalized Relative Entropy

In quantum information theory, the Umegaki quantum relative entropy

$S(\rho\Vert\sigma) = \mathrm{Tr}\left[\rho\left(\log\rho - \log\sigma\right)\right]$

measures the distinguishability of two density operators. It obeys joint convexity and monotonicity under completely positive trace-preserving (CPTP) maps, from which strong subadditivity of the von Neumann entropy follows; together these properties form the backbone of operational quantum information theory.
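For a single qubit the quantum relative entropy can be evaluated in closed form from Bloch vectors, which avoids matrix logarithms. The sketch below assumes the standard Umegaki definition $S(\rho\Vert\sigma)=\mathrm{Tr}[\rho(\log\rho-\log\sigma)]$, states written as $\rho=(I+\vec r\cdot\vec\sigma)/2$ and $\sigma=(I+\vec s\cdot\vec\sigma)/2$, and a full-rank $\sigma$ ($|\vec s|<1$). The function name and test states are illustrative.

```python
import math

def _xlogx(x):
    """x * log(x) with the convention 0 * log 0 = 0."""
    return 0.0 if x <= 0.0 else x * math.log(x)

def qubit_relative_entropy(r, s):
    """S(rho || sigma) in nats for Bloch vectors r (|r| <= 1) and s (|s| < 1)."""
    rn = math.sqrt(sum(c * c for c in r))
    sn = math.sqrt(sum(c * c for c in s))
    # Tr[rho log rho] from the eigenvalues (1 +/- |r|) / 2 of rho.
    tr_rho_log_rho = _xlogx((1.0 + rn) / 2.0) + _xlogx((1.0 - rn) / 2.0)
    # log sigma = a * I + b * (s_hat . sigma), from the eigenvalues (1 +/- |s|) / 2.
    log_plus = math.log((1.0 + sn) / 2.0)
    log_minus = math.log((1.0 - sn) / 2.0)
    a = 0.5 * (log_plus + log_minus)
    b = 0.5 * (log_plus - log_minus)
    # Tr[rho (s_hat . sigma)] = r . s_hat; the term drops out when s = 0.
    overlap = sum(x * y for x, y in zip(r, s)) / sn if sn > 0.0 else 0.0
    tr_rho_log_sigma = a + b * overlap
    return tr_rho_log_rho - tr_rho_log_sigma

# Commuting (diagonal) states reduce to the classical KL divergence.
commuting = qubit_relative_entropy((0.0, 0.0, 0.4), (0.0, 0.0, -0.2))
classical = 0.7 * math.log(0.7 / 0.4) + 0.3 * math.log(0.3 / 0.6)

# Non-commuting states, before and after a depolarizing channel that
# shrinks every Bloch vector by the same factor (a CPTP map).
r, s = (0.6, 0.0, 0.0), (0.0, 0.0, 0.3)
before = qubit_relative_entropy(r, s)
after = qubit_relative_entropy(tuple(0.5 * c for c in r), tuple(0.5 * c for c in s))
```

The commuting case matches the classical value, and the value after the depolarizing channel does not exceed the value before it, consistent with monotonicity under CPTP maps.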

Monotonicity is central for entropy inequalities, and sharp refinements using recovery maps (rotated Petz map) offer quantitative estimates of how well information lost under CPTP maps can be recovered. The data-processing inequality for quantum relative entropy is intricately linked to quantum error correction and thermodynamics (Berta et al., 2014).

Metric generalizations have been developed to address limitations of the KL divergence: symmetrized, finite-range, and truly metric divergences extend the information-theoretic toolkit for clustering and data analysis. The generalized relative entropy constructed via mixture-regularization and symmetrization satisfies the triangle inequality, has a finite, adjustable range, and is symmetric, providing a practical alternative in machine learning and information geometry (Liu et al., 2017).
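The specific construction of Liu et al. is not reproduced here. As a stand-in that shares the same ingredients (regularization by a mixture, followed by symmetrization), the sketch below uses the well-known Jensen–Shannon divergence, which is symmetric, bounded by $\log 2$, and whose square root satisfies the triangle inequality. All names and example distributions are illustrative.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats: average KL to the midpoint mixture."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_distance(p, q):
    """Square root of the JS divergence; clamps tiny negative rounding errors."""
    return math.sqrt(max(0.0, js_divergence(p, q)))

a = [0.7, 0.2, 0.1]
b = [0.1, 0.8, 0.1]
c = [0.3, 0.3, 0.4]

# Disjoint supports: KL would be infinite, JS stays finite at log 2.
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])

symmetric_gap = js_divergence(a, b) - js_divergence(b, a)
triangle_slack = js_distance(a, c) + js_distance(c, b) - js_distance(a, b)
```

Mixing with the midpoint keeps every logarithm finite, which is why the range is bounded even when the supports do not overlap.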

5. Relative Entropy in Dynamical, Biological, and Complex Systems

KL divergence serves as a Lyapunov functional for a wide variety of dynamical systems:

Markov processes: $D(p(t)\Vert \pi)$ decreases monotonically as $p(t)$ relaxes towards the stationary distribution $\pi$.
Replicator and Lotka–Volterra equations: $D(q\Vert p(t))$ decreases provided $q$ is an evolutionarily stable state.
Chemical reaction networks: generalized relative entropy with appropriate convexity structure decreases towards complex-balanced equilibria, paralleling thermodynamic free-energy dissipation (Baez et al., 2015).

This behavior underlies versions of the second law, information gain in populations, and the emergence of equilibrium.
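The Markov-process case can be illustrated with a small discrete-time chain: the divergence from the current distribution to the stationary distribution never increases from one step to the next. The transition matrix, the initial state, and the use of power iteration to approximate the stationary distribution are assumptions of this sketch.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def step(T, p):
    """One step of the chain; T[y][x] = Prob(next = y | current = x)."""
    n = len(p)
    return [sum(T[y][x] * p[x] for x in range(n)) for y in range(n)]

# Hypothetical column-stochastic transition matrix with all entries positive.
T = [[0.8, 0.2, 0.3],
     [0.1, 0.6, 0.3],
     [0.1, 0.2, 0.4]]

# Approximate the stationary distribution by power iteration.
stationary = [1 / 3, 1 / 3, 1 / 3]
for _ in range(2000):
    stationary = step(T, stationary)

# Track D(p_t || stationary) starting from a point mass on the third state.
p = [0.0, 0.0, 1.0]
trace = []
for _ in range(15):
    trace.append(kl(p, stationary))
    p = step(T, p)
```

The list `trace` is non-increasing, which is the data-processing inequality applied to a map that leaves the stationary distribution fixed.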

6. Applications, Extensions, and Limitations

Relative entropy informs a diverse array of quantitative diagnostics:

Hierarchical and statistical model selection: minimization of relative entropy underpins penalized likelihood and Bayesian model comparison.
Information flow in physics and deep learning: KL divergence tracks information loss under renormalization group flow in lattice systems and abstraction in neural networks, showing monotonic increases in distinguishability but failing to detect critical points and phase transitions. These limitations motivate higher-order and more local information-theoretic diagnostics (Erdmenger et al., 2021).
Quantum measurements and incompatibility: KL divergence-based uncertainty relations provide operational measurement indeterminacy bounds, defining the entropic incompatibility degree in joint measurement scenarios (Barchielli et al., 2016).

KL divergence and its variants are not symmetric, lack the triangle inequality, and are unbounded. Metric generalizations overcome these limitations, at the cost of forfeiting some operational interpretations. Non-additive generalizations accommodate mutual gravitational information in cosmology, but reduce to standard KL divergence in appropriate limits (Czinner et al., 2016, Liu et al., 2017).

7. Axiomatic and Structural Foundations

A rigorous axiomatic framework demonstrates that monotonicity under data-processing, additivity, and normalization fully characterize the class of relative entropies. There is a canonical one-to-one correspondence between entropy functions and corresponding relative entropies, established by bijective constructions involving majorization and trumping (Gour et al., 2020). All relative entropies are bounded by the order-0 and order-$\infty$ Rényi divergences, with faithfulness equivalent to strict positivity, and their operational ordering captured by catalytic transformations.
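The sandwich between the two extreme Rényi divergences can be checked for the KL divergence on a small example. The sketch below uses the standard classical forms $D_0(P\Vert Q)=-\log\sum_{x:P(x)>0}Q(x)$ and $D_\infty(P\Vert Q)=\log\max_x P(x)/Q(x)$; the function names and distributions are illustrative assumptions.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def renyi_order_zero(p, q):
    """D_0: minus the log of the Q-mass carried by the support of P."""
    return -math.log(sum(qi for pi, qi in zip(p, q) if pi > 0.0))

def renyi_order_infinity(p, q):
    """D_infinity: log of the largest likelihood ratio P(x) / Q(x)."""
    return math.log(max(pi / qi for pi, qi in zip(p, q) if pi > 0.0))

p = [0.6, 0.4, 0.0]
q = [0.3, 0.3, 0.4]

d_zero = renyi_order_zero(p, q)          # -log 0.6
d_kl = kl(p, q)
d_infinity = renyi_order_infinity(p, q)  # log 2
```

For these distributions the three values are strictly ordered, with the KL divergence lying between the order-0 and order-$\infty$ quantities.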

Summary Table: Core Properties Across Settings

Property	Classical KL Divergence	Quantum Relative Entropy	Generalized/Metric Versions
Non-negativity	Yes	Yes	Yes
Symmetry	No	No	Yes (generalized, e.g., the metric construction of Liu et al., 2017)
Triangle Inequality	No	No	Yes (generalized, e.g., the metric construction of Liu et al., 2017)
Additivity	Yes (products)	Yes (tensor products)	Yes (by construction)
Data-processing/Monotonicity	Yes	Yes (CPTP)	Yes (if fitting axiomatic framework)
Faithfulness	Yes, except for degenerate cases	Yes	Yes (by strict positivity construction)

Relative information entropy and its extensions provide a rich, mathematically rigorous architecture for quantifying distinguishability, complexity, and informational dynamics across classical, quantum, statistical, and geometric frameworks, with operational significance tightly coupled to the properties dictated by the underlying axiomatic principles (0808.4111, Shore, 2013, Morita et al., 2010, Berta et al., 2014, Gour et al., 2020, Liu et al., 2017, Baez et al., 2015).