
Donsker–Varadhan Dual

Updated 5 March 2026
  • Donsker–Varadhan Dual is a variational representation of relative entropy that underpins large deviation theory and modern inference techniques.
  • It provides a dual formulation linking relative entropy, principal eigenvalues, and rate functionals in Markov processes, diffusions, and controlled settings.
  • The dual is widely applied in density estimation, mutual information maximization, and robust optimization, offering practical insights for algorithmic implementations.

The Donsker–Varadhan (DV) dual, or Donsker–Varadhan variational formula, is a foundational result in the theory of large deviations, statistical mechanics, and information theory. It provides a dual (variational) representation of relative entropy (Kullback–Leibler divergence) and, by extension, rate functionals, principal eigenvalues, and related quantities for Markov processes, diffusions, and beyond. The DV dual underpins modern approaches to density estimation, variational inference, mutual information estimation, control, and more, with a range of both classical and contemporary generalizations.

1. The Donsker–Varadhan Variational Formula for Relative Entropy

For probability measures $\mathbb{P}$ and $\mathbb{Q}$ on a common measurable space $\mathcal{X}$, with densities $p(x)$ and $q(x)$, the Donsker–Varadhan variational representation of the Kullback–Leibler divergence is

$$D_{\mathrm{KL}}(\mathbb{P}\,\Vert\,\mathbb{Q}) = \sup_{T:\,\mathcal{X}\to\mathbb{R}} \Bigl\{\, \mathbb{E}_{x\sim\mathbb{P}}[T(x)] - \log \mathbb{E}_{x\sim\mathbb{Q}}\bigl[e^{T(x)}\bigr] \,\Bigr\},$$

where the supremum runs over measurable functions $T$ satisfying $\mathbb{E}_{\mathbb{Q}}[e^{T(x)}] < \infty$ (Park et al., 2021). This representation is tight, with equality achieved by the optimal critic

$$T^*(x) = \log\frac{p(x)}{q(x)} + \log \mathbb{E}_{\mathbb{Q}}\bigl[e^{T^*(x)}\bigr],$$

which recovers $\log p(x)$ up to an additive constant when $q$ is uniform. In finite spaces, this formulation reduces to the classic characterization of relative entropy and appears as the variational rate function for Markov occupation measures (Renger, 2024).
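The identity is easy to check numerically on a finite space, where the critic $T$ is just a real vector indexed by states. The sketch below (a minimal construction for illustration, not code from the cited papers) compares the direct KL computation with the optimized DV objective and confirms that the recovered critic equals $\log(p/q)$ up to a constant shift:

```python
# Sanity check of the DV formula on a finite space (illustrative sketch).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
p = rng.random(5); p /= p.sum()            # target measure P
q = rng.random(5); q /= q.sum()            # reference measure Q

kl_direct = np.sum(p * np.log(p / q))

def neg_dv(T):
    # negative of E_P[T] - log E_Q[e^T]; T is a vector, one entry per state
    return -(p @ T - np.log(q @ np.exp(T)))

res = minimize(neg_dv, x0=np.zeros(5))     # concave problem, BFGS suffices
print(f"direct KL: {kl_direct:.6f}")
print(f"DV sup   : {-res.fun:.6f}")        # agrees to optimizer tolerance
print("critic - log(p/q):", np.round(res.x - np.log(p / q), 4))  # constant
```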

2. Duality in Large Deviations: Markov Processes, Diffusions, and Beyond

The DV dual is central in large-deviation theory for Markov processes, both at the level of occupation measures (level-2) and the full empirical process (level-3). For an ergodic continuous-time Markov process on a finite state space with generator $Q$, the large-deviation rate functional for the empirical measure $\mu_T$ is

$$I(\rho) = \sup_{u:\,X\to(0,\infty)} \left[ -\sum_{x} \rho_x\, \frac{(Q u)_x}{u_x} \right],$$

with the supremum over strictly positive functions $u$ (Renger, 2024). In the diffusion setting, the rate functional admits an analogous DV variational formula,

$$I_\epsilon(\mu) = \sup_{f\in C_b(\mathbb{R}^n)} \left\{ \int f\,d\mu - \Lambda_\epsilon(f) \right\},$$

where $\Lambda_\epsilon(f)$ is the principal eigenvalue of the corresponding Feynman–Kac semigroup (Bertini et al., 2022). For processes with degenerate or absorbing states, the dual representation persists, although the zero-level set of the rate function widens (Basile et al., 2013).
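The finite-state rate functional above can be evaluated directly. The sketch below (an arbitrary three-state generator chosen for illustration, not an example from the cited papers) substitutes $u = e^v$ to enforce positivity and confirms that $I$ vanishes at the invariant measure:

```python
# Level-2 DV rate functional for a finite-state generator (illustrative).
import numpy as np
import scipy.linalg as sla
from scipy.optimize import minimize

Q = np.array([[-1.0,  0.7,  0.3],
              [ 0.5, -0.9,  0.4],
              [ 0.2,  0.8, -1.0]])         # generator: rows sum to zero

def rate(rho):
    # I(rho) = sup_{u>0} [ -sum_x rho_x (Qu)_x / u_x ], with u = exp(v)
    neg = lambda v: rho @ ((Q @ np.exp(v)) / np.exp(v))
    return -minimize(neg, np.zeros(3)).fun

# invariant measure pi solves pi Q = 0 (left eigenvector at eigenvalue 0)
w, V = sla.eig(Q.T)
pi = np.real(V[:, np.argmin(np.abs(w))]); pi /= pi.sum()

print(f"I(pi)            = {rate(pi):.2e}")                         # ~ 0
print(f"I([0.6,0.3,0.1]) = {rate(np.array([0.6, 0.3, 0.1])):.4f}")  # > 0
```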

The dual formulation is also pivotal in controlled Markov processes and risk-sensitive control, where the principal eigenvalue of a dynamic programming operator admits a controlled DV variational formula (Arapostathis et al., 2019):

$$\log \lambda = \sup_{\eta\in\mathcal{G}} \left\{ \int r(x,u,y)\,d\eta - \int D\bigl(\eta_2(\cdot|x,u) \,\Vert\, p(\cdot|x,u)\bigr)\,d\eta_0\,d\eta_1 \right\},$$

where the supremum runs over ergodic occupation measures $\eta$.
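In the uncontrolled finite-state case, this eigenvalue characterization reduces to the classical Collatz–Wielandt identity $\lambda = \sup_{u>0} \min_x (Au)_x / u_x$ for a positive matrix $A$, the multiplicative counterpart of the DV supremum. The sketch below illustrates only that special case (not the controlled setting of Arapostathis et al.), checking it against a direct eigenvalue computation:

```python
# Collatz-Wielandt check of the principal-eigenvalue variational formula.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
A = rng.random((4, 4)) + 0.1               # entrywise-positive matrix

lam_perron = np.max(np.real(np.linalg.eigvals(A)))   # Perron eigenvalue

def neg_cw(v):
    u = np.exp(v)                          # u > 0 via exponential substitution
    return -np.min((A @ u) / u)            # maximize the minimum ratio

res = minimize(neg_cw, np.zeros(4), method="Nelder-Mead")  # nonsmooth objective
print(f"Perron eigenvalue: {lam_perron:.6f}")
print(f"Collatz-Wielandt : {-res.fun:.6f}")   # agrees to optimizer tolerance
```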

3. Applications in Density Estimation, Mutual Information, and Optimization

The DV dual has found transformative application in density estimation. If $\mathbb{Q}$ is a uniform measure, the optimal critic $T^*$, parameterized by a neural network, recovers $\log p(x)$ up to an additive constant. This enables deep data density estimation via stochastic optimization, as formalized in Deep Data Density Estimation (DDDE), where the loss is

$$L_{\mathrm{DV}}(\theta) = \mathbb{E}_{\mathbb{P}}\bigl[\log f_\theta(x)\bigr] - \log \mathbb{E}_{\mathbb{Q}}\bigl[f_\theta(x)\bigr]$$

for $f_\theta(x) = e^{T_\theta(x)}$ (Park et al., 2021).
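A PyTorch sketch of this objective follows; the network width, batch sizes, and toy dataset are illustrative assumptions rather than the DDDE configuration, and the loss works in log space via logsumexp instead of forming $f_\theta = e^{T_\theta}$ directly, a common stability choice:

```python
# DV density-estimation loss, trained against a uniform proposal (sketch).
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 2
T = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(T.parameters(), lr=1e-3)

def dv_loss(x_p, x_q):
    # negative DV objective: -(E_P[T] - log E_Q[e^T]); logsumexp for stability
    t_p = T(x_p).squeeze(-1)
    t_q = T(x_q).squeeze(-1)
    log_mean_exp = torch.logsumexp(t_q, dim=0) - math.log(t_q.numel())
    return -(t_p.mean() - log_mean_exp)

data = 0.25 + 0.5 * torch.rand(10_000, d)  # toy P supported on a sub-cube
for step in range(2_000):
    x_p = data[torch.randint(len(data), (256,))]
    x_q = torch.rand(256, d)               # Q = Uniform([0,1]^2)
    loss = dv_loss(x_p, x_q)
    opt.zero_grad(); loss.backward(); opt.step()

# after training, exp(T(x)) approximates p(x) up to a multiplicative constant
```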

In information theory and representation learning, the DV dual underlies neural mutual information estimators. Mutual information admits the lower bound

$$I(X;Y) \geq \sup_{T} \left\{ \mathbb{E}_{P_{XY}}[T(x,y)] - \log \mathbb{E}_{P_X P_Y}\bigl[e^{T(x,y)}\bigr] \right\},$$

which is the basis of the MINE estimator (Lv et al., 27 Jun 2025). This perspective unifies contrastive learning, RLHF, and DPO as instances of variational MI maximization.
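A MINE-style sketch of this bound follows, tested on a correlated Gaussian pair whose mutual information is known in closed form; the critic architecture and the within-batch shuffle used to approximate draws from $P_X P_Y$ are standard implementation choices, not prescriptions of the cited paper:

```python
# DV lower bound on mutual information with a neural critic (sketch).
import math
import torch
import torch.nn as nn

class Critic(nn.Module):
    def __init__(self, dx, dy, hidden=100):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dx + dy, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def dv_mi_bound(T, x, y):
    t_joint = T(x, y)                           # samples from P_XY
    t_marg = T(x, y[torch.randperm(len(y))])    # shuffled y: approx P_X P_Y
    return t_joint.mean() - (torch.logsumexp(t_marg, 0) - math.log(len(x)))

torch.manual_seed(0)
rho, T = 0.8, Critic(1, 1)
opt = torch.optim.Adam(T.parameters(), lr=1e-3)
for step in range(3_000):
    x = torch.randn(512, 1)
    y = rho * x + math.sqrt(1 - rho**2) * torch.randn(512, 1)
    loss = -dv_mi_bound(T, x, y)
    opt.zero_grad(); loss.backward(); opt.step()

# true value for this Gaussian pair: -0.5*log(1 - rho^2) ~ 0.51 nats
print(f"DV bound after training: {dv_mi_bound(T, x, y).item():.3f}")
```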

In distributionally robust optimization, the DV dual is used to derive tractable surrogates for the inner maximization over adversarial data distributions, reducing min–sup DRO objectives to log-sum-exp losses that are smooth and admit straightforward gradients (Shao et al., 14 Jan 2026).
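For the standard KL-ball formulation (which may differ in detail from the cited paper's setup), the Gibbs variational principle behind the DV dual gives $\sup_{\mathrm{KL}(Q\Vert P)\le\rho} \mathbb{E}_Q[\ell] = \inf_{\lambda>0}\{\lambda\rho + \lambda\log\mathbb{E}_P[e^{\ell/\lambda}]\}$, exactly the advertised log-sum-exp surrogate. The sketch below (made-up losses and radius) verifies the duality against the exponentially tilted worst-case distribution:

```python
# KL-ball DRO duality via the log-sum-exp surrogate (illustrative sketch).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
losses = rng.random(1000)                  # per-sample losses under P
rho = 0.1                                  # KL radius

def dual(lam):
    # lam*rho + lam*log E_P[exp(loss/lam)], computed stably
    z = losses / lam
    lse = np.log(np.mean(np.exp(z - z.max()))) + z.max()
    return lam * rho + lam * lse

res = minimize_scalar(dual, bounds=(1e-3, 10.0), method="bounded")
lam_star = res.x

# worst-case distribution: exponential tilt Q* proportional to exp(loss/lam*)
w = np.exp(losses / lam_star); w /= w.sum()
print(f"dual (log-sum-exp) value: {res.fun:.4f}")
print(f"E_Q*[loss]  (primal)    : {w @ losses:.4f}")       # agree at optimum
print(f"KL(Q*||P)               : {w @ np.log(w * w.size):.4f}  (~ rho)")
```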

4. Generalizations: Maxitive and Non-Additive Extensions

Recent developments extend the DV dual to the maxitive (possibilistic) setting relevant for imprecise probability and possibility theory. Here, integration and expectation are replaced by suprema; KL divergence is replaced by max-relative entropy,

$$D_{\max}(g \,\Vert\, f) = \sup_{\theta}\,\log\frac{g(\theta)}{f(\theta)},$$

and the DV theorem is replaced by the maxitive Donsker–Varadhan theorem,

$$\log Z_{\max} = \sup_{g\in\mathcal{F}}\, \inf_{\theta} \left\{ -\ell(\theta) - \log\frac{g(\theta)}{\pi(\theta)} \right\},$$

with dual attainment properties governed by order-duality rather than convex analytic duality. Canonical solutions correspond to generalized Gibbs posteriors, and the variational structure supports extensions to exponential families, Bregman divergences, and conjugacy in the possibilistic regime (Singh et al., 26 Nov 2025).
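The sup–inf identity can be verified on a grid: for the generalized Gibbs possibility g*, obtained by sup-normalizing $\pi(\theta)e^{-\ell(\theta)}$, the inner infimum is constant in $\theta$ and equals $\log Z_{\max}$, while other admissible $g$ yield a smaller infimum. The loss and possibility distribution below are toy choices for illustration, not the construction of Singh et al.:

```python
# Grid check of the maxitive Donsker-Varadhan identity (toy illustration).
import numpy as np

theta = np.linspace(-3, 3, 601)
ell = 0.5 * (theta - 1.0) ** 2             # loss ell(theta)
log_pi = -np.abs(theta)                    # log-possibility, sup-normalized

# maxitive normalizer: log Z_max = sup_theta { -ell(theta) + log pi(theta) }
log_Z = np.max(-ell + log_pi)

# generalized Gibbs possibility: log g* = log pi - ell - log Z_max, so that
# -ell - log(g*/pi) = log Z_max identically in theta
log_g = log_pi - ell - log_Z
inner = -ell - (log_g - log_pi)
print(f"log Z_max      : {log_Z:.6f}")
print(f"inf over theta : {inner.min():.6f}  (constant, attains the sup)")

# for comparison, g = pi (also sup-normalized) gives a smaller inner infimum
print(f"inf for g = pi : {(-ell).min():.6f}  (< log Z_max)")
```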

5. Algorithmic Implementations and Limitations

In practical settings, direct optimization over all measurable functions $T$ is infeasible; parameterizations (e.g., neural networks) are used instead. In density estimation, practicalities include positive output constraints (e.g., ELU plus an offset in $f_\theta$), exponential moving averages to stabilize log-moment estimates (sketched below), and minibatch stochastic gradients. Numerical and statistical efficiency may be limited by the "curse of dimensionality" if the proposal $\mathbb{Q}$ is uniform in high dimension, motivating adaptive proposals or learned sampler networks (Park et al., 2021). In sample-based optimization (e.g., mutual information neural estimation), the log-sum-exp structure in the DV bound may induce high variance; alternative surrogates such as Jensen–Shannon estimators have been proposed for better numerical stability and lower variance, as in the Mutual Information Optimization (MIO) framework (Lv et al., 27 Jun 2025).
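The moving-average stabilization mentioned above can be implemented with a straight-through construction: the forward value is the ordinary batch estimate of $\log\mathbb{E}_{\mathbb{Q}}[e^T]$, while the backward pass divides by an exponential moving average of $\mathbb{E}_{\mathbb{Q}}[e^T]$ rather than the noisy batch mean. The decay constant and this particular formulation are common implementation choices, not specifics of the cited papers:

```python
# EMA-stabilized gradient for the log-moment term (sketch).
import torch

class EMALogMeanExp:
    """log E_Q[e^T]: exact forward value, EMA-denominator gradient."""
    def __init__(self, decay=0.99):
        self.decay, self.ema = decay, None
    def __call__(self, t_q):
        m = t_q.exp().mean()               # batch estimate of E_Q[e^T]
        self.ema = m.detach() if self.ema is None else \
            self.decay * self.ema + (1 - self.decay) * m.detach()
        # value equals log(m); gradient equals grad(m) / EMA
        return m / self.ema - (m / self.ema).detach() + m.detach().log()

lme = EMALogMeanExp()
t_q = torch.randn(256, requires_grad=True)
loss = lme(t_q)        # drop-in for the log-moment term in a DV objective
loss.backward()
```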

6. Theoretical and Structural Implications

The DV dual exhibits broad structural parallels across probabilistic, maxitive, and controlled settings:

  • In probabilistic contexts, it emerges from convex duality and the Gibbs variational principle.
  • In the maxitive/possibilistic regime, an order-theoretic duality replaces expectations and integrals with sup–inf operations, and relative entropy with the max-relative divergence.
  • In controlled processes and risk-sensitive control, the DV dual variationally characterizes principal eigenvalues and optimal growth rates, linking to Collatz–Wielandt identities, occupational measures, and multiplicative dynamic programming (Arapostathis et al., 2019).
  • Large deviation principles for empirical measures, empirical flows, and empirical currents in particle systems, diffusions, and nonequilibrium statistical mechanics rely fundamentally on the DV dual (Bertini et al., 2021, Bertini et al., 2022, Zhao, 17 Jun 2025, Basile et al., 2013).

Further open questions concern the extension of the DV framework to non-additive measures in infinite-dimensional spaces, scalable optimization algorithms in maxitive settings, and hybrid divergences interpolating between additive and sup-based paradigms (Singh et al., 26 Nov 2025). The DV dual thus provides a unifying analytic and algorithmic backbone for a spectrum of advances in statistical learning, modern inference, and stochastic analysis.
