
Mass-Covering f-Divergences

Updated 1 October 2025
  • Mass-Covering f-Divergences are information functionals defined via local geometric deformations using an optimal transport map that aligns probability densities.
  • They combine properties of classical f-divergences with optimal transport to deliver invariance, convexity, and a powerful variational duality for robust statistical inference.
  • In generative modeling, they promote full coverage of the target's mass by penalizing local stretching of the transport map, thereby mitigating mode collapse in complex data distributions.

Transport f-divergences are a recently introduced class of information functionals that blend concepts from classical f-divergences and optimal transport to measure discrepancies between probability densities on one-dimensional sample spaces (Li, 22 Apr 2025). Unlike standard f-divergences, which compare two distributions via pointwise density ratios, transport f-divergences compare local geometric deformations induced by the optimal transport map that aligns one distribution with another. This construction endows transport f-divergences with properties particularly suited for mass-covering applications in statistics, machine learning, and generative modeling.

1. Mathematical Formulation of Transport f-Divergences

Consider smooth, strictly positive probability densities $p$ and $q$ defined on $\Omega \subseteq \mathbb{R}$. Let $T: \Omega \to \Omega$ be the unique monotone increasing transport map satisfying the pushforward condition $T_\# q = p$. Explicitly, $T$ is given by

T(x) = Q_p(F_q(x)),

where $F_q$ is the cumulative distribution function (CDF) of $q$ and $Q_p$ is the quantile function of $p$. The fundamental Monge–Ampère relation holds: $p(T(x)) \cdot T'(x) = q(x)$.

Given a convex function $f: \mathbb{R}_+ \to \mathbb{R}_+$ with $f(1) = 0$, the transport f-divergence is defined by

D_{T,f}(p \| q) = \int_{\Omega} f\left(\frac{q(x)}{p(T(x))}\right) q(x) \, dx.

Using the Monge–Ampère condition, this can be recast as

D_{T,f}(p \| q) = \int_{\Omega} f\big(T'(x)\big)\, q(x) \, dx,

or, equivalently, in terms of quantile densities,

D_{T,f}(p \| q) = \int_0^1 f\left( \frac{Q_p'(u)}{Q_q'(u)} \right) du.

The derivative $T'(x)$ quantifies the local stretching or contraction needed to map $q(x)$ to $p(T(x))$, and thus encapsulates the geometric “effort” required to align the two distributions.
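To make these formulas concrete, here is a minimal numerical sketch (not from the paper; the helper name and the choice $f(t) = t \log t - t + 1$ are purely illustrative) that evaluates the quantile-density form of $D_{T,f}$ by finite differences and checks it against the Gaussian case, where $T$ is affine with constant slope $\sigma_p/\sigma_q$ and the divergence reduces to $f(\sigma_p/\sigma_q)$:

```python
import numpy as np
from scipy.stats import norm

def transport_f_divergence(Q_p, Q_q, f, n=20000, eps=1e-4):
    """Quantile-density form: D_{T,f}(p||q) = int_0^1 f(Q_p'(u) / Q_q'(u)) du.

    Quantile derivatives are approximated by central finite differences, and
    the integral is truncated to [eps, 1 - eps] to avoid tail singularities.
    """
    u = np.linspace(eps, 1.0 - eps, n)
    du = 1e-6
    # The common factor 2*du cancels in the ratio of the two differences.
    ratio = (Q_p(u + du) - Q_p(u - du)) / (Q_q(u + du) - Q_q(u - du))
    return np.mean(f(ratio)) * (u[-1] - u[0])

def f(t):
    # Convex with f(1) = 0; also f'(1) = 0 and f''(1) = 1 (used in Section 4).
    return t * np.log(t) - t + 1.0

# Gaussian check: T(x) = mu_p + (s_p/s_q)(x - mu_q), so T' = s_p/s_q is
# constant and D_{T,f}(p||q) = f(s_p/s_q) exactly.
mu_p, s_p, mu_q, s_q = 0.0, 2.0, 1.0, 1.0
Q_p = lambda u: norm.ppf(u, loc=mu_p, scale=s_p)
Q_q = lambda u: norm.ppf(u, loc=mu_q, scale=s_q)

print(transport_f_divergence(Q_p, Q_q, f))  # ~ f(2) = 2*log(2) - 1 ~ 0.386
print(f(s_p / s_q))                          # closed form for comparison
```

Note that if $\sigma_p = \sigma_q$, then $T$ is a pure translation with $T' \equiv 1$ and the divergence equals $f(1) = 0$ regardless of the means: the functional measures geometric deformation rather than pointwise density mismatch.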

2. Fundamental Properties and Structure

Transport f-divergences exhibit a suite of structural properties, synthesizing invariance and convexity characteristics from both component theories.

  • Invariance: For any smooth bijection $k: \Omega \to \Omega$, the divergence is invariant under simultaneous pushforward: if $\tilde{p} = k_\# p$ and $\tilde{q} = k_\# q$, then $D_{T,f}(p\|q) = D_{T,f}(\tilde{p}\|\tilde{q})$.
  • Duality: Define $\hat{f}(u) = f(1/u)$. Then $D_{T,f}(p\|q) = D_{T,\hat{f}}(q\|p)$, mirroring the asymmetric structure of classical f-divergences (spot-checked numerically after this list).
  • Convexity: The functional is convex along transport interpolations: if $p_1$ and $p_2$ arise from transport maps $T_1$ and $T_2$ applied to $q$, then the Wasserstein-geodesic mixture $p_\lambda = (T_\lambda)_\# q$, with $T_\lambda = (1-\lambda) T_1 + \lambda T_2$, satisfies

D_{T,f}(p_\lambda \| q) \leq (1-\lambda)\, D_{T,f}(p_1 \| q) + \lambda\, D_{T,f}(p_2 \| q).

  • Additivity: Given convex functions $f_1, f_2$ and a scalar $a > 0$,

D_{T, f_1 + a f_2}(p\|q) = D_{T, f_1}(p\|q) + a\, D_{T, f_2}(p\|q).
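Continuing the sketch from Section 1 (reusing transport_f_divergence, f, Q_p, and Q_q defined there), the duality and additivity properties can be spot-checked numerically:

```python
# Duality: with f_hat(u) = f(1/u), D_{T,f}(p||q) = D_{T,f_hat}(q||p).
f_hat = lambda u: f(1.0 / u)
print(transport_f_divergence(Q_p, Q_q, f))       # D_{T,f}(p||q)
print(transport_f_divergence(Q_q, Q_p, f_hat))   # D_{T,f_hat}(q||p): same value

# Additivity: D_{T, f1 + a*f2}(p||q) = D_{T,f1}(p||q) + a * D_{T,f2}(p||q).
f1 = lambda t: (t - 1.0) ** 2   # another convex function with f1(1) = 0
a = 3.0
f_sum = lambda t: f1(t) + a * f(t)
print(transport_f_divergence(Q_p, Q_q, f_sum))
print(transport_f_divergence(Q_p, Q_q, f1) + a * transport_f_divergence(Q_p, Q_q, f))
```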

3. Variational and Dual Representations

Transport f-divergences admit a variational duality structure akin to that of classical f-divergences but adapted to the transport geometry. Denote by $\hat{f}^*$ the Legendre transform of $\hat{f}$. Then

D_{T,f}(p\|q) = \sup_{\Psi \in C(\Omega)} \left[ \int_\Omega \Psi(x)\, p(T(x))\, dx - \int_\Omega \hat{f}^*(\Psi(x))\, q(x)\, dx \right],

where the optimal dual variable is $\Psi^*(x) = \hat{f}'\big(p(T(x))/q(x)\big) = \hat{f}'(1/T'(x))$. This variational structure is useful algorithmically, for example in learning or inference with models involving transport divergences.
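This representation is the pointwise Legendre identity in disguise (a one-line derivation, assuming $\hat{f}$ is convex so that it equals its biconjugate): since $f(T'(x)) = \hat{f}(1/T'(x)) = \hat{f}\big(p(T(x))/q(x)\big)$ by the Monge–Ampère relation, substituting $\hat{f}(u) = \sup_s [su - \hat{f}^*(s)]$ at $u = p(T(x))/q(x)$ gives

D_{T,f}(p\|q) = \sup_{\Psi} \int_\Omega \left[ \Psi(x)\, \frac{p(T(x))}{q(x)} - \hat{f}^*(\Psi(x)) \right] q(x)\, dx,

and cancelling $q(x)$ against the first term recovers the supremum above, with the maximizer $\Psi^*$ given by $\hat{f}'$ evaluated at the density ratio.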

4. Local Metric and Taylor Expansions

Transport f-divergences possess a well-characterized local (infinitesimal) structure. Suppose $p_\lambda = ((1-\lambda)\,\mathrm{id} + \lambda T)_\# q$ interpolates between the identity map and the transport map $T$. Provided $f(1) = f'(1) = 0$, the divergence behaves as $\lambda \to 0$ like

\lim_{\lambda \to 0} \frac{1}{\lambda^2}\, D_{T,f}(p_\lambda \| q) = \frac{f''(1)}{2} \int_{\Omega} |T'(x) - 1|^2\, q(x)\, dx,

highlighting a quadratic scaling analogous to the Fisher information metric in the classical f-divergence setting. In quantile coordinates,

D_{T,f}(p\|q) = \int_0^1 \left( \frac{f''(1)}{2}\, h(u)^2 + \frac{f'''(1)}{6}\, h(u)^3 \right) du + O(\|h\|^4),

where $h(u) = \dfrac{Q_p'(u) - Q_q'(u)}{Q_q'(u)}$.
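As a quick illustration of the quadratic scaling (reusing f, s_p, and s_q from the Section 1 sketch, for which $f''(1) = 1$; in the Gaussian case $T' \equiv \sigma_p/\sigma_q$ is constant, so $D_{T,f}(p_\lambda \| q) = f(1 + \lambda (T' - 1))$ exactly):

```python
# D_{T,f}(p_lambda || q) / lambda^2 should approach f''(1)/2 * (T' - 1)^2.
delta = s_p / s_q - 1.0                        # T' - 1 (here 1.0)
for lam in (0.1, 0.01, 0.001):
    print(lam, f(1.0 + lam * delta) / lam**2)  # -> 0.5 * delta**2 = 0.5
```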

5. Applications in Mass-Covering and Generative Modeling

Transport f-divergences are particularly suited to applications requiring robust mass covering—that is, avoiding mode collapse and penalizing insufficient coverage of the target distribution's support.

Generative Models: In a generative framework, one often constructs random variables

X = G(Z, \theta_X), \quad Y = G(Z, \theta_Y),

using a latent variable $Z$ drawn from a known reference distribution $p_{\mathrm{ref}}$. Assuming each map $z \mapsto G(z, \theta)$ is monotone increasing, the transport divergence between $p_X$ and $p_Y$ can be written explicitly as

D_{T,f}(p_X \| p_Y) = \mathbb{E}_{Z \sim p_{\mathrm{ref}}}\left[ f\left( \frac{\partial_Z G(Z, \theta_X)}{\partial_Z G(Z, \theta_Y)} \right) \right].

Thus, the divergence detects and penalizes differences in the Jacobians of the generative mappings, quantifying not just pointwise density mismatch but also how the maps distribute mass globally. This property is crucial for ensuring that learned models capture all meaningful modes of the data distribution rather than concentrating on high-density subsets.
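A Monte Carlo sketch of this identity, using a hypothetical monotone generator $G(z, \theta) = z + \theta \tanh(z)$ (any $G$ with $\partial_z G > 0$ would do; this choice and the parameter values are illustrative, with f and np reused from the Section 1 sketch):

```python
# dG/dz = 1 + theta*(1 - tanh(z)^2) > 0 for |theta| < 1, so each G(., theta)
# is monotone and the transport divergence is a plain expectation over the prior.
rng = np.random.default_rng(0)
Z = rng.standard_normal(200_000)          # Z ~ p_ref = N(0, 1)

def dG(z, theta):
    return 1.0 + theta * (1.0 - np.tanh(z) ** 2)

theta_X, theta_Y = 0.5, -0.3
print(np.mean(f(dG(Z, theta_X) / dG(Z, theta_Y))))  # estimate of D_{T,f}(p_X || p_Y)
```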

Mass-Covering: Because the transport f-divergence penalizes local stretching (or compression) of the transport map, it charges a price for any region of the target's support that the model fails to cover. For instance, if a mode present in one distribution is missing from the other, the slope $T'(x)$ must become arbitrarily large (or arbitrarily small) in the affected region, and the divergence inflates accordingly; in this sense, mass covering is compelled by the structure of the divergence.
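This penalty can be made visible numerically. In the sketch below (reusing transport_f_divergence and f from Section 1, where np and norm are imported), a two-mode target is compared against a model that drops one mode and a model that keeps both; quantile functions are obtained by tabulating and inverting each CDF. With the bimodal density in the first slot, the transport map must stretch model mass across the target's low-density valley when a mode is missing, and the divergence grows sharply:

```python
grid = np.linspace(-12.0, 12.0, 8001)

def quantile_from_cdf(cdf_vals):
    # Piecewise-linear quantile function u -> Q(u) from tabulated CDF values.
    return lambda u: np.interp(u, cdf_vals, grid)

Q_target = quantile_from_cdf(0.5 * norm.cdf(grid, -3, 1) + 0.5 * norm.cdf(grid, 3, 1))
Q_drop   = quantile_from_cdf(norm.cdf(grid, -3, 1))    # right mode missing
Q_both   = quantile_from_cdf(0.5 * norm.cdf(grid, -3.2, 1) + 0.5 * norm.cdf(grid, 2.8, 1))

print(transport_f_divergence(Q_target, Q_drop, f))  # large: valley covered by stretching
print(transport_f_divergence(Q_target, Q_both, f))  # small: both modes roughly matched
```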

6. Theoretical Relevance and Perspectives

Transport f-divergences unify key concepts from information geometry and optimal transport. The dependence on $T'$ bridges the gap between classical mismatch metrics (like KL or Hellinger) and Wasserstein distances, inheriting desirable geometric stability and invariance. Their Taylor expansions reveal a local metric structure, positioning them as “second-order” divergences on the space of densities with strong mass-covering semantics. The invariance and variational forms make them compatible with a variety of coordinate systems and optimization pipelines in statistics and machine learning, especially for applications in generative modeling, variational inference, and robust estimation.

7. Summary Table: Key Properties

| Aspect | Transport f-Divergence | Implication for Mass-Covering |
|---|---|---|
| Penalty type | Local stretching via $T'(x)$ | Detects missing or under-represented modes |
| Invariance | Pushforward (bijective change of variables) | Robust to coordinate transformations |
| Variational duality | Supremum form with Legendre dual of $\hat{f}$ | Supports variational estimation and optimization |
| Local metric | Quadratic scaling via $f''(1)$ in $T' - 1$ | Sensitive to “small” mass reallocation |
| Application | Generative models (comparison via Jacobians) | Encourages faithful mass covering |

In conclusion, transport f-divergences offer a powerful, theoretically principled, and practically relevant framework for comparing probability densities and generative models, emphasizing geometric stretching and thus furnishing strong mass-covering guarantees (Li, 22 Apr 2025).

References

  • Li (22 Apr 2025). Transport f-divergences.
