
Mass-Covering f-Divergences

Updated 1 October 2025
  • Mass-Covering f-Divergences are information functionals defined via local geometric deformations using an optimal transport map that aligns probability densities.
  • They combine properties of classical f-divergences with optimal transport to deliver invariance, convexity, and a powerful variational duality for robust statistical inference.
  • In generative modeling, they promote full coverage of the target's mass by penalizing local stretching of the transport map, thereby mitigating mode collapse in complex data distributions.

Transport f-divergences are a recently introduced class of information functionals that blend concepts from classical f-divergences and optimal transport to measure discrepancies between probability densities on one-dimensional sample spaces (Li, 22 Apr 2025). Unlike standard f-divergences, which compare two distributions via pointwise density ratios, transport f-divergences compare local geometric deformations induced by the optimal transport map that aligns one distribution with another. This construction endows transport f-divergences with properties particularly suited for mass-covering applications in statistics, machine learning, and generative modeling.

1. Mathematical Formulation of Transport f-Divergences

Consider smooth, strictly positive probability densities $p$ and $q$ defined on $\Omega \subseteq \mathbb{R}$. Let $T: \Omega \to \Omega$ be the unique monotone increasing transport map satisfying the pushforward condition $T_\# q = p$. Explicitly, $T$ is given by

T(x) = Q_p(F_q(x)),

where $F_q$ is the cumulative distribution function (CDF) of $q$ and $Q_p$ is the quantile function of $p$. The fundamental Monge–Ampère relation holds: $p(T(x)) \cdot T'(x) = q(x)$.

Given a convex function $f: \mathbb{R}_+ \to \mathbb{R}_+$ with $f(1) = 0$, the transport f-divergence is defined by

D_{T,f}(p \| q) = \int_{\Omega} f\left(\frac{q(x)}{p(T(x))}\right) q(x) \, dx.

Using the Monge–Ampère condition, this can be recast as

D_{T,f}(p \| q) = \int_{\Omega} f\big(T'(x)\big)\, q(x) \, dx,

or, equivalently, in terms of quantile densities,

D_{T,f}(p \| q) = \int_0^1 f\left( \frac{Q_p'(u)}{Q_q'(u)} \right) du.

The derivative $T'(x)$ quantifies the local stretching or contraction needed to map $q(x)$ to $p(T(x))$, and thus encapsulates the geometric “effort” required to align the two distributions.
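To make these formulas concrete, here is a minimal numerical sketch (not from the paper; the helper name and the choice $f(t) = t \log t - t + 1$ are purely illustrative) that evaluates the quantile-density form of $D_{T,f}$ by finite differences and checks it against the Gaussian case, where $T$ is affine with constant slope $\sigma_p/\sigma_q$ and the divergence reduces to $f(\sigma_p/\sigma_q)$:

```python
import numpy as np
from scipy.stats import norm

def transport_f_divergence(Q_p, Q_q, f, n=20000, eps=1e-4):
    """Quantile-density form: D_{T,f}(p||q) = int_0^1 f(Q_p'(u) / Q_q'(u)) du.

    Quantile derivatives are approximated by central finite differences, and
    the integral is truncated to [eps, 1 - eps] to avoid tail singularities.
    """
    u = np.linspace(eps, 1.0 - eps, n)
    du = 1e-6
    # The common factor 2*du cancels in the ratio of the two differences.
    ratio = (Q_p(u + du) - Q_p(u - du)) / (Q_q(u + du) - Q_q(u - du))
    return np.mean(f(ratio)) * (u[-1] - u[0])

def f(t):
    # Convex with f(1) = 0; also f'(1) = 0 and f''(1) = 1 (used in Section 4).
    return t * np.log(t) - t + 1.0

# Gaussian check: T(x) = mu_p + (s_p/s_q)(x - mu_q), so T' = s_p/s_q is
# constant and D_{T,f}(p||q) = f(s_p/s_q) exactly.
mu_p, s_p, mu_q, s_q = 0.0, 2.0, 1.0, 1.0
Q_p = lambda u: norm.ppf(u, loc=mu_p, scale=s_p)
Q_q = lambda u: norm.ppf(u, loc=mu_q, scale=s_q)

print(transport_f_divergence(Q_p, Q_q, f))  # ~ f(2) = 2*log(2) - 1 ~ 0.386
print(f(s_p / s_q))                          # closed form for comparison
```

Note that if $\sigma_p = \sigma_q$, then $T$ is a pure translation with $T' \equiv 1$ and the divergence equals $f(1) = 0$ regardless of the means: the functional measures geometric deformation rather than pointwise density mismatch.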

2. Fundamental Properties and Structure

Transport f-divergences exhibit a suite of structural properties, synthesizing invariance and convexity characteristics from both component theories.

  • Invariance: For any smooth bijection $k: \Omega \to \Omega$, the divergence is invariant under simultaneous pushforward: if $\tilde{p} = k_\# p$ and $\tilde{q} = k_\# q$, then $D_{T,f}(p\|q) = D_{T,f}(\tilde{p}\|\tilde{q})$.
  • Duality: Define $\hat{f}(u) = f(1/u)$. Then $D_{T,f}(p\|q) = D_{T,\hat{f}}(q\|p)$, mirroring the asymmetric structure of classical f-divergences (spot-checked numerically after this list).
  • Convexity: The functional is convex along transport interpolations: if $p_1$ and $p_2$ arise from transport maps $T_1$ and $T_2$ applied to $q$, then the Wasserstein-geodesic mixture $p_\lambda = (T_\lambda)_\# q$, with $T_\lambda = (1-\lambda) T_1 + \lambda T_2$, satisfies

D_{T,f}(p_\lambda \| q) \leq (1-\lambda)\, D_{T,f}(p_1 \| q) + \lambda\, D_{T,f}(p_2 \| q).

  • Additivity: Given convex functions $f_1, f_2$ and a scalar $a > 0$,

D_{T, f_1 + a f_2}(p\|q) = D_{T, f_1}(p\|q) + a\, D_{T, f_2}(p\|q).
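Continuing the sketch from Section 1 (reusing transport_f_divergence, f, Q_p, and Q_q defined there), the duality and additivity properties can be spot-checked numerically:

```python
# Duality: with f_hat(u) = f(1/u), D_{T,f}(p||q) = D_{T,f_hat}(q||p).
f_hat = lambda u: f(1.0 / u)
print(transport_f_divergence(Q_p, Q_q, f))       # D_{T,f}(p||q)
print(transport_f_divergence(Q_q, Q_p, f_hat))   # D_{T,f_hat}(q||p): same value

# Additivity: D_{T, f1 + a*f2}(p||q) = D_{T,f1}(p||q) + a * D_{T,f2}(p||q).
f1 = lambda t: (t - 1.0) ** 2   # another convex function with f1(1) = 0
a = 3.0
f_sum = lambda t: f1(t) + a * f(t)
print(transport_f_divergence(Q_p, Q_q, f_sum))
print(transport_f_divergence(Q_p, Q_q, f1) + a * transport_f_divergence(Q_p, Q_q, f))
```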

3. Variational and Dual Representations

Transport f-divergences admit a variational duality structure akin to that of classical f-divergences but adapted to the transport geometry. Denote by $\hat{f}^*$ the Legendre transform of $\hat{f}$. Then

D_{T,f}(p\|q) = \sup_{\Psi \in C(\Omega)} \left[ \int_\Omega \Psi(x)\, p(T(x))\, dx - \int_\Omega \hat{f}^*(\Psi(x))\, q(x)\, dx \right],

where the optimal dual variable is $\Psi^*(x) = \hat{f}'\big(p(T(x))/q(x)\big) = \hat{f}'(1/T'(x))$. This variational structure is useful algorithmically, for example in learning or inference with models involving transport divergences.
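This representation is the pointwise Legendre identity in disguise (a one-line derivation, assuming $\hat{f}$ is convex so that it equals its biconjugate): since $f(T'(x)) = \hat{f}(1/T'(x)) = \hat{f}\big(p(T(x))/q(x)\big)$ by the Monge–Ampère relation, substituting $\hat{f}(u) = \sup_s [su - \hat{f}^*(s)]$ at $u = p(T(x))/q(x)$ gives

D_{T,f}(p\|q) = \sup_{\Psi} \int_\Omega \left[ \Psi(x)\, \frac{p(T(x))}{q(x)} - \hat{f}^*(\Psi(x)) \right] q(x)\, dx,

and cancelling $q(x)$ against the first term recovers the supremum above, with the maximizer $\Psi^*$ given by $\hat{f}'$ evaluated at the density ratio.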

4. Local Metric and Taylor Expansions

Transport f-divergences possess a well-characterized local (infinitesimal) structure. Suppose $p_\lambda = ((1-\lambda)\,\mathrm{id} + \lambda T)_\# q$ interpolates between the identity map and the transport map $T$. Provided $f(1) = f'(1) = 0$, the divergence behaves as $\lambda \to 0$ like

\lim_{\lambda \to 0} \frac{1}{\lambda^2}\, D_{T,f}(p_\lambda \| q) = \frac{f''(1)}{2} \int_{\Omega} |T'(x) - 1|^2\, q(x)\, dx,

highlighting a quadratic scaling analogous to the Fisher information metric in the classical f-divergence setting. In quantile coordinates,

D_{T,f}(p\|q) = \int_0^1 \left( \frac{f''(1)}{2}\, h(u)^2 + \frac{f'''(1)}{6}\, h(u)^3 \right) du + O(\|h\|^4),

where $h(u) = \dfrac{Q_p'(u) - Q_q'(u)}{Q_q'(u)}$.
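As a quick illustration of the quadratic scaling (reusing f, s_p, and s_q from the Section 1 sketch, for which $f''(1) = 1$; in the Gaussian case $T' \equiv \sigma_p/\sigma_q$ is constant, so $D_{T,f}(p_\lambda \| q) = f(1 + \lambda (T' - 1))$ exactly):

```python
# D_{T,f}(p_lambda || q) / lambda^2 should approach f''(1)/2 * (T' - 1)^2.
delta = s_p / s_q - 1.0                        # T' - 1 (here 1.0)
for lam in (0.1, 0.01, 0.001):
    print(lam, f(1.0 + lam * delta) / lam**2)  # -> 0.5 * delta**2 = 0.5
```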

5. Applications in Mass-Covering and Generative Modeling

Transport f-divergences are particularly suited to applications requiring robust mass covering—that is, avoiding mode collapse and penalizing insufficient coverage of the target distribution's support.

Generative Models: In a generative framework, one often constructs random variables

X = G(Z, \theta_X), \quad Y = G(Z, \theta_Y),

using a latent variable $Z$ drawn from a known reference distribution $p_{\mathrm{ref}}$. Assuming each map $z \mapsto G(z, \theta)$ is monotone increasing, the transport divergence between $p_X$ and $p_Y$ can be written explicitly as

D_{T,f}(p_X \| p_Y) = \mathbb{E}_{Z \sim p_{\mathrm{ref}}}\left[ f\left( \frac{\partial_Z G(Z, \theta_X)}{\partial_Z G(Z, \theta_Y)} \right) \right].

Thus, the divergence detects and penalizes differences in the Jacobians of the generative mappings, quantifying not just pointwise density mismatch but also how the maps distribute mass globally. This property is crucial for ensuring that learned models capture all meaningful modes of the data distribution rather than concentrating on high-density subsets.
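A Monte Carlo sketch of this identity, using a hypothetical monotone generator $G(z, \theta) = z + \theta \tanh(z)$ (any $G$ with $\partial_z G > 0$ would do; this choice and the parameter values are illustrative, with f and np reused from the Section 1 sketch):

```python
# dG/dz = 1 + theta*(1 - tanh(z)^2) > 0 for |theta| < 1, so each G(., theta)
# is monotone and the transport divergence is a plain expectation over the prior.
rng = np.random.default_rng(0)
Z = rng.standard_normal(200_000)          # Z ~ p_ref = N(0, 1)

def dG(z, theta):
    return 1.0 + theta * (1.0 - np.tanh(z) ** 2)

theta_X, theta_Y = 0.5, -0.3
print(np.mean(f(dG(Z, theta_X) / dG(Z, theta_Y))))  # estimate of D_{T,f}(p_X || p_Y)
```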

Mass-Covering: Because the transport f-divergence penalizes local stretching (or compression) of the transport map, it charges a price for any region of the target's support that the model fails to cover. For instance, if a mode present in one distribution is missing from the other, the slope $T'(x)$ must become arbitrarily large (or arbitrarily small) in the affected region, and the divergence inflates accordingly; in this sense, mass covering is compelled by the structure of the divergence.
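This penalty can be made visible numerically. In the sketch below (reusing transport_f_divergence and f from Section 1, where np and norm are imported), a two-mode target is compared against a model that drops one mode and a model that keeps both; quantile functions are obtained by tabulating and inverting each CDF. With the bimodal density in the first slot, the transport map must stretch model mass across the target's low-density valley when a mode is missing, and the divergence grows sharply:

```python
grid = np.linspace(-12.0, 12.0, 8001)

def quantile_from_cdf(cdf_vals):
    # Piecewise-linear quantile function u -> Q(u) from tabulated CDF values.
    return lambda u: np.interp(u, cdf_vals, grid)

Q_target = quantile_from_cdf(0.5 * norm.cdf(grid, -3, 1) + 0.5 * norm.cdf(grid, 3, 1))
Q_drop   = quantile_from_cdf(norm.cdf(grid, -3, 1))    # right mode missing
Q_both   = quantile_from_cdf(0.5 * norm.cdf(grid, -3.2, 1) + 0.5 * norm.cdf(grid, 2.8, 1))

print(transport_f_divergence(Q_target, Q_drop, f))  # large: valley covered by stretching
print(transport_f_divergence(Q_target, Q_both, f))  # small: both modes roughly matched
```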

6. Theoretical Relevance and Perspectives

Transport f-divergences unify key concepts from information geometry and optimal transport. The dependence on $T'$ bridges the gap between classical mismatch metrics (like KL or Hellinger) and Wasserstein distances, inheriting desirable geometric stability and invariance. Their Taylor expansions reveal a local metric structure, positioning them as “second-order” divergences on the space of densities with strong mass-covering semantics. The invariance and variational forms make them compatible with a variety of coordinate systems and optimization pipelines in statistics and machine learning, especially for applications in generative modeling, variational inference, and robust estimation.

7. Summary Table: Key Properties

| Aspect | Transport f-Divergence | Implication for Mass-Covering |
|---|---|---|
| Penalty type | Local stretching via $T'(x)$ | Detects missing or under-represented modes |
| Invariance | Pushforward (bijective change of variables) | Robust to coordinate transformations |
| Variational duality | Supremum form with Legendre dual of $\hat{f}$ | Supports variational estimation and optimization |
| Local metric | Quadratic scaling via $f''(1)$ in $T' - 1$ | Sensitive to “small” mass reallocation |
| Application | Generative models (comparison via Jacobians) | Encourages faithful mass covering |

In conclusion, transport f-divergences offer a powerful, theoretically principled, and practically relevant framework for comparing probability densities and generative models, emphasizing geometric stretching and thus furnishing strong mass-covering guarantees (Li, 22 Apr 2025).

References

  • Li (22 Apr 2025). Transport f-divergences.
