Mass-Covering f-Divergences
- Mass-covering (transport) f-divergences are information functionals defined via the local geometric deformation induced by the optimal transport map that aligns one probability density with another.
- They combine properties of classical f-divergences with optimal transport to deliver invariance, convexity, and a powerful variational duality for robust statistical inference.
- In generative modeling, they promote complete mass coverage by penalizing local transport distortions, thereby mitigating mode collapse in complex data distributions.
Transport f-divergences are a recently introduced class of information functionals that blend concepts from classical f-divergences and optimal transport to measure discrepancies between probability densities on one-dimensional sample spaces (Li, 22 Apr 2025). Unlike standard f-divergences, which compare two distributions via pointwise density ratios, transport f-divergences compare local geometric deformations induced by the optimal transport map that aligns one distribution with another. This construction endows transport f-divergences with properties particularly suited for mass-covering applications in statistics, machine learning, and generative modeling.
1. Mathematical Formulation of Transport f-Divergences
Consider smooth, strictly positive probability densities $p$ and $q$ defined on $\mathbb{R}$. Let $T\colon\mathbb{R}\to\mathbb{R}$ be the unique monotone increasing transport map such that the pushforward $T_{\#}p = q$. Explicitly, $T$ is given by

$$T(x) = F_q^{-1}\big(F_p(x)\big),$$

where $F_p$ is the CDF of $p$ and $F_q^{-1}$ is the quantile function of $q$. The fundamental Monge–Ampère relation holds:

$$q\big(T(x)\big)\,T'(x) = p(x).$$

Given a convex function $f\colon(0,\infty)\to\mathbb{R}$ with $f(1)=0$, the transport f-divergence is defined by

$$D_f(p\,\|\,q) = \int_{\mathbb{R}} f\big(T'(x)\big)\,p(x)\,dx.$$

Using the Monge–Ampère condition, this can be recast as

$$D_f(p\,\|\,q) = \int_{\mathbb{R}} f\!\left(\frac{p(x)}{q(T(x))}\right) p(x)\,dx,$$

or, equivalently, in terms of quantile densities,

$$D_f(p\,\|\,q) = \int_0^1 f\!\left(\frac{(F_q^{-1})'(u)}{(F_p^{-1})'(u)}\right) du.$$

The derivative $T'(x)$ quantifies the local stretching or contraction needed to map $p$ to $q$, and thus encapsulates the geometric “effort” required to align the two distributions.
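To make the quantile-coordinate formula concrete, here is a minimal numerical sketch (illustrative Python, not code from the cited paper; the helper name `transport_f_divergence`, the grid resolution, and the finite-difference scheme are assumptions of this illustration). For two Gaussians the monotone map is affine with constant slope, so the divergence should collapse to a single evaluation of $f$:

```python
# A minimal sketch of the quantile-coordinate formula
#   D_f(p||q) = \int_0^1 f( (F_q^{-1})'(u) / (F_p^{-1})'(u) ) du,
# approximated by finite differences on a uniform grid in (0, 1).
import numpy as np
from scipy.stats import norm

def transport_f_divergence(quantile_p, quantile_q, f, n=100_000):
    """Approximate D_f(p||q) from the quantile functions F_p^{-1}, F_q^{-1}."""
    u = np.linspace(1e-6, 1 - 1e-6, n)
    du = u[1] - u[0]
    dqp = np.gradient(quantile_p(u), du)   # quantile density (F_p^{-1})'(u)
    dqq = np.gradient(quantile_q(u), du)   # quantile density (F_q^{-1})'(u)
    return np.mean(f(dqq / dqp)) * (u[-1] - u[0])

# p = N(0, 1), q = N(1, 2^2): the monotone map is x -> 1 + 2x, so T' = 2
# everywhere and D_f(p||q) should equal f(2) exactly.
f = lambda t: (t - 1) ** 2                 # convex choice with f(1) = 0
print(transport_f_divergence(norm(0, 1).ppf, norm(1, 2).ppf, f))  # ~ 1.0 = f(2)
```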
2. Fundamental Properties and Structure
Transport f-divergences exhibit a suite of structural properties, synthesizing invariance and convexity characteristics from both component theories.
- Invariance: For any smooth, strictly monotone bijection $\Phi\colon\mathbb{R}\to\mathbb{R}$, the divergence is invariant under pushforwards: if $\tilde p = \Phi_{\#}p$ and $\tilde q = \Phi_{\#}q$, then $D_f(\tilde p\,\|\,\tilde q) = D_f(p\,\|\,q)$.
- Duality: Define $\tilde f(t) := f(1/t)$. Then $D_f(p\,\|\,q) = D_{\tilde f}(q\,\|\,p)$, mirroring the asymmetric structure of classical f-divergences.
- Convexity: The functional is convex along Wasserstein geodesics: if $q_0 = (T_0)_{\#}p$ and $q_1 = (T_1)_{\#}p$ correspond to different transports from $p$, then any Wasserstein-geodesic mixture $q_\lambda = \big((1-\lambda)T_0 + \lambda T_1\big)_{\#}p$ (where $0 \le \lambda \le 1$) satisfies $D_f(p\,\|\,q_\lambda) \le (1-\lambda)\,D_f(p\,\|\,q_0) + \lambda\,D_f(p\,\|\,q_1)$.
- Additivity: Given convex functions $f_1, f_2$ and a scalar $\lambda > 0$, $D_{f_1 + \lambda f_2}(p\,\|\,q) = D_{f_1}(p\,\|\,q) + \lambda\,D_{f_2}(p\,\|\,q)$; the divergence is linear in the generator $f$. Both the duality and additivity identities are spot-checked numerically below.
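As promised above, a quick numerical spot-check of the duality and additivity identities, reusing the hypothetical `transport_f_divergence` helper from the sketch in Section 1 (illustrative code, not from the cited paper):

```python
# Spot-checks: duality D_f(p||q) = D_f~(q||p) with f~(t) = f(1/t), and
# linearity of the divergence in the generator f.
from scipy.stats import norm

p, q = norm(0, 1), norm(1, 2)
f  = lambda t: (t - 1) ** 2
fd = lambda t: f(1.0 / t)                  # dual generator f~(t) = f(1/t)

lhs = transport_f_divergence(p.ppf, q.ppf, f)
rhs = transport_f_divergence(q.ppf, p.ppf, fd)
print(abs(lhs - rhs) < 1e-6)               # duality holds

g, lam = (lambda t: abs(t - 1)), 3.0
both = transport_f_divergence(p.ppf, q.ppf, lambda t: f(t) + lam * g(t))
print(abs(both - (lhs + lam * transport_f_divergence(p.ppf, q.ppf, g))) < 1e-6)
```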
3. Variational and Dual Representations
Transport f-divergences admit a variational duality structure akin to that of classical f-divergences but adapted to the transport geometry. Denote by $f^*(s) = \sup_t\{st - f(t)\}$ the Legendre transform of $f$. Then

$$D_f(p\,\|\,q) = \sup_{\phi} \int_{\mathbb{R}} \Big[\phi(x)\,T'(x) - f^*\big(\phi(x)\big)\Big]\,p(x)\,dx,$$

where the supremum ranges over measurable test functions and the optimal dual variable is given by $\phi^*(x) = f'\big(T'(x)\big)$. This variational structure is beneficial for algorithmic purposes, such as learning or inference in models involving transport divergences.
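For illustration, the sketch below evaluates the bracketed dual objective in quantile coordinates for $f(t) = (t-1)^2$, whose Legendre transform works out to $f^*(s) = s + s^2/4$ and whose optimal dual variable is $\phi^* = f'(T') = 2(T'-1)$ (the setup and names are assumptions of this sketch, not the paper's code):

```python
# Variational lower bound: any dual phi gives
#   \int [phi * T' - f*(phi)] p dx <= D_f(p||q), with equality at phi* = f'(T').
# Here p = N(0, 1), q = N(1, 2^2), so T' = 2 identically in quantile coordinates.
import numpy as np

fstar = lambda s: s + s ** 2 / 4        # Legendre transform of f(t) = (t-1)^2
u = np.linspace(1e-6, 1 - 1e-6, 100_000)
r = np.full_like(u, 2.0)                # T' along the transport, constant here

def dual_value(phi):
    return np.mean(phi * r - fstar(phi)) * (u[-1] - u[0])

print(dual_value(np.ones_like(u)))      # suboptimal phi = 1: bound ~ 0.75 < 1
print(dual_value(2 * (r - 1)))          # optimal phi*: recovers D_f = f(2) = 1.0
```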
4. Local Metric and Taylor Expansions
Transport f-divergences possess a well-characterized local (infinitesimal) structure. Suppose $T_\epsilon(x) = x + \epsilon\,\psi(x)$ interpolates between the identity and a transport map, with $q_\epsilon = (T_\epsilon)_{\#}p$. Provided $f(1) = f'(1) = 0$ and $f''(1) > 0$, the divergence behaves for small $\epsilon$ as

$$D_f(p\,\|\,q_\epsilon) = \frac{f''(1)}{2}\,\epsilon^2 \int_{\mathbb{R}} \psi'(x)^2\,p(x)\,dx + O(\epsilon^3),$$

highlighting a quadratic scaling analogous to the Fisher information metric in the classical f-divergence setting. In quantile coordinates,

$$D_f(p\,\|\,q_\epsilon) = \frac{f''(1)}{2}\,\epsilon^2 \int_0^1 h(u)^2\,du + O(\epsilon^3),$$

where $h(u) = \psi'\big(F_p^{-1}(u)\big)$.
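A Monte Carlo check of the quadratic scaling (an illustrative sketch; the perturbation $\psi(x) = \tanh(x)$ and the generator $f(t) = t\log t - t + 1$, which satisfies $f(1) = f'(1) = 0$ and $f''(1) = 1$, are arbitrary choices consistent with the assumptions above):

```python
# Compare D_f(p || q_eps) = E_p[f(1 + eps * psi'(x))] against the leading
# Taylor term (f''(1)/2) * eps^2 * E_p[psi'(x)^2] for shrinking eps.
import numpy as np

f = lambda t: t * np.log(t) - t + 1
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)       # samples from p = N(0, 1)
psi_prime = 1 / np.cosh(x) ** 2          # psi(x) = tanh(x), so T_eps stays monotone

for eps in (0.2, 0.1, 0.05):
    exact = np.mean(f(1 + eps * psi_prime))          # Monte Carlo divergence
    quad = 0.5 * eps ** 2 * np.mean(psi_prime ** 2)  # quadratic approximation
    print(f"eps={eps}: exact={exact:.2e}, quadratic={quad:.2e}")
```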
5. Applications in Mass-Covering and Generative Modeling
Transport f-divergences are particularly suited to applications requiring robust mass covering—that is, avoiding mode collapse and penalizing insufficient coverage of the target distribution's support.
Generative Models: In a generative framework, one often constructs random variables

$$X = g_1(Z), \qquad Y = g_2(Z),$$

using a latent variable $Z$ drawn from a known reference $\mu$, with $g_1, g_2$ smooth and monotone increasing. The transport divergence between $p = \mathrm{Law}(X)$ and $q = \mathrm{Law}(Y)$ can be explicitly written as

$$D_f(p\,\|\,q) = \int f\!\left(\frac{g_2'(z)}{g_1'(z)}\right)\mu(z)\,dz,$$

since the monotone transport map aligning the two laws is $T = g_2\circ g_1^{-1}$.
Thus, the divergence detects and penalizes differences in the Jacobians of the generative mappings, quantifying not just pointwise density mismatch but also how the maps distribute mass globally. This property is crucial for ensuring that learned models capture all meaningful modes of the data distribution rather than concentrating on high-density subsets.
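A minimal sketch of the Jacobian form, assuming two simple monotone generators sharing a Gaussian latent (the maps $g_1, g_2$ below are illustrative choices, not from the cited paper); the expectation depends only on the ratio of the generators' derivatives:

```python
# Monte Carlo evaluation of D_f = E_{z ~ mu}[ f( g2'(z) / g1'(z) ) ].
import numpy as np

f = lambda t: (t - 1) ** 2
rng = np.random.default_rng(0)
z = rng.standard_normal(500_000)         # latent z ~ mu = N(0, 1)

g1_prime = np.exp(z)                     # g1(z) = exp(z): X = g1(Z) is lognormal
g2_prime = 2 * np.exp(z)                 # g2(z) = 2 exp(z): Y = 2X

print(np.mean(f(g2_prime / g1_prime)))   # ratio is 2 everywhere, so D_f = f(2) = 1.0
```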
Mass-Covering: Because the transport f-divergence penalizes local stretching (or compression) of the transport, it imposes a penalty for any region of the target that the model fails to cover. For instance, if the model $p$ misses a mode of the target $q$, the derivative $T'$ of the map that stretches the model's mass onto $q$ blows up over the missing region, leading to divergence inflation; “mass covering” is thus compelled by the structure of the divergence. A sample-based illustration follows.
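To see the mass-covering penalty in action, the sketch below estimates the divergence from empirical quantile functions (sorted samples standing in for $F^{-1}$ is an assumption of this illustration, as is the estimator itself): a model reproducing both modes of a bimodal target scores near zero, while a model collapsed onto one mode forces huge values of $T'$ across the missing mode and inflates the divergence.

```python
# Sample-based divergence estimate: sorted samples give empirical quantile
# functions, and k-sample difference quotients approximate the quantile
# densities, whose ratio approximates T' at matching quantile levels.
import numpy as np

rng = np.random.default_rng(0)
n, k = 100_000, 500

def est_divergence(x_model, x_target, f):
    qp = np.sort(x_model)                # empirical F_p^{-1} (model)
    qq = np.sort(x_target)               # empirical F_q^{-1} (target)
    ratio = (qq[k:] - qq[:-k]) / (qp[k:] - qp[:-k])
    return np.mean(f(ratio))

f = lambda t: (t - 1) ** 2
bimodal = lambda: np.where(rng.random(n) < 0.5,
                           rng.normal(-3, 1, n), rng.normal(3, 1, n))

target = bimodal()
covering = bimodal()                     # model that covers both modes
collapsed = rng.normal(3, 1, n)          # model missing the mode at -3

print(est_divergence(covering, target, f))   # small: ratio stays near 1
print(est_divergence(collapsed, target, f))  # orders of magnitude larger
```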
6. Theoretical Relevance and Perspectives
Transport f-divergences unify key concepts from information geometry and optimal transport. The dependency on $T'$ bridges the gap between classical mismatch metrics (like KL or Hellinger) and Wasserstein distances, inheriting desirable geometric stability and invariance. Their Taylor expansions evidence a local metric structure, positioning them as “second-order” divergences on the space of densities with strong mass-covering semantics. The invariance and variational forms make them compatible with a variety of coordinate systems and optimization pipelines in statistics and machine learning, especially for applications in generative modeling, variational inference, and robust estimation.
7. Summary Table: Key Properties
| Aspect | Transport f-Divergence | Implication for Mass-Covering |
|---|---|---|
| Penalty Type | Local stretching via $T'$ | Detects missing/under-represented modes |
| Invariance | Pushforward (bijective change of variables) | Robust to coordinate transformations |
| Variational duality | Supremum form with Legendre dual $f^*$ of $f$ | Supports variational estimation and optimization |
| Local metric | Quadratic scaling via $f''(1)$ in $\epsilon$ | Sensitive to “small” mass reallocation |
| Application | Generative models (comparisons via Jacobians) | Encourages faithful mass covering |
In conclusion, transport f-divergences offer a powerful, theoretically principled, and practically relevant framework for comparing probability densities and generative models, emphasizing geometric stretching and thus furnishing strong mass-covering guarantees (Li, 22 Apr 2025).