Integral Probability Metrics Overview
- Integral Probability Metrics (IPMs) are distances between probability distributions, defined as the supremum, over a chosen function class, of the difference in expectations under the two distributions.
- They include metrics like Wasserstein, MMD, Dudley, and Total Variation, each designed for specific regularity and computation needs.
- IPMs underpin practical applications in machine learning and statistics by linking classifier risk with distributional discrepancies and enabling robust empirical estimation.
Integral Probability Metrics (IPMs) are a broad class of probability distances defined by taking the supremum of the difference in expectations of a function over two probability measures, where the function class is specified to capture certain regularity, geometry, or statistical properties. IPMs encompass distance measures such as the Wasserstein distance, Dudley metric, Total Variation, Maximum Mean Discrepancy, and others, and have become central to modern probability theory, machine learning, and statistics for tasks involving comparison, estimation, and learning with probability distributions.
1. Definition and Representative Instances
An Integral Probability Metric between probability measures $P$ and $Q$ on a measurable space $M$ is given by
$$\gamma_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} \left| \int_M f \, dP - \int_M f \, dQ \right|,$$
where $\mathcal{F}$ is a class of real-valued, bounded, measurable functions on $M$.
Examples of IPMs include:
- Wasserstein Distance ($W$):
  - $\mathcal{F} = \{f : \|f\|_L \le 1\}$, the functions with Lipschitz constant at most 1.
  - Dual (Kantorovich–Rubinstein) form: $W(P, Q) = \inf_{\mu \in \mathcal{L}(P, Q)} \int \rho(x, y)\, d\mu(x, y)$, the minimal cost of a coupling of $P$ and $Q$.
- Dudley Metric ($\beta$):
  - $\mathcal{F} = \{f : \|f\|_{BL} \le 1\}$, where $\|f\|_{BL} = \|f\|_\infty + \|f\|_L$.
- Maximum Mean Discrepancy (MMD):
  - $\mathcal{F}$ is the unit ball in a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ with kernel $k$.
  - $\mathrm{MMD}(P, Q) = \left\| \int k(\cdot, x)\, dP(x) - \int k(\cdot, x)\, dQ(x) \right\|_{\mathcal{H}}$, the RKHS distance between the kernel mean embeddings.
- Total Variation Distance (TV):
  - $\mathcal{F} = \{f : \|f\|_\infty \le 1\}$.
  - $TV(P, Q) = \sup_{\|f\|_\infty \le 1} \left| \int f\, dP - \int f\, dQ \right|$, which equals $\sup_A |P(A) - Q(A)|$ up to a convention-dependent factor of 2.
- Kolmogorov Distance:
  - $\mathcal{F}$ is the set of indicator functions of half-infinite intervals $(-\infty, t]$ on $\mathbb{R}$, so the metric is the supremum distance between the two CDFs (see the code sketch at the end of this section).
This framework includes many classical distances as special cases and can be tuned to the application by selecting $\mathcal{F}$ appropriately.
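As a concrete illustration of the definition, here is a minimal sketch (not from the paper; the function name is ours) that computes the empirical Kolmogorov distance between two 1-D samples. Because the difference of the two empirical CDFs only changes at sample points, the supremum in the IPM definition reduces to comparing the CDFs at the pooled sample values.

```python
import numpy as np

def kolmogorov_ipm(x, y):
    """Empirical Kolmogorov distance: the IPM over indicators of (-inf, t].

    The difference of the two empirical CDFs is piecewise constant and only
    changes at sample points, so the supremum over t is attained at one of
    the pooled sample values.
    """
    x, y = np.sort(np.asarray(x)), np.sort(np.asarray(y))
    ts = np.concatenate([x, y])                           # candidate thresholds t
    F_x = np.searchsorted(x, ts, side="right") / len(x)   # empirical CDF of x at each t
    F_y = np.searchsorted(y, ts, side="right") / len(y)   # empirical CDF of y at each t
    return float(np.abs(F_x - F_y).max())

rng = np.random.default_rng(0)
print(kolmogorov_ipm(rng.normal(0.0, 1.0, 500), rng.normal(0.5, 1.0, 500)))
```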
2. Comparison with φ-Divergences and Theoretical Distinctions
The paper establishes that, except for total variation, IPMs and φ-divergences (divergences defined by a convex function $\phi$) are fundamentally distinct. A φ-divergence has the form
$$D_\phi(P, Q) = \int_M \phi\!\left(\frac{dP}{dQ}\right) dQ$$
for convex $\phi$ (when $P$ is absolutely continuous with respect to $Q$). Only the total variation distance is both an IPM and a non-trivial φ-divergence; all other principal φ-divergences (KL, χ², etc.) do not belong to the IPM family. This has important consequences:
- Properties of φ-divergences (such as those relying on absolute continuity) do not carry over to generic IPMs.
- In high-dimensional or low-overlap settings, φ-divergences may become infinite or poorly behaved, whereas IPMs (e.g., Wasserstein) remain defined and well-behaved.
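A quick numerical illustration of this point (a minimal sketch using NumPy and SciPy; the two point-mass distributions are made up for illustration): when the supports are disjoint, the KL divergence is infinite while the Wasserstein-1 distance stays small and finite.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions; infinite if P has mass where Q has none."""
    if np.any((p > 0) & (q == 0)):
        return np.inf
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two point masses at 0.0 and 0.1: disjoint supports, but geometrically close.
support = np.array([0.0, 0.1])
p = np.array([1.0, 0.0])   # P puts all its mass at 0.0
q = np.array([0.0, 1.0])   # Q puts all its mass at 0.1

print("KL(P || Q) =", kl_divergence(p, q))                           # inf
print("W1(P, Q)   =", wasserstein_distance(support, support, p, q))  # 0.1
```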
3. Empirical Estimation and Computational Aspects
Given i.i.d. samples $X_1, \dots, X_m \sim P$ and $Y_1, \dots, Y_n \sim Q$, form the empirical measures $P_m = \frac{1}{m}\sum_{i=1}^m \delta_{X_i}$ and $Q_n = \frac{1}{n}\sum_{j=1}^n \delta_{Y_j}$, and estimate $\gamma_{\mathcal{F}}(P, Q)$ by the plug-in quantity $\gamma_{\mathcal{F}}(P_m, Q_n)$.
Estimation strategies depend on the chosen $\mathcal{F}$:
- Wasserstein/Dudley: Reduction to a linear program whose constraints reflect the Lipschitz or bounded-Lipschitz conditions (a sketch of this LP appears at the end of this section).
- MMD: Closed-form, unbiased estimator involving only kernel evaluations, scalable to high dimensions (see the sketch after this list).
- Total Variation: Direct empirical version is not always strongly consistent.
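Here is a minimal sketch of the standard unbiased (U-statistic) estimator of squared MMD with a Gaussian kernel; the function name, bandwidth choice, and test data are ours, not the paper's.

```python
import numpy as np

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of squared MMD with a Gaussian RBF kernel.

    Uses the U-statistic form: the diagonal is excluded from the
    within-sample averages, so the estimate can be slightly negative.
    """
    def gram(a, b):
        sq_dists = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
        return np.exp(-sq_dists / (2 * bandwidth**2))

    m, n = len(x), len(y)
    k_xx, k_yy, k_xy = gram(x, x), gram(y, y), gram(x, y)
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_xx + term_yy - 2 * k_xy.mean()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 5))
y = rng.normal(0.3, 1.0, size=(200, 5))
print(mmd2_unbiased(x, y))
```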
Consistency and Convergence:
- For MMD: Parametric convergence rate $O_P(m^{-1/2} + n^{-1/2})$, independent of the dimension of the sample space.
- For Wasserstein and Dudley: A slower, dimension-dependent rate whose exponent deteriorates as the dimension $d$ of the sample space grows.
- Empirical estimators can be computed efficiently for all sample sizes and tend to outperform φ-divergence estimators, which struggle in high dimensions and with disjoint supports.
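The linear-program reduction mentioned above can be sketched as follows (a minimal SciPy-based implementation of the plug-in Wasserstein-1 estimate; the function name and the pairwise-constraint formulation are our illustration, not code from the paper): maximize $\frac{1}{m}\sum_i f(X_i) - \frac{1}{n}\sum_j f(Y_j)$ over the values of $f$ at the pooled sample points, subject to Lipschitz constraints between every pair of points.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.sparse import coo_matrix
from scipy.spatial.distance import cdist

def empirical_wasserstein_lp(x, y):
    """Plug-in Wasserstein-1 estimate between two samples via a linear program.

    Variables are the values f(z_i) at the pooled points z; the constraints
    f(z_i) - f(z_j) <= ||z_i - z_j|| enforce the 1-Lipschitz condition on
    the pooled sample.
    """
    z = np.vstack([x, y])                      # pooled sample points
    m, n, N = len(x), len(y), len(x) + len(y)
    d = cdist(z, z)                            # pairwise Euclidean distances

    # linprog minimizes, so negate the IPM objective (1/m) sum f(x) - (1/n) sum f(y).
    c = np.concatenate([np.full(m, -1.0 / m), np.full(n, 1.0 / n)])

    # One inequality f_i - f_j <= d_ij for every ordered pair i != j.
    rows, cols, vals, b = [], [], [], []
    row = 0
    for i in range(N):
        for j in range(N):
            if i != j:
                rows += [row, row]
                cols += [i, j]
                vals += [1.0, -1.0]
                b.append(d[i, j])
                row += 1
    A = coo_matrix((vals, (rows, cols)), shape=(row, N))

    res = linprog(c, A_ub=A, b_ub=np.array(b), bounds=(None, None), method="highs")
    return -res.fun

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(30, 2))
y = rng.normal(0.5, 1.0, size=(30, 2))
print(empirical_wasserstein_lp(x, y))
```

The constraint set grows quadratically with the pooled sample size, which is one reason the closed-form MMD estimator above scales better to large samples.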
4. Connection to Binary Classification
The IPM between the two class-conditional distributions $P$ and $Q$ is directly related to the minimal risk of a binary classifier drawn from a function class $\mathcal{F}$:
$$\gamma_{\mathcal{F}}(P, Q) = -\inf_{f \in \mathcal{F}} R_L(f),$$
where $R_L$ is the risk under a specifically chosen loss function $L$ that depends on the class priors (a short derivation with one concrete loss choice is given after the list below). Thus:
- The IPM reflects the best achievable (negative) classification risk for a given class of functions.
- For different choices of $\mathcal{F}$, this yields:
- Wasserstein: Optimal risk for Lipschitz classifiers.
- MMD: For kernel classifiers.
- TV: For all bounded classifiers.
- The "margin" or smoothness of the optimal classifier is linked to the IPM—lower IPM implies higher complexity (less smooth) classifiers are needed to distinguish the classes.
5. Practical Implications and Applications
Statistical advantages:
- Robustness: IPMs behave stably across all pairs of distributions, including those with little or no overlap, where φ-divergences can become infinite or uninformative.
- Computational feasibility: Linear programs or closed-form solutions make many IPMs readily usable in practice.
- Dimension-independence: MMD and similar kernelized IPMs are attractive in very high-dimensional contexts.
Applications:
- Hypothesis testing, two-sample tests, goodness-of-fit.
- GANs and generative models, where IPMs serve as training objectives measuring the fit between real and generated distributions.
- Image processing, neuroscience, and any field where model-data comparison is central.
- Binary classification, with risk-minimization directly linked to measurable probabilistic distance between classes.
Summary Table of Major IPMs
| Metric | Function Class ($\mathcal{F}$) | Key Properties |
|---|---|---|
| Wasserstein ($W$) | 1-Lipschitz functions | Weak topology, geometric interpretation |
| Dudley ($\beta$) | Bounded-Lipschitz ball ($\|f\|_{BL} \le 1$) | Metrizes the weak topology |
| MMD | Unit ball in an RKHS | Efficient; kernel two-sample tests |
| Total Variation | $\|f\|_\infty \le 1$ | Strongest of these; not always computable |
| Kolmogorov | Indicators of $(-\infty, t]$ | 1-D; empirical-CDF based |
In conclusion, IPMs constitute a unified, effective, and tractable framework for quantifying distance between probability measures. They provide statistical, computational, and interpretive advantages over classical φ-divergences, and their deep connection to binary classification risk grounds them in decision-theoretic perspectives. For practical problems involving high dimensional data, small sample regimes, or low class overlap, IPMs—particularly Wasserstein, MMD, and suitable custom function classes—are often preferred for both empirical and theoretical analyses.