Characteristic Kernels for Probability Distributions
- Characteristic kernels are reproducing kernels that injectively embed probability distributions into an RKHS, ensuring distinct representation via metrics like the Maximum Mean Discrepancy.
- They underlie consistent statistical tests and kernel methods, enabling applications such as two-sample testing, independence detection, and clustering on complex or structured data.
- Their effectiveness is rooted in properties like integral strict positive definiteness and complete Fourier support, which guarantee the reliability and robustness of kernel-based learning.
A characteristic kernel for probability distributions is a reproducing kernel that induces an injective mean embedding from probability measures into a reproducing kernel Hilbert space (RKHS), ensuring that the distance between embeddings distinguishes distributions uniquely. Characteristic kernels underpin a wide range of kernel methods for distributional data, enabling principled metrics (such as the Maximum Mean Discrepancy, MMD), consistent statistical testing, and kernel-based learning on distributions.
1. Mathematical Definition and Criteria
A bounded measurable kernel $k$ on a measurable space $\mathcal{X}$ is characteristic if the mapping
$$P \mapsto \mu_P := \int_{\mathcal{X}} k(\cdot, x)\, dP(x)$$
is injective on the set of probability measures on $\mathcal{X}$. That is, for all probability measures $P, Q$,
$$\gamma_k(P, Q) = 0 \iff P = Q,$$
where
$$\gamma_k(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}$$
is the RKHS-normed distance between the embeddings.
Conditions under which a kernel is characteristic include:
- Integrally strictly positive definite (ispd): For all finite nonzero signed measures $\mu$ on $\mathcal{X}$, $\iint_{\mathcal{X} \times \mathcal{X}} k(x, y)\, d\mu(x)\, d\mu(y) > 0$.
- Translation-invariant kernels on $\mathbb{R}^d$: $k(x, y) = \psi(x - y)$ is characteristic if and only if the support of the Fourier transform of $\psi$ covers all of $\mathbb{R}^d$: $\operatorname{supp}(\widehat{\psi}) = \mathbb{R}^d$.
- Compact torus $\mathbb{T}^d$: The associated Fourier coefficients must be positive everywhere except possibly at zero.
Examples: Gaussian, Laplacian, Matérn, inverse multiquadric, and certain compactly supported B-spline kernels are characteristic.
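As an illustrative sketch (not from the source, with assumed function names and bandwidth), the snippet below contrasts a characteristic Gaussian kernel with a non-characteristic linear kernel $k(x, y) = xy$, whose mean embedding records only a distribution's mean: two distributions with equal means but different variances are separated by the Gaussian-kernel MMD (introduced in the next section) but not by the linear-kernel MMD.

```python
# Illustration only: a characteristic (Gaussian) kernel separates two
# distributions that a non-characteristic (linear) kernel cannot, because
# the linear-kernel mean embedding records only the mean of a distribution.
import numpy as np

rng = np.random.default_rng(0)

def mmd2_biased(x, y, kernel):
    """Biased (V-statistic) estimate of MMD^2 between 1-D samples x and y."""
    kxx, kyy, kxy = kernel(x, x), kernel(y, y), kernel(x, y)
    return kxx.mean() + kyy.mean() - 2.0 * kxy.mean()

def gaussian(a, b, sigma=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma ** 2))

def linear(a, b):
    return a[:, None] * b[None, :]

p = rng.normal(0.0, 1.0, size=2000)   # N(0, 1)
q = rng.normal(0.0, 3.0, size=2000)   # N(0, 9): same mean, different variance

print("Gaussian MMD^2:", mmd2_biased(p, q, gaussian))   # clearly positive
print("Linear   MMD^2:", mmd2_biased(p, q, linear))     # approximately zero
```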
2. Hilbert Space Embedding of Probability Measures
Probability measures are embedded as mean elements in the RKHS:
$$\mu_P = \int_{\mathcal{X}} k(\cdot, x)\, dP(x) \in \mathcal{H}.$$
This generalizes the embedding given by the classical characteristic function, especially for translation-invariant kernels.
For any probability measures $P, Q$, the distance between their embeddings,
$$\gamma_k(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}},$$
with
$$\gamma_k^2(P, Q) = \mathbb{E}_{x, x' \sim P}[k(x, x')] + \mathbb{E}_{y, y' \sim Q}[k(y, y')] - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}[k(x, y)],$$
is the Maximum Mean Discrepancy (MMD). For translation-invariant $k$, this can be expressed as a weighted $L^2$ distance between the characteristic functions of $P$ and $Q$, with weight given by the spectral measure of the kernel.
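A minimal sketch, assuming i.i.d. samples from $P$ and $Q$ and a Gaussian kernel, of the standard unbiased U-statistic estimator of $\gamma_k^2$ corresponding to the expectation form above; the helper names (`gaussian_gram`, `mmd2_unbiased`) are my own.

```python
# A sketch (assumed, not from the source) of the unbiased U-statistic
# estimator of MMD^2, matching the expectation form of gamma_k^2 above.
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    # Gram matrix of the Gaussian kernel between the rows of a and b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased estimate of gamma_k^2 from samples x ~ P, y ~ Q (n x d arrays)."""
    m, n = len(x), len(y)
    kxx = gaussian_gram(x, x, sigma)
    kyy = gaussian_gram(y, y, sigma)
    kxy = gaussian_gram(x, y, sigma)
    # Diagonal terms are excluded so that each expectation is estimated
    # from pairs of distinct samples.
    term_xx = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_yy = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * kxy.mean()

x = np.random.default_rng(0).normal(0.0, 1.0, size=(500, 2))
y = np.random.default_rng(1).normal(0.3, 1.0, size=(500, 2))
print(mmd2_unbiased(x, y))   # positive in expectation since P != Q
```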
3. Comparison with Classical Probability Metrics
Key probability distances for comparison include the total variation ($\mathrm{TV}$), Wasserstein ($W$), and Dudley ($\beta$) metrics.
- Strength order: $\gamma_k \preceq W$ and $\gamma_k \preceq \mathrm{TV}$ (up to kernel-dependent constants). The kernel metric is weaker than total variation and Wasserstein, but for universal kernels on compact spaces it metrizes the weak topology, matching Dudley.
- Bounds: For bounded $k$, $\gamma_k(P, Q) \le \sqrt{C}\,\|P - Q\|_{\mathrm{TV}}$, with $C = \sup_{x \in \mathcal{X}} k(x, x)$ (a numerical check follows this list).
- Weak topology metrization: If $k$ is universal and continuous (e.g., Gaussian on compact sets), $\gamma_k$ metrizes the topology of weak convergence.
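The following is a small numerical check (my own illustration) of the bound in the "Bounds" item above, for two discrete distributions on a shared finite support, where $\gamma_k$ is computable exactly from the Gram matrix.

```python
# Numerical check (illustration only) of gamma_k <= sqrt(C) * ||P - Q||_TV
# for discrete P, Q on a common finite support, with C = sup_x k(x, x).
import numpy as np

rng = np.random.default_rng(1)
points = rng.normal(size=(8, 2))        # 8 shared support points in R^2
p = rng.dirichlet(np.ones(8))           # probability weights of P
q = rng.dirichlet(np.ones(8))           # probability weights of Q

d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2)                   # Gaussian Gram matrix, k(x, x) = 1

diff = p - q
gamma = np.sqrt(diff @ K @ diff)        # exact gamma_k for discrete measures
tv_norm = np.abs(diff).sum()            # total variation norm of P - Q
C = K.diagonal().max()                  # sup_x k(x, x) (= 1 here)

print(gamma, np.sqrt(C) * tv_norm)      # the bound holds: gamma <= sqrt(C) * TV
```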
4. Theoretical and Practical Implications
- Statistical Testing: Characteristic kernels guarantee that two-sample (homogeneity) testing via MMD and independence testing via the Hilbert-Schmidt Independence Criterion (HSIC) are consistent: the population value of the test statistic is zero if and only if the null hypothesis holds (a test sketch follows this list).
- Feature Selection and Independence Detection: Embedding-based distances can serve as feature selection criteria and independence measures; provided the kernel is characteristic, all forms of dependence are captured.
- Learning on Structured Data: RKHS embeddings with characteristic kernels can be constructed for complex domains (e.g., graphs, strings), enabling generalization of kernel methods to structured distributions.
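As referenced in the statistical-testing item above, here is a hedged sketch of an MMD-based two-sample test; the permutation calibration is a standard choice for illustration, not necessarily the procedure of the source.

```python
# A hedged sketch of a permutation two-sample test based on the MMD^2
# statistic (resampling scheme and bandwidth are illustrative choices).
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2_biased(x, y, sigma=1.0):
    return (gaussian_gram(x, x, sigma).mean()
            + gaussian_gram(y, y, sigma).mean()
            - 2.0 * gaussian_gram(x, y, sigma).mean())

def mmd_permutation_test(x, y, n_perm=500, sigma=1.0, seed=0):
    """Return (observed MMD^2, permutation p-value) for H0: P = Q."""
    rng = np.random.default_rng(seed)
    observed = mmd2_biased(x, y, sigma)
    pooled, m = np.concatenate([x, y]), len(x)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        stat = mmd2_biased(pooled[perm[:m]], pooled[perm[m:]], sigma)
        count += stat >= observed
    return observed, (count + 1) / (n_perm + 1)

x = np.random.default_rng(2).normal(0.0, 1.0, size=(100, 1))
y = np.random.default_rng(3).normal(0.5, 1.0, size=(100, 1))
print(mmd_permutation_test(x, y))   # small p-value: the shift is detected
```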
5. Relation to Universal Kernels and Strict Positive Definiteness
Let $\mathcal{H}$ be the RKHS induced by $k$:
- Universal kernel: $\mathcal{H}$ is dense in a target function space (e.g., $C(\mathcal{X})$ for compact $\mathcal{X}$), and the mean embedding $\mu \mapsto \int_{\mathcal{X}} k(\cdot, x)\, d\mu(x)$ is injective on the space of all finite signed measures, not only probability measures.
- Hierarchy: Universal $\Rightarrow$ characteristic, but not conversely in general; in specific settings (e.g., bounded continuous radial or translation-invariant kernels on $\mathbb{R}^d$), universality and the characteristic property are equivalent.
- Strict positive definiteness: A strictly p.d. kernel satisfies $\sum_{i,j=1}^n c_i c_j k(x_i, x_j) > 0$ for all distinct $x_1, \dots, x_n$ and nonzero real coefficients $c_1, \dots, c_n$; restricted to coefficients summing to zero, this is exactly the condition for separating finitely supported probability measures, and its integral strengthening (ispd) guarantees the characteristic property in general.
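A brief numerical illustration (mine) of the distinction above: Gram matrices of the Gaussian kernel on distinct points are strictly positive definite, whereas $k(x, y) = \cos(x - y)$ has a rank-two feature map (and Fourier support $\{-1, +1\} \neq \mathbb{R}$), so its Gram matrices are singular for more than two points and the kernel is not characteristic.

```python
# Gaussian kernel: nonsingular Gram matrices on distinct points (strictly p.d.).
# Cosine kernel cos(x - y) = cos x cos y + sin x sin y: rank-2 feature map,
# hence singular Gram matrices for n > 2 points and not characteristic.
import numpy as np

x = np.linspace(-2.0, 2.0, 6)                           # 6 distinct points
gauss = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
cosk = np.cos(x[:, None] - x[None, :])

print(np.linalg.eigvalsh(gauss).min())   # strictly positive
print(np.linalg.eigvalsh(cosk).min())    # ~ 0 (up to floating point)
```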
6. Applications, Limitations, and Computational Aspects
Applications
- Dimensionality Reduction and Clustering: Kernel PCA and clustering using kernel mean embeddings (a clustering sketch follows this list).
- Nonparametric Density and Independence Testing: Consistent and computationally efficient estimation of distances between distributions from samples, applicable to high-dimensional or non-Euclidean spaces.
- Learning and Generative Modeling: Embedding, comparing, and optimizing over probability distributions as elements in an RKHS.
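As noted in the first item of this list, here is a hedged sketch of distribution clustering: each object is a bag of samples, pairwise distances are empirical MMDs, and standard hierarchical clustering is applied to the resulting distance matrix (the helpers and parameters are my own choices).

```python
# Illustrative sketch: cluster bags of samples by their pairwise MMDs.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def mmd(x, y, sigma=1.0):
    k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))
    return np.sqrt(max(k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean(), 0.0))

rng = np.random.default_rng(0)
bags = [rng.normal(0, 1, 200) for _ in range(3)] + \
       [rng.normal(0, 3, 200) for _ in range(3)]      # two groups of distributions

n = len(bags)
D = np.zeros((n, n))                                  # pairwise MMD matrix
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = mmd(bags[i], bags[j])

labels = fcluster(linkage(squareform(D), method="average"), t=2, criterion="maxclust")
print(labels)   # the two variance groups should separate
```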
Limitations
- With non-characteristic kernels, different distributions may map to the same embedding, causing statistical tests to lose power.
- Even with a characteristic kernel, two distinct distributions can be made arbitrarily close in kernel distance while differing markedly in tail or fine structure, particularly in low-sample regimes and for differences that are nearly invisible to the kernel's features (illustrated below).
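A brief illustration of the second limitation (my own construction): moving a tiny fraction of mass far into the tail changes the tail behaviour markedly but leaves the Gaussian-kernel MMD near the sampling noise floor.

```python
# q differs from p by sending 0.1% of its mass to +50 (a large change in
# tail behaviour), yet the estimated Gaussian-kernel MMD is comparable to
# the noise floor obtained with two samples from the same distribution.
import numpy as np

rng = np.random.default_rng(0)
k = lambda a, b, s=1.0: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * s ** 2))

def mmd(x, y):
    return np.sqrt(max(k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean(), 0.0))

n = 2000
p = rng.normal(size=n)
q = rng.normal(size=n)
q[:2] = 50.0                       # 0.1% of the mass sent far into the tail

baseline = mmd(rng.normal(size=n), rng.normal(size=n))   # P = Q noise floor
print("MMD(p, q):  ", mmd(p, q))
print("noise floor:", baseline)    # the two values are of similar size
```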
Implementation and Computation
- For bounded measurable kernels, the empirical (plug-in) estimator of $\gamma_k$ is straightforward to compute from Gram matrices and converges at the dimension-independent $O_p(n^{-1/2})$ rate (see the sketch after this list).
- The choice of kernel affects not only discriminative power but also the topology of convergence for learning algorithms.
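A hedged empirical check of the convergence claim above, with $P = Q$ so that the population value of $\gamma_k$ is zero; the bandwidth and biased plug-in estimator are my own choices.

```python
# With P = Q the population gamma_k is zero, and the plug-in estimate
# shrinks roughly like n^{-1/2}: it about halves when n quadruples.
import numpy as np

rng = np.random.default_rng(0)
k = lambda a, b, s=1.0: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * s ** 2))

def mmd(x, y):
    return np.sqrt(max(k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean(), 0.0))

for n in (100, 400, 1600):
    est = np.mean([mmd(rng.normal(size=n), rng.normal(size=n)) for _ in range(10)])
    print(n, round(est, 4))
```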
7. Summary Table: Characteristic Kernels and Related Properties
Property | Mathematical Condition | Practical Implication
---|---|---
Characteristic | Mean embedding $P \mapsto \mu_P$ is injective on probability measures | MMD/HSIC is a true metric; consistent testing
Universal | Mean embedding is injective on all finite signed measures ($\mathcal{H}$ dense in the target space) | Strongest: density, functional approximation, metrizes weak topology
Strictly positive definite | $\sum_{i,j} c_i c_j k(x_i, x_j) > 0$ for distinct $x_i$ and nonzero $(c_i)$ | Ensures injectivity on finitely supported measures (the integral version suffices for the characteristic property)
Conclusion
Characteristic kernels ensure that the RKHS embedding determines each probability distribution uniquely, which underpins a wide range of statistical and learning-theoretic applications. Their selection directly affects the consistency, computational tractability, and discriminative power of kernel-based statistical and machine learning methods for distributional data. A sound choice and understanding of characteristic kernels is therefore essential for principled, theoretically grounded, and practically robust algorithms in modern data analysis.