Kernel Quantile Embeddings and Associated Probability Metrics (2505.20433v1)

Published 26 May 2025 in stat.ML, cs.LG, math.ST, and stat.TH

Abstract: Embedding probability distributions into reproducing kernel Hilbert spaces (RKHS) has enabled powerful nonparametric methods such as the maximum mean discrepancy (MMD), a statistical distance with strong theoretical and computational properties. At its core, the MMD relies on kernel mean embeddings to represent distributions as mean functions in RKHS. However, it remains unclear if the mean function is the only meaningful RKHS representation. Inspired by generalised quantiles, we introduce the notion of kernel quantile embeddings (KQEs). We then use KQEs to construct a family of distances that: (i) are probability metrics under weaker kernel conditions than MMD; (ii) recover a kernelised form of the sliced Wasserstein distance; and (iii) can be efficiently estimated with near-linear cost. Through hypothesis testing, we show that these distances offer a competitive alternative to MMD and its fast approximations.

Summary

  • The paper introduces kernel quantile embeddings (KQEs) to represent probability distributions in RKHS, proving injectivity under milder conditions than conventional methods.
  • It develops new probability metrics—e-KQD and sup-KQD—that recover sliced Wasserstein distances and interpolate between MMD and sliced Wasserstein frameworks.
  • The work presents near-linear time estimators with rigorous theoretical guarantees, demonstrating competitive performance in high-dimensional two-sample hypothesis testing.

This paper introduces Kernel Quantile Embeddings (KQEs) as a novel way to represent probability distributions in a Reproducing Kernel Hilbert Space (RKHS), offering an alternative to the widely used Kernel Mean Embeddings (KMEs). While KMEs represent a distribution as the mean function in an RKHS, KQEs leverage the concept of directional quantiles of the feature map $x \mapsto k(x, \cdot)$. This approach is motivated by the fact that the set of all quantiles fully characterizes a probability distribution in one dimension.

The core idea is to first map data points from the input space $X$ into the RKHS $H$ via the kernel feature map $\psi(x) = k(x, \cdot)$. This transforms the probability measure $P$ on $X$ into a pushforward measure $\psi \# P$ on $H$. A KQE of $P$ for a given quantile level $\alpha \in [0, 1]$ and a direction $u$ in the unit sphere $S_H$ of the RKHS is defined as the $\alpha$-quantile of the projected measure $\phi_u \# (\psi \# P)$ along the direction $u$, where $\phi_u(h) = \langle u, h \rangle_H$ is the projection operator on $H$. This results in an element $\rho_P^{\alpha,u} \in H$ (Equation 6), defined via its evaluation function $\rho_P^{\alpha,u}(x) = \rho^\alpha_{u \# P}\, u(x)$, i.e. the scalar $\alpha$-quantile of the projected measure $u \# P$ multiplied by the direction $u$.
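
To make this concrete, the following is a minimal NumPy sketch of an empirical directional quantile: samples from $P$ are projected onto a fixed direction $u \in S_H$, here represented (as one convenient, illustrative choice) by coefficients over a set of reference points, and the empirical $\alpha$-quantile of the projected values is taken. The Gaussian kernel, the function names, and the parameterisation of $u$ are assumptions for illustration, not the paper's code.

import numpy as np

def gaussian_kernel(a, b, lengthscale=1.0):
    # Pairwise Gaussian kernel matrix with entries k(a_i, b_j).
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))

def empirical_directional_quantile(x, z, coeffs, alpha, lengthscale=1.0):
    # Empirical alpha-quantile of u#P_n for a direction u = f / ||f||_H,
    # where f = sum_i coeffs[i] * k(z_i, .) is parameterised by reference points z.
    K_zx = gaussian_kernel(z, x, lengthscale)   # (m, n) kernel matrix
    K_zz = gaussian_kernel(z, z, lengthscale)   # (m, m) kernel matrix
    f_x = coeffs @ K_zx                         # f(x_j) = sum_i c_i k(z_i, x_j)
    f_norm = np.sqrt(coeffs @ K_zz @ coeffs)    # ||f||_H
    u_x = f_x / f_norm                          # projections <u, psi(x_j)>_H
    return np.quantile(u_x, alpha)              # scalar quantile of u#P_n; the KQE is this scalar times u

# Usage sketch with synthetic data (illustrative values only).
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 2))   # samples from P
z = rng.normal(size=(10, 2))    # reference points defining the direction
c = rng.normal(size=10)         # arbitrary direction coefficients
print(empirical_directional_quantile(x, z, c, alpha=0.5))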

A key theoretical contribution is the demonstration that a kernel $k$ is "quantile-characteristic" (meaning the mapping $P \mapsto \{\rho_P^{\alpha,u} : \alpha \in [0, 1],\, u \in S_H\}$ is injective) under weaker conditions (a Hausdorff, separable, $\sigma$-compact input space $X$ and a continuous, separating kernel $k$) than those required for a kernel to be mean-characteristic (Theorems 1 and 2) (2505.20433). This has practical implications, as it means methods based on comparing KQEs can distinguish between a broader class of distributions than methods based on comparing KMEs, such as the Maximum Mean Discrepancy (MMD).

Based on KQEs, the paper proposes a family of probability metrics called Kernel Quantile Discrepancies (KQDs). Two primary types are introduced (Equation 9):

  1. Expected KQD (e-KQD): averages the difference between KQEs over directions $u \in S_H$ according to a measure $\gamma$: $\text{e-KQD}_p(P, Q; \nu, \gamma) = \left( E_{u \sim \gamma} \left[ \int_0^1 \big\| \rho_P^{\alpha,u} - \rho_Q^{\alpha,u} \big\|_H^p \, \nu(d\alpha) \right] \right)^{1/p}$.
  2. Supremum KQD (sup-KQD): takes the supremum of the difference between KQEs over directions $u \in S_H$: $\text{sup-KQD}_p(P, Q; \nu) = \left( \sup_{u \in S_H} \int_0^1 \big\| \rho_P^{\alpha,u} - \rho_Q^{\alpha,u} \big\|_H^p \, \nu(d\alpha) \right)^{1/p}$. Here, $\nu$ is a weighting measure on $[0, 1]$ over quantile levels $\alpha$. The paper shows that both e-KQD and sup-KQD are probability metrics under the same mild conditions as quantile-characteristic kernels (Theorem 4) (2505.20433). For a fixed direction $u$, the inner quantile integral has a one-dimensional Wasserstein interpretation, as sketched below.
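
Since $\rho_P^{\alpha,u} = \rho^\alpha_{u \# P}\, u$ with $\|u\|_H = 1$ (see the definition above), the RKHS norm inside the integral collapses to a scalar quantile difference; combined with the standard quantile-function representation of the one-dimensional $p$-Wasserstein distance (taking $\nu$ to be the Lebesgue measure), this gives, for each fixed direction $u$,

$$\big\| \rho_P^{\alpha,u} - \rho_Q^{\alpha,u} \big\|_H = \big| \rho^\alpha_{u \# P} - \rho^\alpha_{u \# Q} \big|, \qquad \int_0^1 \big| \rho^\alpha_{u \# P} - \rho^\alpha_{u \# Q} \big|^p \, d\alpha = W_p^p\big(u \# P, \, u \# Q\big).$$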

The paper establishes connections between KQDs and existing probability metrics:

  • When using a linear kernel $k(x, y) = x^\top y$ and taking $\nu$ as the Lebesgue measure, KQDs recover kernelized forms of the Sliced Wasserstein (SW) and Max-Sliced Wasserstein (max-SW) distances (Connections 1 and 2) (2505.20433).
  • Centered versions of KQDs relate to a sum of MMD and kernelized sliced Wasserstein distances, suggesting they can interpolate between MMD and SW (Connection 3) (2505.20433).

A significant practical contribution is the development of an efficient estimator for e-KQD, particularly for $\gamma$ being a Gaussian measure on $H$. Estimating the directional quantile $\rho_P^{\alpha,u}$ empirically involves computing the $\alpha$-quantile of $\{u(x_i)\}_{i=1}^n$ for samples $x_{1:n} \sim P$, which can be done efficiently using order statistics. The paper provides a consistency guarantee for this empirical KQE estimator (Theorem 3) (2505.20433), showing an $O(n^{-1/2})$ convergence rate under mild conditions.

The e-KQD estimator, presented in Algorithm 1, approximates the expectation over directions $u \sim \gamma$ using Monte Carlo sampling. To sample $u \in S_H$ from a Gaussian-induced measure $\gamma$, the paper leverages the fact that sampling from a Gaussian measure on $H$ with a specific integral covariance operator can be reduced to sampling a standard Gaussian in $\mathbb{R}^m$ and using samples $z_{1:m}$ from a reference measure $\xi$ on $X$ (Proposition 1) (2505.20433). The estimator then computes the quantile differences for each sampled direction and averages them.

Algorithm 1: Gaussian e-KQD Estimator (Simplified)
Input: data x_1:n ~ P, y_1:n ~ Q, reference samples z_1:m ~ xi, kernel k, density f_nu of nu, number of projections l, power p.
Initialize e-KQD^p = 0
For i = 1 to l:
  Sample lambda_1:m ~ N(0, Id_m)                                          # Gaussian weights for the random direction (Proposition 1)
  Compute f_i_x = lambda_1:m^T k(z_1:m, x_1:n) / sqrt(m)                  # values f_i(x_j), i.e. u_i(x_j) up to scale
  Compute f_i_y = lambda_1:m^T k(z_1:m, y_1:n) / sqrt(m)                  # values f_i(y_j), i.e. u_i(y_j) up to scale
  Compute ||f_i||_H = sqrt(lambda_1:m^T k(z_1:m, z_1:m) lambda_1:m / m)   # RKHS norm of f_i
  Compute u_i_x = f_i_x / ||f_i||_H                                       # projected values u_i(x_j)
  Compute u_i_y = f_i_y / ||f_i||_H                                       # projected values u_i(y_j)
  Sort u_i_x and u_i_y to obtain order statistics [u_i(x_1:n)]_j and [u_i(y_1:n)]_j
  Initialize tau_i^p = 0
  For j = 1 to n:
    tau_i^p += |[u_i(x_1:n)]_j - [u_i(y_1:n)]_j|^p * f_nu(j/n) / n        # Riemann approximation of the nu-weighted quantile integral
  e-KQD^p += tau_i^p / l                                                  # Monte Carlo average over directions
Return (e-KQD^p)^(1/p)
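
For concreteness, here is a minimal NumPy sketch of the estimator above; it is not the authors' implementation. The Gaussian (RBF) kernel, the treatment of the quantile weighting as a Riemann sum against the density f_nu, the equal sample sizes, and all names and defaults are illustrative assumptions.

import numpy as np
from scipy.spatial.distance import cdist

def rbf(a, b, lengthscale=1.0):
    # Gaussian (RBF) kernel matrix with entries k(a_i, b_j).
    return np.exp(-cdist(a, b, "sqeuclidean") / (2.0 * lengthscale ** 2))

def gaussian_ekqd(x, y, z, f_nu=lambda a: 1.0, n_proj=20, p=2, lengthscale=1.0, seed=0):
    # Monte Carlo sketch of the Gaussian e-KQD estimator (Algorithm 1 above).
    # x, y: (n, d) samples from P and Q (equal sample sizes assumed here);
    # z: (m, d) reference samples from xi; f_nu: density of nu on [0, 1].
    rng = np.random.default_rng(seed)
    n, m = x.shape[0], z.shape[0]
    K_zx, K_zy, K_zz = rbf(z, x, lengthscale), rbf(z, y, lengthscale), rbf(z, z, lengthscale)
    levels = np.arange(1, n + 1) / n                    # quantile levels j/n
    weights = np.vectorize(f_nu)(levels) / n            # Riemann weights for the alpha-integral
    ekqd_p = 0.0
    for _ in range(n_proj):
        lam = rng.standard_normal(m)                    # lambda_1:m ~ N(0, Id_m)
        f_x = lam @ K_zx / np.sqrt(m)                   # u_i(x_j) up to scale
        f_y = lam @ K_zy / np.sqrt(m)                   # u_i(y_j) up to scale
        f_norm = np.sqrt(lam @ K_zz @ lam / m)          # ||f_i||_H
        u_x = np.sort(f_x / f_norm)                     # order statistics of projections of x
        u_y = np.sort(f_y / f_norm)                     # order statistics of projections of y
        ekqd_p += np.sum(np.abs(u_x - u_y) ** p * weights) / n_proj
    return ekqd_p ** (1.0 / p)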

The computational complexity of this Gaussian e-KQD estimator is analyzed. With $l$ projections and $m$ reference samples, computing the projected values $u_i(x_{1:n})$ and $u_i(y_{1:n})$ takes $O(nm)$ time, computing the norm $\|f_i\|_H$ takes $O(m^2)$, and sorting takes $O(n \log n)$. Summing over the $l$ projections gives a total complexity of $O(l \max(nm, m^2, n \log n))$. By setting $l = m = O(\log n)$, the complexity becomes $O(n \log^2 n)$, which is near-linear in $n$. This is significantly more efficient than the $O(n^2)$ complexity of standard U-statistic MMD estimators or the $O(T n \log n)$ cost of optimizing max-SW/max-GSW, though generally slower than the $O(n)$ MMD-Linear estimator. The paper also provides a finite-sample consistency guarantee for the empirical e-KQD estimator, showing an $O(l^{-1/2} + n^{-1/2})$ rate (Theorem 5) (2505.20433).
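
As a usage sketch of the function above (with illustrative synthetic data), the near-linear regime corresponds to choosing the number of projections and reference samples on the order of $\log n$:

n, d = 2000, 10
rng = np.random.default_rng(1)
x = rng.normal(size=(n, d))                 # samples from P
y = rng.normal(loc=0.2, size=(n, d))        # samples from Q (shifted mean)
l = m = int(np.ceil(np.log(n)))             # l = m = O(log n) gives O(n log^2 n) overall cost
z = rng.normal(size=(m, d))                 # reference samples from xi
print(gaussian_ekqd(x, y, z, n_proj=l, p=2))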

The paper evaluates the proposed KQDs in the practical application of nonparametric two-sample hypothesis testing, comparing their performance (measured by rejection rate) against MMD and its fast approximations on synthetic and real-world datasets.

  • Power-decay experiment: e-KQD demonstrates better robustness to increasing dimensionality compared to MMD-Multi (a fast MMD approximation of similar complexity).
  • Laplace vs. Gaussian experiment: Using a polynomial kernel (which is not mean-characteristic but is quantile-characteristic), KQDs successfully distinguish between a Gaussian and a Laplace distribution with matching low-order moments, while MMD fails. This empirically verifies the theoretical finding on weaker characteristic conditions.
  • Real-world image data (Galaxy MNIST, CIFAR): On high-dimensional image data, the near-linear time e-KQD and sup-KQD estimators are competitive with or outperform fast MMD estimators of similar complexity. The quadratic-time centered e-KQD performs similarly to quadratic-time MMD.

The experimental results highlight that KQDs offer a compelling alternative to MMD for two-sample testing, providing competitive performance, particularly in high dimensions and scenarios where the kernel might not be mean-characteristic, while enabling efficient estimation. Future work could explore optimizing the choice of the weighting measure $\nu$ and reference measure $\xi$, developing improved estimators for KQEs, and extending the concepts to conditional settings.