
Sliced Wasserstein Kernels

Updated 21 November 2025
  • Sliced Wasserstein Kernels are positive-definite kernels derived by averaging one-dimensional Wasserstein distances over projections, capturing the geometric structure of distributions.
  • They are constructed using a distance-substitution approach with Gaussian or Laplacian functions on Hilbertian metrics, ensuring both computational efficiency and universality.
  • Empirical results show these kernels excel in tasks like image histogram classification, graph learning, and audio captioning, offering improved accuracy and robustness.

Sliced Wasserstein kernels are a family of positive-definite kernels constructed from the Sliced Wasserstein (SW) distance—a measure derived from optimal transport that computes the average one-dimensional Wasserstein distance between projections (“slices”) of high-dimensional probability measures. They inherit geometric sensitivity from the Wasserstein distance, computational efficiency via closed-form 1D transport, and universality on the space of probability distributions, enabling effective learning over distributional inputs in kernel-based machine learning frameworks.

1. Definition and Mathematical Foundations

Let $\mu,\nu$ be probability measures on a compact domain $\Omega\subset\mathbb{R}^d$. For each direction $\theta\in S^{d-1}$ (the unit sphere), define the push-forward projections $\mu_\theta = \theta^*_{\sharp}\mu$, where $\theta^*(x) = \langle\theta, x\rangle$. The one-dimensional $p$-Wasserstein distance between $\mu_\theta$ and $\nu_\theta$ admits the explicit formula
$$W_p(\mu_\theta,\nu_\theta) = \big\|F_{\mu_\theta}^{[-1]} - F_{\nu_\theta}^{[-1]}\big\|_{L^p([0,1])},$$
where $F_\mu^{[-1]}$ is the generalized inverse CDF (quantile function) of $\mu$.
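As an illustration, the following minimal NumPy sketch evaluates this closed-form 1D distance (for $p=2$) between two projected empirical measures; the helper name `projected_w2` and the uniform quantile grid are illustrative choices, not taken from the cited papers.

```python
import numpy as np

def projected_w2(x, y, theta, n_quantiles=100):
    """Closed-form 1D Wasserstein-2 distance between the projections of two
    empirical measures onto a direction theta (hypothetical helper).

    x: (n, d) samples of mu;  y: (m, d) samples of nu;  theta: (d,) unit vector.
    """
    # Push-forward by <theta, .>: project the samples onto the direction.
    x_proj = x @ theta
    y_proj = y @ theta
    # Evaluate both quantile functions on a uniform grid of [0, 1] and take
    # the discretized L^2 norm of their difference.
    t = (np.arange(n_quantiles) + 0.5) / n_quantiles
    qx = np.quantile(x_proj, t)
    qy = np.quantile(y_proj, t)
    return np.sqrt(np.mean((qx - qy) ** 2))
```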

The Sliced Wasserstein distance of order $p$ is then

$$d_{\mathrm{SW},p}(\mu, \nu)^p = \int_{S^{d-1}} W_p(\mu_\theta, \nu_\theta)^p\, d\sigma(\theta),$$

where $\sigma$ is the uniform measure on $S^{d-1}$.
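Building on the helper above, a minimal Monte Carlo sketch of $d_{\mathrm{SW},2}$ that averages squared 1D distances over random directions (function and parameter names are illustrative):

```python
import numpy as np

def sliced_wasserstein_2(x, y, n_projections=100, rng=None):
    """Monte Carlo estimate of d_SW,2 between two empirical measures given by
    sample arrays x (n, d) and y (m, d); reuses projected_w2 sketched above."""
    rng = np.random.default_rng(rng)
    d = x.shape[1]
    # Uniform directions on S^{d-1}: normalize standard Gaussian draws.
    thetas = rng.standard_normal((n_projections, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    # Average the squared 1D distances over the slices, then take the root.
    sq = [projected_w2(x, y, theta) ** 2 for theta in thetas]
    return np.sqrt(np.mean(sq))
```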

For $p=2$, the feature map

$$\Phi(\mu)\colon (t, \theta) \mapsto F_{\mu_\theta}^{[-1]}(t) \in L^2([0,1]\times S^{d-1})$$

exhibits the Hilbertian structure
$$\|\Phi(\mu) - \Phi(\nu)\|^2 = d_{\mathrm{SW},2}(\mu,\nu)^2.$$
Thus $d_{\mathrm{SW},2}$ is a genuine metric metrizing weak convergence on $M_+^1(\Omega)$ and allows for the construction of positive-definite (p.d.) kernels (Meunier et al., 2022, Kolouri et al., 2015).

2. Construction and Universality of Sliced Wasserstein Kernels

A distance-substitution kernel is constructed using a monotonically decreasing function $q$ (e.g., Gaussian/Laplacian RBF) and a Hilbertian distance $d$: $K(\mu, \nu) = q(d(\mu,\nu))$. For SW, this leads to the Gaussian- and Laplacian-style kernels
$$K_1(\mu, \nu) = \exp\!\big(-\gamma\, d_{\mathrm{SW},1}(\mu,\nu)\big), \qquad K_2(\mu, \nu) = \exp\!\big(-\gamma\, d_{\mathrm{SW},2}(\mu,\nu)^2\big), \qquad \gamma > 0.$$
Schoenberg’s theorem guarantees positive definiteness when $d$ is (conditionally) negative-definite. By the Hilbertian property of $d_{\mathrm{SW},p}$, the resultant kernels are p.d.
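A short sketch of the resulting Gaussian-type distance-substitution kernel, assuming $d_{\mathrm{SW},2}$ is estimated as in the sketch above (names are illustrative):

```python
import numpy as np

def sw_gaussian_kernel(x, y, gamma=1.0, **sw_kwargs):
    """Gaussian-type SW kernel K_2(mu, nu) = exp(-gamma * d_SW,2(mu, nu)^2),
    with d_SW,2 estimated by the sliced_wasserstein_2 sketch above."""
    return np.exp(-gamma * sliced_wasserstein_2(x, y, **sw_kwargs) ** 2)
```

The Laplacian-style kernel $K_1$ follows analogously by substituting the unsquared order-1 distance $d_{\mathrm{SW},1}$.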

When the underlying space $X$ is compact and $d_{\mathrm{SW},p}$ metrizes weak convergence, the corresponding Gaussian-type kernel is universal: its RKHS is dense in $C(X)$, enabling universal consistency in regression and classification (Meunier et al., 2022).

3. Algorithmic Implementation and Computational Considerations

Approximating $d_{\mathrm{SW},p}$ involves:

  • Sampling $M$ directions $\theta_1,\dots,\theta_M \sim \mathrm{Unif}(S^{d-1})$
  • Sampling $N$ grid points $t_1,\dots,t_N \sim \mathrm{Unif}[0,1]$

For a measure $\mu$, the feature vector in $\mathbb{R}^{M \times N}$ is
$$\hat\Phi_{M,N}(\mu) = \left(F_{\mu_{\theta_m}}^{[-1]}(t_\ell)\right)_{m=1,\dots,M;\;\ell=1,\dots,N}.$$
Computing SW kernel matrix entries involves sorting projected samples (cost $O(n \log n)$ per direction, for empirical measures with $n$ points), feature evaluation ($O(MN)$), and matrix evaluation ($O(T^2 M N)$ for $T$ distributions).
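A sketch of the discretized feature embedding $\hat\Phi_{M,N}$ and the induced Gaussian Gram matrix; the half-integer quantile grid and the $1/\sqrt{MN}$ scaling are illustrative implementation choices, not prescribed by the cited papers.

```python
import numpy as np

def sw_feature_map(samples_list, M=100, N=100, rng=None):
    """Finite-dimensional quantile features hat-Phi_{M,N} for a list of
    empirical measures (each an (n_i, d) sample array).

    Scaled by 1/sqrt(M*N) so that squared Euclidean distances between rows
    approximate d_SW,2^2 (double average over directions and quantile levels).
    """
    rng = np.random.default_rng(rng)
    d = samples_list[0].shape[1]
    thetas = rng.standard_normal((M, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    t = (np.arange(N) + 0.5) / N
    feats = []
    for x in samples_list:
        proj = x @ thetas.T                    # (n_i, M) projected samples
        quants = np.quantile(proj, t, axis=0)  # (N, M) quantiles per direction
        feats.append(quants.T.ravel())         # flatten to (M * N,)
    return np.stack(feats) / np.sqrt(M * N)

def sw_gaussian_gram(features, gamma=1.0):
    """Gaussian SW kernel Gram matrix K_2 from the feature embedding."""
    sq_dists = ((features[:, None, :] - features[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)
```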

Empirically, setting $M,N\approx 50$–$200$ yields robust estimates. The kernel bandwidth $\gamma$ is typically tuned via cross-validation (Meunier et al., 2022, Kolouri et al., 2015, Luong et al., 8 Feb 2025).

Monte Carlo integration is necessary; the kernel approximation error decreases as $O(M^{-1/2})$. Unbiased variants further reduce estimation bias by averaging exponentials, enabling direct unbiased stochastic optimization (Luong et al., 8 Feb 2025).
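For illustration only, the sketch below estimates a slice-averaged kernel $\mathbb{E}_\theta\!\big[\exp(-\gamma\, W_2(\mu_\theta,\nu_\theta)^2)\big]$ by averaging exponentials over independent slices; this is an assumption-laden stand-in for the idea of unbiased averaging, not necessarily the estimator of Luong et al.

```python
import numpy as np

def slice_averaged_kernel_mc(x, y, gamma=1.0, L=50, rng=None):
    """Unbiased Monte Carlo estimate of the slice-averaged kernel
    E_theta[exp(-gamma * W_2(mu_theta, nu_theta)^2)], an average of p.d.
    kernels and hence itself p.d.

    NOTE: averaging exponentials per slice is unbiased for this target, NOT
    for exp(-gamma * d_SW,2^2); this is an illustrative construction only.
    """
    rng = np.random.default_rng(rng)
    d = x.shape[1]
    thetas = rng.standard_normal((L, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    vals = [np.exp(-gamma * projected_w2(x, y, th) ** 2) for th in thetas]
    return np.mean(vals)
```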

4. Theoretical Guarantees: Consistency and Excess Risk

For kernel ridge regression (KRR) with SW kernels on empirical measures $\{\hat{P}_{t,n_t}\}$ and associated responses $y_t$:
$$\min_{f\in H_K}\ \frac{1}{T}\sum_{t=1}^T \big(f(\hat{P}_{t,n_t}) - y_t\big)^2 + \lambda \|f\|_{H_K}^2,$$
the estimator interpolates via the Gram matrix $K_T$ with entries $K(\hat{P}_{t,n_t}, \hat{P}_{s,n_s})$.
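A minimal sketch of the corresponding KRR fit-and-predict step, assuming precomputed SW Gram matrices (the closed-form ridge solution is standard and not specific to the cited papers):

```python
import numpy as np

def sw_krr_fit_predict(K_train, y_train, K_test_train, lam=1e-3):
    """Kernel ridge regression with precomputed SW Gram matrices.

    K_train:      (T, T) Gram matrix over training bags.
    y_train:      (T,) responses.
    K_test_train: (S, T) kernel values between test and training bags.
    """
    T = K_train.shape[0]
    # Closed-form solution of the objective above: alpha = (K + T*lam*I)^{-1} y.
    alpha = np.linalg.solve(K_train + T * lam * np.eye(T), y_train)
    return K_test_train @ alpha
```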

Under the assumption that $K$ is universal and Hölder-continuous with respect to $d_{\mathrm{SW},p}$, the excess risk admits a bound that separates

  • Stage-1 sampling complexity ($T$ bags)
  • Stage-2 sampling complexity ($n$ samples per bag)
  • The effective kernel dimension, and
  • The regularization parameter $\lambda$

Concretely, if $\alpha(n) = \mathbb{E}[d(\mu,\hat{\mu}_n)^2] = O(n^{-\beta})$ and the kernel/Lipschitz constants are suitably chosen,

$$\mathbb{E}\big[\|f_{\hat D,\lambda}-f_\rho\|\big] = O\!\big(T^{-1/4} + n^{-(h\beta)/4}\big).$$

Universal consistency follows as $T, n \rightarrow \infty$ (Meunier et al., 2022).

5. Empirical Performance and Use Cases

SW kernels outperform MMD-based and standard RBF kernels in synthetic and real-world tasks where geometric structure is salient:

  • Mode-counting (synthetic Gaussians): SW kernels yield 20–30% lower RMSE than MMD on predicting cluster count from samples.
  • Image histogram classification: On MNIST and Fashion-MNIST (raw and perturbed), $K_{\mathrm{SW}_2}$ achieves $\approx 93\%$ accuracy vs. $\approx 79\%$ (MMD) and $\approx 90\%$ (RBF); under transformations, SW kernels maintain higher accuracy and robustness (Meunier et al., 2022).
  • Graph learning: Sliced Wasserstein Weisfeiler-Lehman (SWWL) graph kernels allow for positive-definite, scalable, and accurate graph similarity measures, handling datasets with $10^4$–$10^5$ nodes efficiently (Perez et al., 2024).
  • Audio captioning: Unbiased SW-RBF kernels with temporal augmentations improve alignment and generation fidelity under stochastic sampling (Luong et al., 8 Feb 2025).

SW kernels are especially effective when mass-transport geometry correlates with the learning target, e.g., mode structure, geometric invariance, and robustness to perturbations.

6. Variants and Extensions

Max-Sliced and Tree-Sliced Wasserstein Kernels

  • Max-sliced Wasserstein replaces the averaging over directions by maximization—optimizing a (potentially nonlinear) projection to maximize the 1D Wasserstein distance between the projected measures. While achieving sharper discrimination and dimension-free statistical rates, the projection optimization is NP-hard for $p=2$ but admits tight semidefinite relaxations (Wang et al., 2024); a crude random-search baseline is sketched after this list.
  • Tree-sliced Wasserstein kernels generalize the slice to arbitrary random tree metrics rather than lines (chains), averaging per-tree closed-form transports. This allows for flexible embedding of geometric structure and scalable, positive-definite kernels (Le et al., 2019).
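For intuition, a crude random-search baseline for the max-sliced distance (not the semidefinite relaxation of Wang et al.; names and candidate counts are illustrative):

```python
import numpy as np

def max_sliced_w2_random_search(x, y, n_candidates=500, rng=None):
    """Crude approximation of the max-sliced Wasserstein-2 distance: evaluate
    many random directions and keep the best (a baseline heuristic only, not
    the semidefinite relaxation of Wang et al.)."""
    rng = np.random.default_rng(rng)
    d = x.shape[1]
    thetas = rng.standard_normal((n_candidates, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    return max(projected_w2(x, y, th) for th in thetas)
```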

Unbiased and Monte-Carlo Approximations

  • The construction of unbiased estimators for the kernel value via Monte Carlo sampling is crucial for stochastic gradient optimization and for reducing estimator variance, with convergence rate $O(L^{-1/2})$ in the number $L$ of samples (Luong et al., 8 Feb 2025).

7. Practical Guidelines and Limitations

When to use SW kernels:

  • For regression or classification where output depends on underlying geometry of distributions or empirical measures.
  • For problems requiring robustness to support mismatch or geometric transformations (e.g., histogram classification, graph regression, multi-modality).
  • When computational efficiency is critical: 1D transport is closed-form and scalable.

Limitations:

  • MC estimation introduces additional stochastic error (i.e., three-stage sampling in distribution regression: bag, within-bag, and MC).
  • Convergence rates for distribution regression under SW are slower in worst-case theory ($n^{-1/4}$) compared to MMD ($n^{-1/2}$), but empirical geometry often compensates.
  • Exact SW kernel evaluation is infeasible in very high-dimensional settings unless $M$, $N$ are chosen adaptively.

Parameter tuning:

  • Choose $M,N$ (projections/quantiles) to balance accuracy and efficiency.
  • Kernel bandwidths (e.g., $\gamma$) via cross-validation or the median heuristic (a minimal sketch follows this list).
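A minimal sketch of the median heuristic, assuming a precomputed matrix of pairwise SW distances; the $1/(2\,\mathrm{median}^2)$ convention is one common choice, not mandated by the cited papers.

```python
import numpy as np

def median_heuristic_gamma(sw_dist_matrix):
    """Bandwidth gamma from the median of the off-diagonal pairwise SW
    distances, using the common 1 / (2 * median^2) convention; cross-validation
    can refine the value."""
    T = sw_dist_matrix.shape[0]
    off_diag = sw_dist_matrix[~np.eye(T, dtype=bool)]
    med = np.median(off_diag)
    return 1.0 / (2.0 * med ** 2)
```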

Sliced Wasserstein kernels fundamentally bridge geometric optimal transport and positive-definite kernel-based learning, providing a computationally tractable class of universal kernels well-suited for distributional data (Meunier et al., 2022, Kolouri et al., 2015, Perez et al., 2024, Luong et al., 8 Feb 2025, Le et al., 2019).
