
Sliced Wasserstein Kernels

Updated 21 November 2025
  • Sliced Wasserstein Kernels are positive-definite kernels derived by averaging one-dimensional Wasserstein distances over projections, capturing the geometric structure of distributions.
  • They are constructed using a distance-substitution approach with Gaussian or Laplacian functions on Hilbertian metrics, ensuring both computational efficiency and universality.
  • Empirical results show these kernels excel in tasks like image histogram classification, graph learning, and audio captioning, offering improved accuracy and robustness.

Sliced Wasserstein kernels are a family of positive-definite kernels constructed from the Sliced Wasserstein (SW) distance—a measure derived from optimal transport that computes the average one-dimensional Wasserstein distance between projections (“slices”) of high-dimensional probability measures. They inherit geometric sensitivity from the Wasserstein distance, computational efficiency via closed-form 1D transport, and universality on the space of probability distributions, enabling effective learning over distributional inputs in kernel-based machine learning frameworks.

1. Definition and Mathematical Foundations

Let $\mu,\nu$ be probability measures on a compact domain $\Omega\subset\mathbb{R}^d$. For each direction $\theta\in S^{d-1}$ (the unit sphere), define the push-forward projections $\mu_\theta = \theta^*_{\sharp}\mu$, where $\theta^*(x) = \langle\theta, x\rangle$. The one-dimensional $p$-Wasserstein distance between $\mu_\theta$ and $\nu_\theta$ admits the explicit formula
$$W_p(\mu_\theta,\nu_\theta) = \big\|F_{\mu_\theta}^{[-1]} - F_{\nu_\theta}^{[-1]}\big\|_{L^p([0,1])},$$
where $F_\mu^{[-1]}$ is the generalized inverse CDF (quantile function) of $\mu$.
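As an illustration, the following minimal NumPy sketch evaluates this closed-form 1D distance (for $p=2$) between two projected empirical measures; the helper name `projected_w2` and the uniform quantile grid are illustrative choices, not taken from the cited papers.

```python
import numpy as np

def projected_w2(x, y, theta, n_quantiles=100):
    """Closed-form 1D Wasserstein-2 distance between the projections of two
    empirical measures onto a direction theta (hypothetical helper).

    x: (n, d) samples of mu;  y: (m, d) samples of nu;  theta: (d,) unit vector.
    """
    # Push-forward by <theta, .>: project the samples onto the direction.
    x_proj = x @ theta
    y_proj = y @ theta
    # Evaluate both quantile functions on a uniform grid of [0, 1] and take
    # the discretized L^2 norm of their difference.
    t = (np.arange(n_quantiles) + 0.5) / n_quantiles
    qx = np.quantile(x_proj, t)
    qy = np.quantile(y_proj, t)
    return np.sqrt(np.mean((qx - qy) ** 2))
```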

The Sliced Wasserstein distance of order $p$ is then

$$d_{\mathrm{SW},p}(\mu, \nu)^p = \int_{S^{d-1}} W_p(\mu_\theta, \nu_\theta)^p\, d\sigma(\theta),$$

where $\sigma$ is the uniform measure on $S^{d-1}$.
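Building on the helper above, a minimal Monte Carlo sketch of $d_{\mathrm{SW},2}$ that averages squared 1D distances over random directions (function and parameter names are illustrative):

```python
import numpy as np

def sliced_wasserstein_2(x, y, n_projections=100, rng=None):
    """Monte Carlo estimate of d_SW,2 between two empirical measures given by
    sample arrays x (n, d) and y (m, d); reuses projected_w2 sketched above."""
    rng = np.random.default_rng(rng)
    d = x.shape[1]
    # Uniform directions on S^{d-1}: normalize standard Gaussian draws.
    thetas = rng.standard_normal((n_projections, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    # Average the squared 1D distances over the slices, then take the root.
    sq = [projected_w2(x, y, theta) ** 2 for theta in thetas]
    return np.sqrt(np.mean(sq))
```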

For $p=2$, the feature map

$$\Phi(\mu)\colon (t, \theta) \mapsto F_{\mu_\theta}^{[-1]}(t) \in L^2([0,1]\times S^{d-1})$$

exhibits the Hilbertian structure
$$\|\Phi(\mu) - \Phi(\nu)\|^2 = d_{\mathrm{SW},2}(\mu,\nu)^2.$$
Thus $d_{\mathrm{SW},2}$ is a genuine metric metrizing weak convergence on $M_+^1(\Omega)$ and allows for the construction of positive-definite (p.d.) kernels (Meunier et al., 2022, Kolouri et al., 2015).

2. Construction and Universality of Sliced Wasserstein Kernels

A distance-substitution kernel is constructed using a monotonically decreasing function $q$ (e.g., Gaussian/Laplacian RBF) and a Hilbertian distance $d$: $K(\mu, \nu) = q(d(\mu,\nu))$. For SW, this leads to the Gaussian- and Laplacian-style kernels
$$K_1(\mu, \nu) = \exp\!\big(-\gamma\, d_{\mathrm{SW},1}(\mu,\nu)\big), \qquad K_2(\mu, \nu) = \exp\!\big(-\gamma\, d_{\mathrm{SW},2}(\mu,\nu)^2\big), \qquad \gamma > 0.$$
Schoenberg’s theorem guarantees positive definiteness when $d$ is (conditionally) negative-definite. By the Hilbertian property of $d_{\mathrm{SW},p}$, the resultant kernels are p.d.
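A short sketch of the resulting Gaussian-type distance-substitution kernel, assuming $d_{\mathrm{SW},2}$ is estimated as in the sketch above (names are illustrative):

```python
import numpy as np

def sw_gaussian_kernel(x, y, gamma=1.0, **sw_kwargs):
    """Gaussian-type SW kernel K_2(mu, nu) = exp(-gamma * d_SW,2(mu, nu)^2),
    with d_SW,2 estimated by the sliced_wasserstein_2 sketch above."""
    return np.exp(-gamma * sliced_wasserstein_2(x, y, **sw_kwargs) ** 2)
```

The Laplacian-style kernel $K_1$ follows analogously by substituting the unsquared order-1 distance $d_{\mathrm{SW},1}$.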

When the underlying space $X$ is compact and $d_{\mathrm{SW},p}$ metrizes weak convergence, the corresponding Gaussian-type kernel is universal: its RKHS is dense in $C(X)$, enabling universal consistency in regression and classification (Meunier et al., 2022).

3. Algorithmic Implementation and Computational Considerations

Approximating $d_{\mathrm{SW},p}$ involves:

  • Sampling $M$ directions $\theta_1,\dots,\theta_M \sim \mathrm{Unif}(S^{d-1})$
  • Sampling $N$ grid points $t_1,\dots,t_N \sim \mathrm{Unif}[0,1]$

For a measure $\mu$, the feature vector in $\mathbb{R}^{M \times N}$ is
$$\hat\Phi_{M,N}(\mu) = \left(F_{\mu_{\theta_m}}^{[-1]}(t_\ell)\right)_{m=1,\dots,M;\;\ell=1,\dots,N}.$$
Computing SW kernel matrix entries involves sorting projected samples (cost $O(n \log n)$ per direction, for empirical measures with $n$ points), feature evaluation ($O(MN)$), and matrix evaluation ($O(T^2 M N)$ for $T$ distributions).
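A sketch of the discretized feature embedding $\hat\Phi_{M,N}$ and the induced Gaussian Gram matrix; the half-integer quantile grid and the $1/\sqrt{MN}$ scaling are illustrative implementation choices, not prescribed by the cited papers.

```python
import numpy as np

def sw_feature_map(samples_list, M=100, N=100, rng=None):
    """Finite-dimensional quantile features hat-Phi_{M,N} for a list of
    empirical measures (each an (n_i, d) sample array).

    Scaled by 1/sqrt(M*N) so that squared Euclidean distances between rows
    approximate d_SW,2^2 (double average over directions and quantile levels).
    """
    rng = np.random.default_rng(rng)
    d = samples_list[0].shape[1]
    thetas = rng.standard_normal((M, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    t = (np.arange(N) + 0.5) / N
    feats = []
    for x in samples_list:
        proj = x @ thetas.T                    # (n_i, M) projected samples
        quants = np.quantile(proj, t, axis=0)  # (N, M) quantiles per direction
        feats.append(quants.T.ravel())         # flatten to (M * N,)
    return np.stack(feats) / np.sqrt(M * N)

def sw_gaussian_gram(features, gamma=1.0):
    """Gaussian SW kernel Gram matrix K_2 from the feature embedding."""
    sq_dists = ((features[:, None, :] - features[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)
```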

Empirically, setting $M,N\approx 50$–$200$ yields robust estimates. The kernel bandwidth $\gamma$ is typically tuned via cross-validation (Meunier et al., 2022, Kolouri et al., 2015, Luong et al., 8 Feb 2025).

Monte Carlo integration is necessary; the kernel approximation error decreases as $O(M^{-1/2})$. Unbiased variants further reduce estimation bias by averaging exponentials, enabling direct unbiased stochastic optimization (Luong et al., 8 Feb 2025).
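For illustration only, the sketch below estimates a slice-averaged kernel $\mathbb{E}_\theta\!\big[\exp(-\gamma\, W_2(\mu_\theta,\nu_\theta)^2)\big]$ by averaging exponentials over independent slices; this is an assumption-laden stand-in for the idea of unbiased averaging, not necessarily the estimator of Luong et al.

```python
import numpy as np

def slice_averaged_kernel_mc(x, y, gamma=1.0, L=50, rng=None):
    """Unbiased Monte Carlo estimate of the slice-averaged kernel
    E_theta[exp(-gamma * W_2(mu_theta, nu_theta)^2)], an average of p.d.
    kernels and hence itself p.d.

    NOTE: averaging exponentials per slice is unbiased for this target, NOT
    for exp(-gamma * d_SW,2^2); this is an illustrative construction only.
    """
    rng = np.random.default_rng(rng)
    d = x.shape[1]
    thetas = rng.standard_normal((L, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    vals = [np.exp(-gamma * projected_w2(x, y, th) ** 2) for th in thetas]
    return np.mean(vals)
```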

4. Theoretical Guarantees: Consistency and Excess Risk

For kernel ridge regression (KRR) with SW kernels on empirical measures $\{\hat{P}_{t,n_t}\}$ and associated responses $y_t$:
$$\min_{f\in H_K}\ \frac{1}{T}\sum_{t=1}^T \big(f(\hat{P}_{t,n_t}) - y_t\big)^2 + \lambda \|f\|_{H_K}^2,$$
the estimator interpolates via the Gram matrix $K_T$ with entries $K(\hat{P}_{t,n_t}, \hat{P}_{s,n_s})$.
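A minimal sketch of the corresponding KRR fit-and-predict step, assuming precomputed SW Gram matrices (the closed-form ridge solution is standard and not specific to the cited papers):

```python
import numpy as np

def sw_krr_fit_predict(K_train, y_train, K_test_train, lam=1e-3):
    """Kernel ridge regression with precomputed SW Gram matrices.

    K_train:      (T, T) Gram matrix over training bags.
    y_train:      (T,) responses.
    K_test_train: (S, T) kernel values between test and training bags.
    """
    T = K_train.shape[0]
    # Closed-form solution of the objective above: alpha = (K + T*lam*I)^{-1} y.
    alpha = np.linalg.solve(K_train + T * lam * np.eye(T), y_train)
    return K_test_train @ alpha
```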

Under the assumption that $K$ is universal and Hölder-continuous with respect to $d_{\mathrm{SW},p}$, the excess risk admits a bound that separates

  • Stage-1 sampling complexity ($T$ bags)
  • Stage-2 sampling complexity ($n$ samples per bag)
  • The effective kernel dimension, and
  • The regularization parameter $\lambda$

Concretely, if $\alpha(n) = \mathbb{E}[d(\mu,\hat{\mu}_n)^2] = O(n^{-\beta})$ and the kernel/Lipschitz constants are suitably chosen,

$$\mathbb{E}\big[\|f_{\hat D,\lambda}-f_\rho\|\big] = O\!\big(T^{-1/4} + n^{-(h\beta)/4}\big).$$

Universal consistency follows as $T, n \rightarrow \infty$ (Meunier et al., 2022).

5. Empirical Performance and Use Cases

SW kernels outperform MMD-based and standard RBF kernels in synthetic and real-world tasks where geometric structure is salient:

  • Mode-counting (synthetic Gaussians): SW kernels yield 20–30% lower RMSE than MMD on predicting cluster count from samples.
  • Image histogram classification: On MNIST and Fashion-MNIST (raw and perturbed), $K_{\mathrm{SW}_2}$ achieves $\approx 93\%$ accuracy vs. $\approx 79\%$ (MMD) and $\approx 90\%$ (RBF); under transformations, SW kernels maintain higher accuracy and robustness (Meunier et al., 2022).
  • Graph learning: Sliced Wasserstein Weisfeiler-Lehman (SWWL) graph kernels allow for positive-definite, scalable, and accurate graph similarity measures, handling datasets with $10^4$–$10^5$ nodes efficiently (Perez et al., 2024).
  • Audio captioning: Unbiased SW-RBF kernels with temporal augmentations improve alignment and generation fidelity under stochastic sampling (Luong et al., 8 Feb 2025).

SW kernels are especially effective when mass-transport geometry correlates with the learning target, e.g., mode structure, geometric invariance, and robustness to perturbations.

6. Variants and Extensions

Max-Sliced and Tree-Sliced Wasserstein Kernels

  • Max-sliced Wasserstein replaces the averaging over directions by maximization—optimizing a (potentially nonlinear) projection to maximize the 1D Wasserstein distance between the projected measures. While achieving sharper discrimination and dimension-free statistical rates, the projection optimization is NP-hard for $p=2$ but admits tight semidefinite relaxations (Wang et al., 2024); a crude random-search baseline is sketched after this list.
  • Tree-sliced Wasserstein kernels generalize the slice to arbitrary random tree metrics rather than lines (chains), averaging per-tree closed-form transports. This allows for flexible embedding of geometric structure and scalable, positive-definite kernels (Le et al., 2019).
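For intuition, a crude random-search baseline for the max-sliced distance (not the semidefinite relaxation of Wang et al.; names and candidate counts are illustrative):

```python
import numpy as np

def max_sliced_w2_random_search(x, y, n_candidates=500, rng=None):
    """Crude approximation of the max-sliced Wasserstein-2 distance: evaluate
    many random directions and keep the best (a baseline heuristic only, not
    the semidefinite relaxation of Wang et al.)."""
    rng = np.random.default_rng(rng)
    d = x.shape[1]
    thetas = rng.standard_normal((n_candidates, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    return max(projected_w2(x, y, th) for th in thetas)
```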

Unbiased and Monte-Carlo Approximations

  • The construction of unbiased estimators for the kernel value via Monte Carlo sampling is crucial for stochastic gradient optimization and for reducing estimator variance, with convergence rate $O(L^{-1/2})$ in the number $L$ of samples (Luong et al., 8 Feb 2025).

7. Practical Guidelines and Limitations

When to use SW kernels:

  • For regression or classification where output depends on underlying geometry of distributions or empirical measures.
  • For problems requiring robustness to support mismatch or geometric transformations (e.g., histogram classification, graph regression, multi-modality).
  • When computational efficiency is critical: 1D transport is closed-form and scalable.

Limitations:

  • MC estimation introduces additional stochastic error (i.e., three-stage sampling in distribution regression: bag, within-bag, and MC).
  • Convergence rates for distribution regression under SW are slower in worst-case theory ($n^{-1/4}$) compared to MMD ($n^{-1/2}$), but empirical geometry often compensates.
  • Exact SW kernel evaluation is infeasible in very high-dimensional settings unless $M$, $N$ are chosen adaptively.

Parameter tuning:

  • Choose $M,N$ (projections/quantiles) to balance accuracy and efficiency.
  • Kernel bandwidths (e.g., $\gamma$) via cross-validation or the median heuristic (a minimal sketch follows this list).
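A minimal sketch of the median heuristic, assuming a precomputed matrix of pairwise SW distances; the $1/(2\,\mathrm{median}^2)$ convention is one common choice, not mandated by the cited papers.

```python
import numpy as np

def median_heuristic_gamma(sw_dist_matrix):
    """Bandwidth gamma from the median of the off-diagonal pairwise SW
    distances, using the common 1 / (2 * median^2) convention; cross-validation
    can refine the value."""
    T = sw_dist_matrix.shape[0]
    off_diag = sw_dist_matrix[~np.eye(T, dtype=bool)]
    med = np.median(off_diag)
    return 1.0 / (2.0 * med ** 2)
```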

Sliced Wasserstein kernels fundamentally bridge geometric optimal transport and positive-definite kernel-based learning, providing a computationally tractable class of universal kernels well-suited for distributional data (Meunier et al., 2022, Kolouri et al., 2015, Perez et al., 2024, Luong et al., 8 Feb 2025, Le et al., 2019).
