Sliced Wasserstein Kernels
- Sliced Wasserstein Kernels are positive-definite kernels derived by averaging one-dimensional Wasserstein distances over projections, capturing the geometric structure of distributions.
- They are constructed using a distance-substitution approach with Gaussian or Laplacian functions on Hilbertian metrics, ensuring both computational efficiency and universality.
- Empirical results show these kernels excel in tasks like image histogram classification, graph learning, and audio captioning, offering improved accuracy and robustness.
The Sliced Wasserstein kernel is a family of positive-definite kernels constructed from the Sliced Wasserstein (SW) distance—a measure derived from optimal transport that computes the average one-dimensional Wasserstein distance between projections (“slices”) of high-dimensional probability measures. Sliced Wasserstein kernels inherit geometric sensitivity from the Wasserstein distance, computational efficiency via closed-form 1D transport, and universality on the space of probability distributions, enabling effective learning over distributional inputs in kernel-based machine learning frameworks.
1. Definition and Mathematical Foundations
Let $\mu, \nu$ be probability measures on a compact domain $\Omega \subset \mathbb{R}^d$. For each direction $\theta \in \mathbb{S}^{d-1}$ (the unit sphere), define the push-forward projections $\mu_\theta = (P_\theta)_\# \mu$ and $\nu_\theta = (P_\theta)_\# \nu$, where $P_\theta(x) = \langle \theta, x \rangle$. The one-dimensional $p$-Wasserstein distance between $\mu_\theta$ and $\nu_\theta$ admits the explicit formula

$$W_p^p(\mu_\theta, \nu_\theta) = \int_0^1 \big| F_{\mu_\theta}^{-1}(t) - F_{\nu_\theta}^{-1}(t) \big|^p \, dt,$$

where $F^{-1}$ is the generalized inverse CDF (quantile function).
The Sliced Wasserstein distance of order $p$ is then

$$SW_p^p(\mu, \nu) = \int_{\mathbb{S}^{d-1}} W_p^p(\mu_\theta, \nu_\theta) \, d\sigma(\theta)$$

for $\sigma$ the uniform measure on $\mathbb{S}^{d-1}$.

For $p = 2$, the feature map

$$\Phi(\mu) : (\theta, t) \mapsto F_{\mu_\theta}^{-1}(t), \qquad (\theta, t) \in \mathbb{S}^{d-1} \times (0, 1),$$

exhibits the Hilbertian structure $SW_2(\mu, \nu) = \|\Phi(\mu) - \Phi(\nu)\|_{L^2(\mathbb{S}^{d-1} \times (0,1))}$. Thus, $SW_2$ is a genuine metric metrizing weak convergence on $\mathcal{P}(\Omega)$ and allows for the construction of positive-definite (p.d.) kernels (Meunier et al., 2022, Kolouri et al., 2015).
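To make this concrete, the following minimal NumPy sketch estimates $SW_2$ by Monte Carlo for two equally weighted empirical measures with the same number of points (the function name `sliced_w2` and its default parameters are illustrative choices, not taken from the cited papers):

```python
import numpy as np

def sliced_w2(x, y, n_proj=100, seed=0):
    """Monte Carlo estimate of SW_2 between two empirical measures given as
    equally weighted point clouds x (n, d) and y (n, d) with the same n."""
    rng = np.random.default_rng(seed)
    # Uniform random directions on the unit sphere S^{d-1}.
    theta = rng.normal(size=(n_proj, x.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)

    # Project onto each direction; sorting gives the empirical quantile functions.
    x_proj = np.sort(x @ theta.T, axis=0)   # (n, n_proj)
    y_proj = np.sort(y @ theta.T, axis=0)   # (n, n_proj)

    # Average the squared 1D Wasserstein distances over directions, take the root.
    return np.sqrt(np.mean((x_proj - y_proj) ** 2))
```

The equal-size, equal-weight assumption lets sorted projections be matched index-by-index, which is exactly the closed-form 1D transport exploited throughout.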
2. Construction and Universality of Sliced Wasserstein Kernels
A distance-substitution kernel is constructed using a monotonically decreasing function $f$ (e.g., a Gaussian or Laplacian RBF profile) and a Hilbertian distance $d$:

$$K(\mu, \nu) = f\big(d(\mu, \nu)\big).$$

For SW, this leads to the Gaussian- and Laplacian-style kernels

$$K_G(\mu, \nu) = \exp\big(-\gamma \, SW_2^2(\mu, \nu)\big), \qquad K_L(\mu, \nu) = \exp\big(-\gamma \, SW_2(\mu, \nu)\big).$$

Schoenberg’s theorem guarantees positive definiteness when the substituted distance (or its square) is (conditionally) negative-definite. By the Hilbertian property of $SW_2$, the resultant kernels are p.d.
When the underlying space $\Omega$ is compact and $SW_2$ metrizes weak convergence, the corresponding Gaussian-type kernel is universal: its RKHS is dense in the space of continuous functions on $\mathcal{P}(\Omega)$, enabling universal consistency in regression and classification (Meunier et al., 2022).
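A minimal sketch of the resulting Gaussian-type SW Gram matrix, assuming equally sized point clouds and a shared set of Monte Carlo directions (function names and defaults are illustrative):

```python
import numpy as np

def sw_gaussian_gram(samples, gamma=1.0, n_proj=100, seed=0):
    """Gram matrix K[i, j] = exp(-gamma * SW_2^2(mu_i, mu_j)) for a list of
    equally sized point clouds, sharing one set of projection directions."""
    rng = np.random.default_rng(seed)
    d = samples[0].shape[1]
    theta = rng.normal(size=(n_proj, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)

    # Sorted projections play the role of empirical quantile functions.
    quantiles = [np.sort(x @ theta.T, axis=0) for x in samples]

    n_dist = len(samples)
    gram = np.empty((n_dist, n_dist))
    for i in range(n_dist):
        for j in range(i, n_dist):
            sw2_sq = np.mean((quantiles[i] - quantiles[j]) ** 2)  # MC estimate of SW_2^2
            gram[i, j] = gram[j, i] = np.exp(-gamma * sw2_sq)     # distance substitution
    return gram
```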
3. Algorithmic Implementation and Computational Considerations
Approximating $SW_2$ involves:
- Sampling $M$ random directions $\theta_1, \dots, \theta_M$ uniformly on $\mathbb{S}^{d-1}$
- Sampling $T$ quantile grid points $t_1, \dots, t_T \in (0, 1)$
For a measure $\mu$, the approximate feature vector in $\mathbb{R}^{MT}$ is

$$\hat\Phi(\mu) = \frac{1}{\sqrt{MT}} \Big( F_{\mu_{\theta_m}}^{-1}(t_j) \Big)_{1 \le m \le M,\; 1 \le j \le T}.$$

Computing SW kernel matrix entries involves sorting projected samples ($O(n \log n)$ per direction for empirical measures with $n$ points), feature evaluation ($O(MT)$ per measure), and kernel matrix evaluation ($O(N^2)$ for $N$ distributions).
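A sketch of this explicit feature map for one empirical measure, assuming a shared direction set and quantile grid (interpolating the empirical quantile function is one possible implementation choice):

```python
import numpy as np

def sw_feature_map(x, theta, t_grid):
    """Approximate SW_2 feature vector of one empirical measure: empirical
    quantiles of each projection evaluated on a shared quantile grid.

    x      : (n, d) sample points
    theta  : (M, d) unit projection directions (shared across all measures)
    t_grid : (T,) quantile levels in (0, 1)
    """
    proj = np.sort(x @ theta.T, axis=0)            # (n, M) sorted projections
    n = proj.shape[0]
    levels = (np.arange(n) + 0.5) / n              # empirical quantile levels
    # Interpolate each slice's quantile function on the common grid t_grid.
    quantiles = np.stack([np.interp(t_grid, levels, proj[:, m])
                          for m in range(theta.shape[0])])      # (M, T)
    return quantiles.ravel() / np.sqrt(quantiles.size)          # element of R^{M*T}
```

The squared Euclidean distance between two such feature vectors, built with the same `theta` and `t_grid`, approximates $SW_2^2$, so standard RBF machinery can be applied directly to the features.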
Empirically, moderate numbers of projections $M$ and quantile points $T$ yield robust estimates. The kernel bandwidth is typically tuned via cross-validation (Meunier et al., 2022, Kolouri et al., 2015, Luong et al., 8 Feb 2025).
Monte Carlo integration over directions is necessary; the kernel approximation error decreases as $O(1/\sqrt{M})$ in the number of sampled projections. Unbiased variants further reduce estimation bias by averaging exponentials of per-slice distances, enabling direct unbiased stochastic optimization (Luong et al., 8 Feb 2025).
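The precise unbiased construction of Luong et al. is not reproduced here; the sketch below merely contrasts the plug-in estimate $\exp(-\gamma \widehat{SW}_2^2)$, which is biased because the exponential is nonlinear in the Monte Carlo average, with an estimator that averages exponentials across slices and is unbiased for the slice-averaged kernel $\mathbb{E}_\theta[\exp(-\gamma W_2^2(\mu_\theta, \nu_\theta))]$ (names and parameters are illustrative):

```python
import numpy as np

def sw_rbf_estimates(x, y, gamma=1.0, n_proj=100, seed=0):
    """Contrast a plug-in SW-RBF estimate with a slice-averaged-exponential one."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(n_proj, x.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)

    x_proj = np.sort(x @ theta.T, axis=0)
    y_proj = np.sort(y @ theta.T, axis=0)
    w2_sq = np.mean((x_proj - y_proj) ** 2, axis=0)     # per-slice 1D W_2^2, shape (n_proj,)

    plug_in  = np.exp(-gamma * w2_sq.mean())            # biased estimate of exp(-gamma * SW_2^2)
    averaged = np.exp(-gamma * w2_sq).mean()            # unbiased for E_theta[exp(-gamma * W_2^2)]
    return plug_in, averaged
```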
4. Theoretical Guarantees: Consistency and Excess Risk
For kernel ridge regression (KRR) with SW kernels on empirical measures $\hat\mu_1, \dots, \hat\mu_N$ and associated responses $y_1, \dots, y_N$, the estimator

$$\hat f(\mu) = \sum_{i=1}^N \alpha_i \, K(\mu, \hat\mu_i), \qquad \alpha = (G + \lambda N I)^{-1} y,$$

interpolates via the Gram matrix $G$ with entries $G_{ij} = K(\hat\mu_i, \hat\mu_j)$.
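A minimal sketch of KRR on precomputed SW kernel matrices (for instance, the output of the Gram-matrix sketch in Section 2); the $\lambda N$ scaling of the ridge term is one common convention, not a prescription of the cited work:

```python
import numpy as np

def sw_krr_fit_predict(gram_train, y_train, gram_test_train, lam=1e-3):
    """Kernel ridge regression on precomputed SW kernel matrices.

    gram_train      : (N, N) kernel matrix between training distributions
    y_train         : (N,) responses
    gram_test_train : (N_test, N) kernel matrix between test and training distributions
    """
    n = gram_train.shape[0]
    # Dual coefficients alpha = (G + lambda * N * I)^{-1} y.
    alpha = np.linalg.solve(gram_train + lam * n * np.eye(n), y_train)
    return gram_test_train @ alpha
```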
Under the assumption that $K$ is universal and Hölder-continuous with respect to $SW_2$, the excess risk admits a bound that separates
- Stage-1 sampling complexity (the number of bags $N$)
- Stage-2 sampling complexity (the number of samples $n$ per bag)
- The effective kernel dimension, and
- The regularization parameter $\lambda$
Concretely, if the number of samples per bag grows suitably with the number of bags and the kernel/Lipschitz constants are suitably chosen, the excess risk vanishes; universal consistency follows as $N, n \to \infty$ and $\lambda \to 0$ (Meunier et al., 2022).
5. Empirical Performance and Use Cases
SW kernels outperform MMD-based and standard RBF kernels in synthetic and real-world tasks where geometric structure is salient:
- Mode-counting (synthetic Gaussians): SW kernels yield 20–30% lower RMSE than MMD on predicting cluster count from samples.
- Image histogram classification: On MNIST and Fashion-MNIST (raw and perturbed), SW kernels achieve higher classification accuracy than MMD and standard RBF baselines, and maintain this advantage in accuracy and robustness under geometric transformations (Meunier et al., 2022).
- Graph learning: Sliced Wasserstein Weisfeiler-Lehman (SWWL) graph kernels provide positive-definite, scalable, and accurate graph similarity measures, handling datasets containing large graphs efficiently (Perez et al., 2024).
- Audio captioning: Unbiased SW-RBF kernels with temporal augmentations improve alignment and generation fidelity under stochastic sampling (Luong et al., 8 Feb 2025).
SW kernels are especially effective when mass-transport geometry correlates with the learning target, e.g., mode structure, geometric invariance, and robustness to perturbations.
6. Variants and Extensions
Max-Sliced and Tree-Sliced Wasserstein Kernels
- Max-sliced Wasserstein replaces the averaging over directions by maximization: a (potentially nonlinear) projection is optimized to maximize the 1D Wasserstein distance between the projected measures (see the sketch after this list). While achieving sharper discrimination and dimension-free statistical rates, the projection optimization is NP-hard in general but admits tight semidefinite relaxations (Wang et al., 2024).
- Tree-sliced Wasserstein kernels generalize the slice to arbitrary random tree metrics rather than lines (chains), averaging per-tree closed-form transports. This allows for flexible embedding of geometric structure and scalable, positive-definite kernels (Le et al., 2019).
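The sketch below is a purely illustrative sampling-based approximation of the max-sliced objective; it is not the semidefinite relaxation of Wang et al. (2024):

```python
import numpy as np

def max_sliced_w2_random_search(x, y, n_candidates=500, seed=0):
    """Crude approximation of max-sliced W_2 between two equally sized point
    clouds: keep the best direction out of a random candidate pool. This only
    illustrates the definition; it does not solve the non-convex projection problem."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(n_candidates, x.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)

    x_proj = np.sort(x @ theta.T, axis=0)
    y_proj = np.sort(y @ theta.T, axis=0)
    w2_sq = np.mean((x_proj - y_proj) ** 2, axis=0)     # per-direction 1D W_2^2

    best = int(np.argmax(w2_sq))
    return np.sqrt(w2_sq[best]), theta[best]            # value and maximizing direction
```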
Unbiased and Monte-Carlo Approximations
- Constructing unbiased Monte Carlo estimators of the kernel value is crucial for stochastic gradient optimization and for reducing estimator variance, with the standard Monte Carlo rate of $O(1/\sqrt{L})$ in the number $L$ of samples (Luong et al., 8 Feb 2025).
7. Practical Guidelines and Limitations
When to use SW kernels:
- For regression or classification where the output depends on the underlying geometry of distributions or empirical measures.
- For problems requiring robustness to support mismatch or geometric transformations (e.g., histogram classification, graph regression, multi-modality).
- When computational efficiency is critical: 1D transport is closed-form and scalable.
Limitations:
- MC estimation introduces additional stochastic error (i.e., three-stage sampling in distribution regression: bag, within-bag, and MC).
- Worst-case convergence rates for distribution regression under SW are slower than those established for MMD-based kernels, but the geometric sensitivity of SW often compensates empirically.
- Exact SW kernel evaluation is infeasible in very high-dimensional settings unless the numbers of projections $M$ and quantile points $T$ are chosen adaptively.
Parameter tuning:
- Choose the number of projections $M$ and quantile grid points $T$ to balance accuracy and efficiency.
- Tune kernel bandwidths (e.g., $\gamma$ in the Gaussian-type kernel) via cross-validation or a median heuristic; a minimal median-heuristic sketch follows this list.
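As one concrete bandwidth choice, a median-heuristic sketch over a precomputed matrix of pairwise $SW_2$ distances (the constant in the heuristic is a common convention, not taken from the cited papers):

```python
import numpy as np

def median_heuristic_gamma(sw_dist_matrix):
    """Set gamma = 1 / (2 * median(SW_2)^2) from off-diagonal pairwise SW_2 distances."""
    i_upper = np.triu_indices_from(sw_dist_matrix, k=1)
    median_dist = np.median(sw_dist_matrix[i_upper])
    return 1.0 / (2.0 * median_dist ** 2)
```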
Sliced Wasserstein kernels fundamentally bridge geometric optimal transport and positive-definite kernel-based learning, providing a computationally tractable class of universal kernels well-suited for distributional data (Meunier et al., 2022, Kolouri et al., 2015, Perez et al., 2024, Luong et al., 8 Feb 2025, Le et al., 2019).