Sliced Wasserstein Distances
- Sliced Wasserstein distances are measures that decompose high-dimensional optimal transport into tractable one-dimensional comparisons using the Radon transform.
- They efficiently compute discrepancies between probability distributions, enabling applications in machine learning, computer vision, and generative modeling.
- Recent variants like generalized, augmented, and double-sliced methods enhance discrimination via nonlinear projections and adaptive slicing for robust statistical analysis.
The sliced Wasserstein distance is a family of optimal transport-inspired measures that enable efficient and theoretically principled comparisons between high-dimensional probability distributions by decomposing the problem into tractable one-dimensional subproblems. This approach is grounded in projecting the measures onto one-dimensional subspaces (slices) using the Radon transform, computing the Wasserstein (optimal transport) distance within each slice, and then aggregating the results. Sliced Wasserstein distances and their variants have become central tools across machine learning, statistics, computer vision, generative modeling, shape analysis, and beyond, due to their computational efficiency, flexibility, and strong theoretical guarantees.
1. Mathematical Formulation and Fundamental Principles
Let $\mu$ and $\nu$ be probability measures on $\mathbb{R}^d$. The classical $p$-Wasserstein distance is defined as
$$W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|^p \, d\pi(x, y) \right)^{1/p},$$
where $\Pi(\mu, \nu)$ denotes the set of couplings of $\mu$ and $\nu$.
However, $W_p$ is computationally challenging in high dimensions. The sliced Wasserstein distance mitigates this by projecting both measures onto one-dimensional lines and leveraging the closed-form solution for 1D optimal transport:
$$SW_p^p(\mu, \nu) = \int_{\mathbb{S}^{d-1}} W_p^p\big(\theta^\ast_{\#}\mu,\ \theta^\ast_{\#}\nu\big)\, d\sigma(\theta),$$
where $\theta^\ast(x) = \langle \theta, x \rangle$ denotes projection onto direction $\theta \in \mathbb{S}^{d-1}$, and $\sigma$ is the uniform measure on the sphere.
For empirical measures, this integral is typically estimated via Monte Carlo by averaging over sampled directions.
The one-dimensional Wasserstein distance has an analytic solution via quantile functions: $W_p^p(\mu, \nu) = \int_0^1 |F_\mu^{-1}(t) - F_\nu^{-1}(t)|^p \, dt$, where $F_\mu^{-1}$ is the quantile function of $\mu$.
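As an illustration of the Monte Carlo estimator and the 1D quantile formula, the following NumPy sketch (helper names and defaults are illustrative; equal-size, uniformly weighted samples are assumed) estimates the sliced Wasserstein distance between two empirical measures by sorting the projected samples, so each slice costs $O(n \log n)$.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=100, p=2, rng=None):
    """Monte Carlo estimate of SW_p between two empirical measures.

    X, Y : (n, d) arrays of samples, assumed equal-sized with uniform weights.
    """
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    # Draw projection directions uniformly on the unit sphere S^{d-1}.
    theta = rng.standard_normal((n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project both samples onto each direction: shape (n_projections, n).
    X_proj, Y_proj = theta @ X.T, theta @ Y.T
    # For equal-size, uniformly weighted 1D measures, the quantile formula
    # reduces to matching sorted samples.
    X_proj.sort(axis=1)
    Y_proj.sort(axis=1)
    costs = np.mean(np.abs(X_proj - Y_proj) ** p, axis=1)   # per-slice W_p^p
    return np.mean(costs) ** (1.0 / p)

# Example: two Gaussian clouds in R^10 with shifted means.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
Y = rng.standard_normal((500, 10)) + 1.0
print(sliced_wasserstein(X, Y, n_projections=200))
```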
Variants such as the max-sliced Wasserstein distance, $\max\text{-}SW_p(\mu, \nu) = \max_{\theta \in \mathbb{S}^{d-1}} W_p(\theta^\ast_{\#}\mu, \theta^\ast_{\#}\nu)$, and extensions involving nonlinear or higher-dimensional projections are common (Kolouri et al., 2015, Kolouri et al., 2019, Nguyen et al., 2020, Nietert et al., 2022, Carlier et al., 18 Oct 2025).
2. Construction and Theoretical Properties
Radon Transform and Injectivity
The slicing process is formalized via the Radon transform:
$$\mathcal{R}f(t, \theta) = \int_{\mathbb{R}^d} f(x)\, \delta\big(t - \langle x, \theta \rangle\big)\, dx,$$
which maps a function $f \in L^1(\mathbb{R}^d)$ to an ensemble of 1D projected functions indexed by directions $\theta \in \mathbb{S}^{d-1}$. The invertibility of the Radon transform ensures that the sliced Wasserstein distance is a metric under mild conditions: if all slices of two measures agree, the measures are equal (Kolouri et al., 2015, Kolouri et al., 2019).
Metricity and Bounds
Sliced Wasserstein distances are proper metrics under injectivity of the slicing transform. For higher-order slicing (projections onto $k$-planes), the generalization holds with analogous injectivity criteria (Carlier et al., 18 Oct 2025).
Recent work gives sharp quantitative bounds between $SW_p$ and the classical $W_p$ for measures with bounded support:
$$W_p(\mu, \nu) \le C_{d, p, R}\, SW_p(\mu, \nu)^{1/d} \quad \text{for measures supported in a ball of radius } R,$$
with the exponent $1/d$ being optimal (Carlier et al., 18 Oct 2025). Bounds improve to exponent $1/(d-k+1)$ when slicing onto $k$-planes.
Hilbert Embedding
For $p = 2$, the 1D Wasserstein metric is Hilbertian, enabling embeddings into Hilbert spaces where sliced Wasserstein distances correspond to Euclidean norms (via an explicit nonlinear map), leading to conditional negative definiteness and positive-definite kernels (Kolouri et al., 2015, Rustamov et al., 2020).
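As a hedged illustration of how this Hilbertian structure is used in practice, the sketch below (reusing the hypothetical sliced_wasserstein helper from Section 1 and assuming equally sized sample sets) builds a Gaussian kernel Gram matrix from squared $SW_2$ distances; sharing one set of projection directions across all pairs keeps the estimated squared distances conditionally negative definite, hence the Gram matrix positive semidefinite.

```python
import numpy as np

def sw_gaussian_kernel(sample_sets, gamma=1.0, n_projections=200):
    """Gram matrix K[i, j] = exp(-gamma * SW_2(mu_i, mu_j)^2).

    sample_sets : list of (n, d) arrays, each an empirical measure.
    A fixed seed is passed so every pair is sliced with the same directions.
    """
    m = len(sample_sets)
    K = np.zeros((m, m))
    for i in range(m):
        for j in range(i, m):
            sw = sliced_wasserstein(sample_sets[i], sample_sets[j],
                                    n_projections=n_projections, p=2, rng=0)
            K[i, j] = K[j, i] = np.exp(-gamma * sw ** 2)
    return K
```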
3. Variants and Extensions
Generalized, Augmented, and Energy-based Distances
- Generalized Sliced Wasserstein (GSW): Uses nonlinear projections via a generalized Radon transform, in which the linear projection $\langle \theta, x \rangle$ is replaced by a family of defining functions $g(x, \theta)$ (e.g., homogeneous polynomials or neural networks), allowing for more flexible slicing and capturing complex geometric features with fewer projections; metricity is retained when the generalized transform is injective (Kolouri et al., 2019). See the sketch after this list.
- Augmented Sliced Wasserstein (ASWD): Uses a learnable injective mapping (typically a neural network) to map samples to higher-dimensional hypersurfaces before slicing, thereby achieving enhanced discrimination between complicated distributions (Chen et al., 2020).
- Energy-based Sliced Wasserstein (EBSW): The slicing distribution is reweighted by an energy function of the 1D Wasserstein distance, concentrating the integration over discriminative directions and improving efficiency and sensitivity (Nguyen et al., 2023).
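As a concrete instance of the generalized slicing mentioned in the first item above, the following sketch uses the homogeneous-polynomial defining-function family discussed in the GSW literature (the degree and helper names are illustrative assumptions); directions are drawn on the unit sphere of the monomial coefficient space, and the ordinary SWD estimator from Section 1 is reused on the lifted features.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features(X, degree=3):
    """All homogeneous monomials of the given (odd) total degree, one per column."""
    n, d = X.shape
    cols = []
    for idx in combinations_with_replacement(range(d), degree):
        col = np.ones(n)
        for i in idx:
            col = col * X[:, i]
        cols.append(col)
    return np.stack(cols, axis=1)

def generalized_sliced_wasserstein(X, Y, degree=3, n_projections=100, p=2, rng=None):
    """GSW sketch with a polynomial defining function g_theta(x) = <theta, monomials(x)>;
    degree=1 recovers the ordinary sliced Wasserstein distance."""
    return sliced_wasserstein(poly_features(X, degree), poly_features(Y, degree),
                              n_projections=n_projections, p=p, rng=rng)
```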
Hierarchical and Double Slicing
- Hierarchical Sliced Wasserstein (HSW): Introduces bottleneck projections: the data are first projected onto a small set of directions spanning a lower-dimensional subspace, and numerous final 1D projections are then formed as random mixtures of these bottleneck coordinates, maintaining injectivity while improving computational scaling (Nguyen et al., 2022); see the sketch after this list.
- Double-sliced Wasserstein (DSW): For meta-measures (measures over measures), the DSW integrates classical spherical slicing with "functional" slicing on the quantile-function space $L_2([0,1])$. This approach offers both computational savings and stability for comparing datasets of distributions (Piening et al., 26 Sep 2025).
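The bottleneck-then-mixture construction described in the HSW item above can be sketched roughly as follows (a simplified illustration under assumed sampling choices; the paper's exact distributions for bottleneck directions and mixture weights may differ).

```python
import numpy as np

def hierarchical_sliced_wasserstein(X, Y, n_bottleneck=8, n_projections=200,
                                    p=2, rng=None):
    """Rough HSW-style estimator: project once onto a few bottleneck directions
    (cost O(n*d*k)), then form many 1D projections as random mixtures of the
    bottleneck coordinates (cost O(n*k*L) instead of O(n*d*L))."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    B = rng.standard_normal((n_bottleneck, d))
    B /= np.linalg.norm(B, axis=1, keepdims=True)            # bottleneck directions
    Xb, Yb = X @ B.T, Y @ B.T                                 # (n, k) bottleneck coords
    W = rng.standard_normal((n_projections, n_bottleneck))
    W /= np.linalg.norm(W, axis=1, keepdims=True)             # mixture weights
    Xp = np.sort(Xb @ W.T, axis=0)                            # (n, L) final projections
    Yp = np.sort(Yb @ W.T, axis=0)
    return np.mean(np.abs(Xp - Yp) ** p) ** (1.0 / p)
```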
Manifold and Infinite-dimensional Extensions
- Intrinsic Sliced Wasserstein: Extends slicing to non-Euclidean domains by projecting along Laplace–Beltrami eigenfunctions or geodesics on manifolds (Rustamov et al., 2020, Bonet et al., 11 Mar 2024).
- Hilbert Spaces: The SWD is defined using projections on the unit sphere of a separable Hilbert space, equipped with an appropriate (possibly Gaussian-induced) measure, and retains metric properties and convergence results (Han, 2023).
4. Statistical and Computational Aspects
Estimation, Variance, and Central Limit Theory
- Monte Carlo integration is standard for estimating SWD, with error scaling as $O(L^{-1/2})$ in the number of projections $L$ (Nietert et al., 2022, Nguyen et al., 2023).
- Variance can be significantly reduced using control variates constructed from 1D Gaussian approximations of the projected distributions (Nguyen et al., 2023).
- Orthogonal or stratified sampling of projections further improves variance and efficiency, leveraging connections to stratified Monte Carlo integration (Rowland et al., 2019); see the sketch after this list.
- Sharp central limit theorems (CLT) for SWD—centered at the population value—enable confidence intervals and hypothesis testing, with bias correction and consistent asymptotic variance estimators specially adapted to high dimensionality and empirical measures (Rodríguez-Vítores et al., 24 Mar 2025, Xi et al., 2022).
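The orthogonal-coupling idea in the list above can be sketched as follows (a minimal illustration, not the exact scheme of Rowland et al.): directions are drawn in orthonormal blocks via QR decompositions of Gaussian matrices and then substituted for the i.i.d. directions in the estimator from Section 1.

```python
import numpy as np

def orthogonal_directions(n_projections, d, rng=None):
    """Sample unit directions in orthonormal blocks of size <= d.

    Within each block the directions are mutually orthogonal, a simple
    stratification that typically lowers the variance of the SWD estimate."""
    rng = np.random.default_rng(rng)
    blocks, remaining = [], n_projections
    while remaining > 0:
        Q, R = np.linalg.qr(rng.standard_normal((d, d)))
        Q = Q * np.sign(np.diag(R))        # sign fix -> Haar-distributed rotation
        blocks.append(Q.T[:min(remaining, d)])
        remaining -= d
    return np.concatenate(blocks, axis=0)  # (n_projections, d) unit directions
```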
Robustness
- Sliced Wasserstein distances (notably for $p = 1$) exhibit strong robustness to data contamination, with dimension-free minimax risks and an equivalence between robust mean estimation and robust estimation under the max-sliced distance, in contrast to the dimension-dependent behavior of the classical Wasserstein distance (Nietert et al., 2022).
Computational Complexity
- SWD computation involves $O(Ln(d + \log n))$ operations for $L$ projections and $n$-point empirical measures, while full OT scales superlinearly in $n$ (e.g., $O(n^3 \log n)$ for exact linear-programming solvers) and is intractable for large $n$; a timing illustration follows the list below.
- Hierarchical and augmented slicing or adaptive slicing distributions further improve computational complexity and convergence (Nguyen et al., 2022, Chen et al., 2020, Nguyen et al., 2023).
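To make the scaling concrete, the following hedged benchmark (reusing the sliced_wasserstein sketch from Section 1, and using SciPy's assignment solver as an exact $W_2$ for equal-size, uniformly weighted clouds; timings are machine-dependent) contrasts the two costs.

```python
import time
import numpy as np
from scipy.optimize import linear_sum_assignment

def exact_w2(X, Y):
    """Exact W_2 between equal-size, uniformly weighted point clouds,
    solved as an assignment problem (roughly cubic in n)."""
    C = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    rows, cols = linear_sum_assignment(C)
    return np.sqrt(C[rows, cols].mean())

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 50))
Y = rng.standard_normal((1000, 50)) + 0.5

t0 = time.perf_counter()
sw = sliced_wasserstein(X, Y, n_projections=200)
t1 = time.perf_counter()
w2 = exact_w2(X, Y)
t2 = time.perf_counter()
print(f"SW_2 ~ {sw:.3f} ({t1 - t0:.2f}s)  vs  W_2 = {w2:.3f} ({t2 - t1:.2f}s)")
```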
5. Algorithmic Applications
Sliced Wasserstein distances and kernels are used as geometric discrepancy measures in a variety of modern machine learning tasks:
| Application Domain | Role of Sliced Wasserstein Distance |
|---|---|
| Kernel methods | Gaussian SWD kernels provide positive-definite, interpretable kernels for SVMs and kernel k-means |
| Clustering, Dim. Reduct. | Kernel k-means and kernel PCA in density feature space via invertible nonlinear maps |
| Generative modeling | SWD and its adaptive blocks accelerate and stabilize GANs, VAEs, flows [(Wu et al., 2017), etc.] |
| Robust estimation | SWD-based estimators maintain stability and robustness to outliers and contamination |
| Manifold learning | Intrinsic and Cartan–Hadamard SWD generalize OT to non-Euclidean and graph domains |
| Statistical inference | CLTs, variance estimation, and confidence intervals via SWD for empirical measures |
| Meta-distribution analysis | Double-sliced and WoW distances for comparing datasets and graphs (Piening et al., 26 Sep 2025) |
Within these domains, the sliced approach enables both greater scalability (by reducing computational cost) and improved statistical behavior (by avoiding the curse of dimensionality). Due to injectivity and Hilbertian structures, SWD-based losses and kernels come with invertibility, easier optimization, and often more interpretable geometric structure (Kolouri et al., 2015, Nguyen et al., 2022, Rustamov et al., 2020).
6. Regression, Meta-Analysis, and Extensions
Sliced Wasserstein Regression
Distribution regression for vector predictors and multivariate response distributions leverages the Radon slicing transform to reduce multivariate optimal transport to convex, tractable 1D regression problems. Global (slice-averaged) and slice-wise Fréchet regressions both enjoy provable convergence rates (parametric for SAW, slightly subparametric for SWW due to reconstruction bias) and practical algorithms based on kernel-weighted empirical risk minimization and regularized inversion of the Radon transform. Applications include regression of joint distributions in climate and finance, with simulations showing that slice-wise approaches outperform aggregated methods when the conditional structure is complex (Chen et al., 2023).
Efficient Estimation and Surrogates
Sliced Wasserstein distances can be used as surrogates for the full Wasserstein distance by regression: for example, a linear model trained on a small calibration set of paired sliced and exact Wasserstein values can predict $W_p$ from the SWD and its upper/lower bounds, efficiently replacing large-scale optimization in downstream applications such as point cloud classification or training deep optimal transport models (Nguyen et al., 24 Sep 2025).
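A minimal version of this calibration idea might look as follows (helper names are hypothetical, the sliced_wasserstein sketch from Section 1 is reused as the single regressor, and the full method also exploits upper/lower bounds as additional features).

```python
import numpy as np

def fit_wasserstein_surrogate(calib_pairs, exact_w):
    """Least-squares fit of a linear map from SWD to exact Wasserstein values.

    calib_pairs : list of (X, Y) sample arrays whose exact distances were
                  computed once with a full OT solver (the expensive step).
    exact_w     : array of those exact Wasserstein distances.
    """
    feats = np.array([[1.0, sliced_wasserstein(X, Y, n_projections=200)]
                      for X, Y in calib_pairs])      # intercept + SWD feature
    coef, *_ = np.linalg.lstsq(feats, np.asarray(exact_w), rcond=None)
    return coef

def predict_wasserstein(coef, X, Y):
    """Cheap surrogate prediction of the full distance from the SWD alone."""
    return coef[0] + coef[1] * sliced_wasserstein(X, Y, n_projections=200)
```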
Double-sliced and Meta-Measure Distances
The double-sliced Wasserstein metric (DSW) compares meta-measures by combining classical slicing in $\mathbb{R}^d$ with $L_2$ slicing via the quantile-function isometry in 1D, yielding a scalable and stable "distance between datasets" for images, point clouds, and shapes (Piening et al., 26 Sep 2025).
7. Limitations, Sharpness, and Theoretical Comparisons
While sliced Wasserstein distances provide scalable proxies for $W_p$, there are inherent quantitative limits to how well they can approximate the full transport cost: on bounded supports, $W_p(\mu, \nu) \le C_{d, p, R}\, SW_p(\mu, \nu)^{1/d}$, with this exponent being unimprovable in general and better rates only available under additional structure (e.g., symmetry) (Carlier et al., 18 Oct 2025).
While slicing along higher-dimensional $k$-planes tightens the bound (the exponent improves to $1/(d-k+1)$), the computational savings diminish as $k$ grows.
These theoretical results clarify both the power and inherent losses of dimensional reduction by slicing, justifying when and how SWD should be deployed as a surrogate for full optimal transport in high-dimensional or geometric data analysis.
Sliced Wasserstein distances thus form a flexible, theoretically grounded, and computationally tractable toolkit for comparing, analyzing, and modeling probability measures and structured datasets, with ongoing research devoted to sharpening theoretical comparisons, extending to new domains (manifolds, Hilbert spaces, meta-measures), and devising improved approximation and optimization schemes.