Sliced Wasserstein Distance: Advances & Applications

Updated 21 November 2025
  • Sliced Wasserstein Distance is a metric that averages one-dimensional Wasserstein distances over random projections to compare probability distributions.
  • It reduces high-dimensional optimal transport to efficient O(N log N) one-dimensional sorting problems, bypassing the curse of dimensionality.
  • Widely applied in generative modeling, robust statistics, and inference, its extensions like Max-SW and EBSW improve precision and computational efficiency.

The sliced Wasserstein distance (SW), an integral tool in modern computational optimal transport, provides a scalable and statistically robust discrepancy measure between probability distributions by averaging one-dimensional Wasserstein distances over random projections. SW combines the geometric rigor of classic Wasserstein distances with significant algorithmic speedups, circumventing the computational intractability of high-dimensional optimal transport. Its conceptual simplicity and broad applicability have made it central to contemporary machine learning practice, particularly in generative modeling, robust statistics, and inference.

1. Mathematical Definition and Core Properties

Given two Borel probability measures $\mu,\nu$ on $\mathbb{R}^d$ with finite $p$th moments, the $p$-sliced Wasserstein distance is defined as the $L^p$ norm over the unit sphere $S^{d-1}$ of one-dimensional Wasserstein distances between the projected measures:

$$SW_p(\mu,\nu) := \left( \int_{S^{d-1}} W_p^p\big((P_\theta)_\#\mu,\ (P_\theta)_\#\nu\big)\ d\sigma(\theta) \right)^{1/p},$$

where $P_\theta(x) = \langle\theta, x\rangle$ projects to the line in direction $\theta$, and $W_p$ is the classical one-dimensional Wasserstein distance. Equivalently, for empirical measures or those with densities, SW can be written with quantile functions:

$$SW_p(\mu,\nu) = \left( \int_{S^{d-1}} \int_0^1 \big|F_\mu^{\theta,-1}(t) - F_\nu^{\theta,-1}(t)\big|^p\ dt\ d\sigma(\theta) \right)^{1/p}.$$

SW is a bona fide metric (i.e., nonnegative, symmetric, satisfies the triangle inequality, and $SW_p(\mu,\nu)=0 \iff \mu=\nu$), provided the set of projection directions is rich enough (injectivity of the slicing transform is ensured for the full unit sphere) (Kolouri et al., 2017).
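
For equal-size empirical samples with uniform weights, the inner one-dimensional distance in the quantile formula reduces to a comparison of order statistics. The following is a minimal numpy sketch of that reduction (the helper name `wasserstein_1d` is illustrative, not taken from the cited works):

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """W_p between two equal-size empirical measures on the real line."""
    # Sorting yields the empirical quantile functions, so W_p^p is the average
    # of |x_(i) - y_(i)|^p over matched order statistics.
    x_sorted = np.sort(x)
    y_sorted = np.sort(y)
    return np.mean(np.abs(x_sorted - y_sorted) ** p) ** (1.0 / p)
```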

2. Computational and Statistical Scalability

A central feature of SW is its computational tractability relative to the full $d$-dimensional Wasserstein distance. While the classical Wasserstein distance requires solving a $d$-dimensional OT problem, typically $O(N^3)$ for $N$-point empirical distributions, SW reduces this task to numerous 1D problems, each solvable in $O(N\log N)$ via sorting. The integral over the sphere is numerically approximated using Monte Carlo:

$$\widehat{SW}_p^p(\mu,\nu) = \frac{1}{L}\sum_{\ell=1}^L W_p^p\left( (P_{\theta_\ell})_\#\mu,\ (P_{\theta_\ell})_\#\nu \right), \qquad \theta_\ell \sim \text{Unif}(S^{d-1}),$$

yielding total complexity $O(L N \log N)$. Empirically, $L = 100$–$1000$ suffices in high dimensions (Kolouri et al., 2017, Rodríguez-Vítores et al., 24 Mar 2025).
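
A minimal numpy sketch of the Monte Carlo estimator above, assuming equal-size samples with uniform weights (function and variable names are illustrative):

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=200, p=2, seed=None):
    """Monte Carlo estimate of SW_p between empirical measures X, Y ((n, d) arrays)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Directions drawn uniformly on the unit sphere S^{d-1}.
    thetas = rng.standard_normal((n_projections, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    total = 0.0
    for theta in thetas:
        # Each slice is a 1-D problem: project, sort, and compare quantiles.
        x_proj = np.sort(X @ theta)
        y_proj = np.sort(Y @ theta)
        total += np.mean(np.abs(x_proj - y_proj) ** p)
    return (total / n_projections) ** (1.0 / p)

# Example usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))
Y = rng.normal(loc=0.5, size=(500, 64))
print(sliced_wasserstein(X, Y))
```

The total cost is $L$ sorts of $N$ points each, matching the $O(L N \log N)$ complexity stated above.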

SW avoids the curse of dimensionality: both estimation error and sample complexity do not worsen with $d$, in contrast to classic OT metrics. For log-concave measures, the expected empirical SW error decays as $n^{-1/(2\vee p)}$ up to log factors in $n$ (Nietert et al., 2022). Uniform convergence rates under suitable moment and regularity assumptions are attainable for SW and its empirical variants (Rodríguez-Vítores et al., 24 Mar 2025, Nguyen et al., 2023).

3. Advanced Variants and Algorithmic Innovations

Recognizing limitations of standard SW—specifically, inefficiency when few projections are highly informative or the need to control estimator variance—numerous extensions and computational enhancements have been developed.

  • Max-Sliced, Distributional, and Energy-Based SW: Max-SW focuses on the most discriminative projection, DSW optimizes the slicing distribution under a diversity regularizer, and EBSW introduces an energy-based reweighting of projections, interpolating between uniform, DSW, and Max-SW (Nguyen et al., 2020, Nguyen et al., 2023). EBSW is a semi-metric, recovers SW as a special case, and, as temperature increases, transitions to Max-SW (Nguyen et al., 2023). A schematic Max-SW sketch follows this list.
  • Hierarchical SW (HSW): To address projection bottlenecks in $d \gg n$ settings, HSW uses lower-dimensional bottleneck projections recursively composed to form random projections, reducing the computational expense and memory requirements without loss of metricity (Nguyen et al., 2022).
  • Augmented and Generalized SW: By embedding samples into high-dimensional learned feature spaces via neural networks, ASW and GSW drastically increase the flexibility of projections. For odd degree polynomials and certain neural parametrizations, deterministic, dimension-reduced approximations are available (Le et al., 2022, Chen et al., 2020).
  • Monte Carlo Variance Reduction: Use of control variates, orthogonal or stratified sampling, and importance reweighting mitigates the variance incurred by random projection Monte Carlo, enabling high-precision SW approximations with fewer projections (Nguyen et al., 2023, Rowland et al., 2019, Nguyen et al., 2023).
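
To make the Max-SW idea above concrete, here is a schematic numpy sketch for $p=2$ that performs projected gradient ascent on the sphere, holding the one-dimensional matching fixed within each step; it is a rough surrogate for the optimization procedures studied in the cited works, and all names are illustrative:

```python
import numpy as np

def max_sliced_wasserstein(X, Y, n_steps=100, lr=0.1, seed=None):
    """Crude projected-gradient-ascent approximation of Max-SW_2 for equal-size samples."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    theta = rng.standard_normal(d)
    theta /= np.linalg.norm(theta)
    for _ in range(n_steps):
        x_proj, y_proj = X @ theta, Y @ theta
        ix, iy = np.argsort(x_proj), np.argsort(y_proj)
        # Gradient of W_2^2 along theta with the optimal 1-D matching held fixed.
        diff = x_proj[ix] - y_proj[iy]
        grad = 2.0 * np.mean(diff[:, None] * (X[ix] - Y[iy]), axis=0)
        theta += lr * grad
        theta /= np.linalg.norm(theta)  # project back onto the unit sphere
    x_proj, y_proj = np.sort(X @ theta), np.sort(Y @ theta)
    return np.sqrt(np.mean((x_proj - y_proj) ** 2)), theta
```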

4. Statistical Inference and Asymptotics

A rigorous statistical theory is available for both the sample SW estimator and its use in inference. For $p > 1$, under mild regularity (e.g., Sobolev–Jacobian control, absolute continuity, and connected support), a Central Limit Theorem holds:

$$\sqrt{n}\left( SW_p^p(\widehat{P}_n, Q) - SW_p^p(P, Q)\right) \xrightarrow{d} N(0, v_{P,Q}^2),$$

where $v_{P,Q}^2$ is computable in terms of projected optimal transport potentials (Rodríguez-Vítores et al., 24 Mar 2025). Notably, bias of the empirical estimator vanishes at the desired rate, enabling valid confidence intervals and hypothesis testing even for non-compactly supported distributions. Monte Carlo slicing errors are quantifiable, and variance estimation strategies for both data and projection sampling are established.
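
As a hedged illustration of how the estimator can be used for inference (not the variance construction of Rodríguez-Vítores et al.), one can pair the Monte Carlo estimate of $SW_p^p$ with a simple nonparametric bootstrap over the sample from $P$, holding $Q$ and the projection directions fixed; all names below are illustrative:

```python
import numpy as np

def sw_pp(X, Y, thetas, p=2):
    """SW_p^p with a fixed set of directions; assumes equal sample sizes."""
    return float(np.mean([np.mean(np.abs(np.sort(X @ t) - np.sort(Y @ t)) ** p)
                          for t in thetas]))

def sw_bootstrap_ci(X, Y, n_projections=500, n_boot=200, alpha=0.05, seed=None):
    """Percentile bootstrap interval for SW_p^p, resampling only the sample from P."""
    rng = np.random.default_rng(seed)
    thetas = rng.standard_normal((n_projections, X.shape[1]))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    n = X.shape[0]
    boots = [sw_pp(X[rng.integers(0, n, size=n)], Y, thetas) for _ in range(n_boot)]
    return np.quantile(boots, [alpha / 2, 1 - alpha / 2])
```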

5. Robustness, Regularity, and Topological Considerations

SW inherits or improves upon many favorable properties of $W_p$:

  • Robustness: Under contamination, minimax rates for SW and Max-SW can be dimension-free (for $p=1$), and robust mean estimation techniques immediately yield robust SW estimators (Nietert et al., 2022).
  • Weak Convergence and Metric Equivalence: SW metrizes weak convergence (together with convergence of $p$th moments), and on compact supports is topologically equivalent to $W_p$ (with dimension-dependent equivalence constants) (Kolouri et al., 2017, Nietert et al., 2022).
  • Gradient Flows and Critical Points: The SW objective admits a variational calculus with well-posed Wasserstein gradient flows, useful in generative modeling and particle transport. Only absolutely continuous targets yield stable critical points; measures concentrated on affine subspaces or segments are unstable in the SW sense. Particle approximations and descent schemes are convergent and numerically stable when step sizes are chosen appropriately (Vauthier et al., 10 Feb 2025); a schematic particle update is sketched after this list.
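
A schematic numpy sketch of one explicit Euler step of such a particle scheme for $p=2$, moving particles X toward target samples Y along an SW gradient estimated with fresh random projections (an illustration of the general idea, not the exact discretization analyzed in the cited work; names are illustrative):

```python
import numpy as np

def sw_particle_step(X, Y, n_projections=100, step=0.5, seed=None):
    """One explicit Euler step of particles X toward target samples Y under SW_2^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    thetas = rng.standard_normal((n_projections, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    grad = np.zeros_like(X)
    for theta in thetas:
        x_proj, y_proj = X @ theta, Y @ theta
        order_x = np.argsort(x_proj)
        matched = np.sort(y_proj)            # optimal 1-D assignment by rank
        # Gradient of (1/n) * sum_i (x_proj_(i) - y_proj_(i))^2 w.r.t. each particle.
        residual = np.empty(n)
        residual[order_x] = x_proj[order_x] - matched
        grad += (2.0 / n) * residual[:, None] * theta[None, :]
    grad /= n_projections
    return X - step * grad
```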

6. Applications, Extensions, and Empirical Performance

SW has seen wide deployment:

  • Generative modeling: As a loss for training GANs, VAEs, and auto-encoders, both standard and augmented/learned SW variants produce state-of-the-art sample quality and efficiency, especially in high dimension (Wu et al., 2017, Kolouri et al., 2017, Chen et al., 2020); a differentiable loss sketch follows this list.
  • Distributional regression: SW enables both global and slice-wise Fréchet regression schemes for conditional multivariate distributions, facilitating interpretable density regression in settings with structured predictors (Chen et al., 2023).
  • Wasserstein regression surrogates: Linear regression of $W_p$ on SW-based features enables few-shot fast approximation of intractable pairwise OT costs, robustly outperforming embedding models when ground-truth is scarce (Nguyen et al., 24 Sep 2025).
  • Multi-marginal and meta-distributional OT: Sliced variants extend naturally to multi-marginal tasks (SMW), barycenters, meta-measures, and Wasserstein-over-Wasserstein analogs (DSW), yielding efficient, scalable metrics on distributions over distributions (Cohen et al., 2021, Piening et al., 26 Sep 2025).
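
For instance, a differentiable $SW_2^2$ training loss can be written in a few lines of PyTorch by sorting projected batches; the sketch below illustrates the general recipe rather than a specific published architecture, and `generator`, `real_batch`, and related names are hypothetical:

```python
import torch

def sliced_wasserstein_loss(x, y, n_projections=50):
    """SW_2^2 between two batches x, y of shape (batch, dim); differentiable via sorting."""
    d = x.shape[1]
    thetas = torch.randn(n_projections, d, device=x.device)
    thetas = thetas / thetas.norm(dim=1, keepdim=True)
    x_proj = x @ thetas.T            # (batch, n_projections)
    y_proj = y @ thetas.T
    x_sorted, _ = torch.sort(x_proj, dim=0)
    y_sorted, _ = torch.sort(y_proj, dim=0)
    return ((x_sorted - y_sorted) ** 2).mean()

# Hypothetical training step:
# fake = generator(torch.randn(batch_size, latent_dim))
# loss = sliced_wasserstein_loss(fake, real_batch)
# loss.backward()
```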

7. Open Problems and Future Directions

Active directions include:

  • Theory of learned or nonlinear slicing functions, including injectivity and metricity for arbitrary neural parametrizations.
  • Sharp analysis of the error and bias of deterministic approximations under weak dependence or heavy-tailed data.
  • Stability and convergence rates for SW-based gradient flows in complex models.
  • Extensions to infinite-dimensional Hilbert space and Banach-space-valued data for functional data or stochastic processes (Han, 2023, Piening et al., 26 Sep 2025).
  • Characterization of optimal and adaptive slicing distributions beyond current energy-based, max-slicing, and learned-projection frameworks.

Sliced Wasserstein distances remain a central advance in computational transport, enabling rigorous inference and large-scale learning in high dimension and with structured or distributional data, while retaining both robust statistical properties and high computational efficiency (Rodríguez-Vítores et al., 24 Mar 2025, Kolouri et al., 2017, Nietert et al., 2022, Vauthier et al., 10 Feb 2025, Nguyen et al., 2020, Nguyen et al., 24 Sep 2025, Nguyen et al., 2022, Chen et al., 2023, Nguyen et al., 2023, Chen et al., 2020).
