Sliced Wasserstein Distance
- Sliced Wasserstein Distance is a metric that compares probability measures by projecting high-dimensional data onto one-dimensional subspaces for efficient computation.
- This approach leverages the Radon transform and quantile functions to achieve closed-form one-dimensional optimal transport, resulting in computational scalability and robust estimation.
- SW is widely applied in generative modeling, clustering, regression, and streaming analytics, with ongoing research into adaptive slicing and hierarchical extensions.
The Sliced Wasserstein Distance (SW) is a metric for comparing probability measures on high-dimensional spaces by leveraging the closed-form solutions available for one-dimensional optimal transport. By projecting high-dimensional distributions onto one-dimensional subspaces and aggregating the corresponding Wasserstein distances, SW achieves computational and statistical scalability superior to the classic Wasserstein distance. The mathematical and algorithmic developments around SW have led to numerous generalizations, robust estimation results, fast approximation techniques, and applications spanning machine learning, optimal transport, generative modeling, and geometric measure theory.
1. Mathematical Definition and Structure
For probability measures $\mu, \nu$ on $\mathbb{R}^d$ with finite $p$-th moments ($p \ge 1$), the sliced Wasserstein-$p$ distance is

$$SW_p(\mu, \nu) = \left( \int_{\mathbb{S}^{d-1}} W_p^p\big(\theta^*_{\#}\mu,\ \theta^*_{\#}\nu\big)\, d\sigma(\theta) \right)^{1/p},$$

where $\theta^*_{\#}\mu$ is the pushforward of $\mu$ by the projection $\theta^*(x) = \langle \theta, x \rangle$ onto direction $\theta$, $W_p$ denotes the $p$-Wasserstein distance in one dimension, and $\sigma$ is the uniform measure on the sphere $\mathbb{S}^{d-1}$. The one-dimensional Wasserstein distance is efficiently computable via quantile functions:

$$W_p^p(\mu_1, \nu_1) = \int_0^1 \big| F_{\mu_1}^{-1}(t) - F_{\nu_1}^{-1}(t) \big|^p \, dt,$$

where $F_{\mu}^{-1}$ denotes the quantile function of $\mu$.
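This definition translates directly into a simple Monte Carlo algorithm: sample directions uniformly on the sphere, project both sample sets, and read off the empirical quantile functions by sorting. The following NumPy sketch makes the computation concrete; the function and variable names are illustrative, and equal sample sizes are assumed so that sorted projections align one-to-one as quantiles.

```python
# Minimal Monte Carlo estimator of SW_p between empirical measures.
import numpy as np

def sliced_wasserstein(X, Y, n_projections=256, p=2, rng=None):
    """Monte Carlo estimate of SW_p between empirical measures on R^d.

    X, Y : (n, d) arrays with equal sample sizes, so that sorted
    projections match one-to-one as empirical quantiles.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # Uniform random directions on the sphere S^{d-1}.
    theta = rng.standard_normal((n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Each column of the projected arrays is a 1-D empirical measure.
    X_proj = X @ theta.T   # shape (n, L)
    Y_proj = Y @ theta.T
    # Sorting yields the empirical quantile functions; the 1-D W_p is
    # the L^p distance between quantile functions.
    X_sorted = np.sort(X_proj, axis=0)
    Y_sorted = np.sort(Y_proj, axis=0)
    return np.mean(np.abs(X_sorted - Y_sorted) ** p) ** (1.0 / p)

# Example: two Gaussians with shifted means.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
Y = rng.standard_normal((1000, 10)) + 0.5
print(sliced_wasserstein(X, Y))
```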
The SW distance inherits non-negativity, symmetry, and the identity of indiscernibles from $W_p$, and is a true metric under mild assumptions, notably injectivity of the Radon transform (Kolouri et al., 2019).
2. Relation to the Radon Transform and Generalizations
The computation of SW relies fundamentally on the Radon transform,

$$\mathcal{R}f(t, \theta) = \int_{\mathbb{R}^d} f(x)\, \delta\big(t - \langle x, \theta \rangle\big)\, dx, \qquad (t, \theta) \in \mathbb{R} \times \mathbb{S}^{d-1},$$

which maps $d$-dimensional measures to their one-dimensional projections. This linearization enables closed-form solutions to otherwise intractable optimal transport calculations.
Generalizations such as the Generalized Sliced Wasserstein (GSW) distance replace the linear dot product $\langle x, \theta \rangle$ with nonlinear defining functions $g(x, \theta)$. The corresponding generalized Radon transform extends the expressivity of slicing, enabling SW-like distances to adapt to nonlinear structures in high-dimensional data while ensuring metricity when the transform is injective (Kolouri et al., 2019).
Nonlinear slicing induces metrics sensitive to higher-order interactions and manifold structure. The maximum GSW (max-GSW) further sharpens this expressivity by selecting the most discriminative slicing direction, albeit at increased computational cost.
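As a minimal illustration of generalized slicing, the sketch below accepts an arbitrary defining function $g(x, \theta)$ and can keep only the most discriminative sampled slice as a crude stand-in for max-GSW's optimization over directions. The odd-polynomial defining function and all names are our own illustrative choices, not the cited paper's implementation.

```python
# Generalized slicing with a pluggable defining function g(X, theta).
import numpy as np

def generalized_sw(X, Y, g, n_projections=256, p=2, use_max=False, rng=None):
    """GSW_p (or a crude max-GSW over sampled directions) between
    equal-size empirical measures X, Y of shape (n, d)."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    theta = rng.standard_normal((n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    dists = np.empty(n_projections)
    for l in range(n_projections):
        x1 = np.sort(g(X, theta[l]))   # quantiles of the 1-D pushforward
        y1 = np.sort(g(Y, theta[l]))
        dists[l] = np.mean(np.abs(x1 - y1) ** p)
    # max-GSW keeps only the most discriminative sampled slice
    # (a stand-in for optimizing over theta directly).
    return (dists.max() if use_max else dists.mean()) ** (1.0 / p)

linear = lambda X, th: X @ th            # recovers ordinary SW
odd_poly = lambda X, th: (X ** 3) @ th   # a nonlinear (odd polynomial) slice

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((500, 5)), rng.standard_normal((500, 5)) * 1.5
print(generalized_sw(X, Y, linear), generalized_sw(X, Y, odd_poly))
```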
3. Analytical, Geometric, and Statistical Properties
The geometry of the SW space differs substantially from that of classic Wasserstein space. For measures with densities bounded above and below by multiples of a fixed reference measure, $SW_2$ is quantitatively equivalent to the homogeneous negative Sobolev norm $\dot{H}^{-(d+1)/2}$:

$$c\, \|\mu - \nu\|_{\dot{H}^{-(d+1)/2}} \;\le\; SW_2(\mu, \nu) \;\le\; C\, \|\mu - \nu\|_{\dot{H}^{-(d+1)/2}},$$

with constants $c, C$ determined by the density bounds and the reference measure (Park et al., 2023). When measures are close to being discrete, SW approaches a constant multiple of the Wasserstein metric.
The tangent space structure of the SW space can be described via quadratic forms linked to Benamou–Brenier energies, but SW does not yield a geodesic space: its intrinsic "length" metric is strictly larger than the SW distance itself, although geodesic curves minimizing this length exist (Park et al., 2023).
Empirical convergence rates for SW are nearly parametric: even for high-dimensional log-concave measures, the expected empirical estimation error scales as $O(n^{-1/2})$ in the number of samples $n$ (Nietert et al., 2022). The sample complexity is dimension-free, and robust estimation under contamination does not suffer from the curse of dimensionality typical of classical Wasserstein estimation.
Central limit theorems for the empirical SW distance, viewed as a functional of the empirical cumulative distribution function (CDF) process, have been obtained using Banach-space CLT techniques, and $P$-Donsker properties for the function classes associated with max-sliced Wasserstein yield asymptotic normality; together these results provide the statistical underpinnings for nonparametric two-sample testing (Xu et al., 2022).
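To make the testing connection concrete, here is a hedged sketch of a permutation two-sample test built on a Monte Carlo $SW_2^2$ statistic. The cited work calibrates via asymptotic or bootstrap distributions; label permutation is used below purely for illustration, and all names are ours.

```python
# Permutation two-sample test with an SW_2^2 statistic (illustrative).
import numpy as np

def sw2(X, Y, L=128, rng=None):
    """Monte Carlo estimate of SW_2^2 between equal-size samples."""
    rng = np.random.default_rng(rng)
    theta = rng.standard_normal((L, X.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    return np.mean((np.sort(X @ theta.T, axis=0)
                    - np.sort(Y @ theta.T, axis=0)) ** 2)

def sw_permutation_test(X, Y, n_perm=500, rng=None):
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # Fix the projection seed so all statistics share the same slices.
    observed = sw2(X, Y, rng=0)
    pooled = np.vstack([X, Y])
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        count += sw2(pooled[idx[:n]], pooled[idx[n:]], rng=0) >= observed
    return (1 + count) / (1 + n_perm)   # permutation p-value

rng = np.random.default_rng(1)
X, Y = rng.standard_normal((200, 8)), rng.standard_normal((200, 8)) + 0.3
print(sw_permutation_test(X, Y))   # small p-value: means differ
```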
4. Computational Methods and Scalability
A defining benefit of SW is computational scalability. Each one-dimensional projection requires an $O(n \log n)$ sort for $n$ samples, and the ambient dimension $d$ enters only through the $O(nd)$ projection step. Monte Carlo estimation over random directions is standard, with explicit error-probability bounds and rates for the number of projections $L$ needed to achieve a prescribed accuracy (Xu et al., 2022, Nietert et al., 2022).
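A quick numerical illustration of the Monte Carlo behavior (all names are ours): repeating the estimate at increasing projection counts $L$ shows the spread shrinking at roughly the $L^{-1/2}$ rate.

```python
# Spread of repeated SW_2^2 estimates shrinks roughly like L^{-1/2}.
import numpy as np

def sw2_hat(X, Y, L, rng):
    theta = rng.standard_normal((L, X.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    return np.mean((np.sort(X @ theta.T, axis=0)
                    - np.sort(Y @ theta.T, axis=0)) ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 20))
Y = rng.standard_normal((2000, 20)) + 0.3
for L in (8, 32, 128, 512):
    reps = [sw2_hat(X, Y, L, rng) for _ in range(50)]
    print(f"L={L:4d}  mean={np.mean(reps):.4f}  std={np.std(reps):.4f}")
```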
Advanced estimators improve variance and computational cost:
- Deterministic Approximations: By concentration of measure, high-dimensional projections of random vectors are approximately Gaussian. The closed-form distance between projected Gaussians can thus replace the Monte Carlo average for SW, with provable approximation errors that vanish as the dimension $d \to \infty$ (Nadjahi et al., 2021).
- Control Variates: Explicit variance reductions based on Gaussian approximations of projected measures or polynomial expansions (such as using spherical harmonics as control variates) yield faster convergence and reduced estimator error (Nguyen et al., 2023, Leluc et al., 2 Feb 2024).
- Streaming Approaches: Streaming Sliced Wasserstein (Stream-SW) adapts the distance to settings where samples arrive sequentially and full storage is intractable. By leveraging quantile sketches for the one-dimensional projections, Stream-SW achieves low memory complexity and theoretical guarantees on streaming estimation error (Nguyen, 11 May 2025); a simplified bounded-memory sketch follows this list.
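The following sketch illustrates the bounded-memory structure of streaming slicing under simplifying assumptions: projection directions are fixed up front, and a plain reservoir sample stands in for the quantile sketches that give Stream-SW its formal guarantees. Class and function names are ours.

```python
# Bounded-memory streaming slicing: reservoir per projection (illustrative).
import numpy as np

class StreamingSlicedSketch:
    def __init__(self, d, n_projections=64, capacity=512, seed=0):
        rng = np.random.default_rng(seed)
        self.theta = rng.standard_normal((n_projections, d))
        self.theta /= np.linalg.norm(self.theta, axis=1, keepdims=True)
        self.capacity = capacity
        self.buf = np.empty((n_projections, capacity))  # per-slice reservoir
        self.seen = 0
        self.rng = rng

    def update(self, x):
        """Absorb one sample x in R^d via reservoir sampling per slice."""
        proj = self.theta @ x          # one value per projection
        if self.seen < self.capacity:
            self.buf[:, self.seen] = proj
        else:
            j = self.rng.integers(0, self.seen + 1)
            if j < self.capacity:
                self.buf[:, j] = proj
        self.seen += 1

    def quantiles(self, qs):
        k = min(self.seen, self.capacity)
        return np.quantile(self.buf[:, :k], qs, axis=1)  # (len(qs), L)

def stream_sw2(sk_a, sk_b, n_quantiles=128):
    """Approximate SW_2^2 from two sketches sharing the same directions."""
    qs = (np.arange(n_quantiles) + 0.5) / n_quantiles
    return np.mean((sk_a.quantiles(qs) - sk_b.quantiles(qs)) ** 2)

rng = np.random.default_rng(1)
a, b = StreamingSlicedSketch(10), StreamingSlicedSketch(10)  # same seed/slices
for _ in range(5000):
    a.update(rng.standard_normal(10))
    b.update(rng.standard_normal(10) + 0.5)
print(stream_sw2(a, b))
```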
Hierarchical, Markovian, and distributional variants of SW further improve projection efficiency, reduce redundancy, and adaptively select informative projection directions with theoretical guarantees, often leveraging PAC-Bayesian or optimization-driven frameworks for adaptive slicing (Ohana et al., 2022, Nguyen et al., 2023, Nguyen et al., 2020).
5. Extensions: Spherical, Hierarchical, and Nonlinear Slicing
Spherical Sliced-Wasserstein (SSW) adapts the projection-slicing paradigm to manifold-valued data, notably the hypersphere $\mathbb{S}^{d-1}$. By projecting onto great circles and leveraging closed-form solutions for the Wasserstein distance on the circle $S^1$, SSW extends SW to data and applications not amenable to Euclidean slicing, including density estimation on the sphere and hyperspherical representation learning (Bonet et al., 2022).
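A minimal sketch of the spherical construction, assuming the circular shift formula $W_1 = \int_0^1 |F_\mu(t) - F_\nu(t) - m|\, dt$ (with $m$ a median of the CDF difference), discretized here on a grid. Great-circle projections are obtained from random 2-planes; names and grid sizes are illustrative.

```python
# Spherical slicing: random great circles + circular W1 (illustrative).
import numpy as np

def circular_w1(alpha, beta, grid=512):
    """Approximate W1 on S^1 between angle samples in [0, 2*pi),
    in circle-fraction units, via the CDF-shift formula on a grid."""
    t = (np.arange(grid) + 0.5) / grid
    F_a = np.searchsorted(np.sort(alpha) / (2 * np.pi), t) / len(alpha)
    F_b = np.searchsorted(np.sort(beta) / (2 * np.pi), t) / len(beta)
    diff = F_a - F_b
    return np.mean(np.abs(diff - np.median(diff)))

def spherical_sw1(X, Y, n_slices=128, rng=None):
    """Monte Carlo SSW_1 between unit-norm samples on S^{d-1}."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_slices):
        U, _ = np.linalg.qr(rng.standard_normal((d, 2)))  # random 2-plane
        # Geodesic projection to the great circle = angle in the plane.
        ax = np.arctan2(X @ U[:, 1], X @ U[:, 0]) % (2 * np.pi)
        ay = np.arctan2(Y @ U[:, 1], Y @ U[:, 0]) % (2 * np.pi)
        total += circular_w1(ax, ay)
    return total / n_slices

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5)); X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = rng.standard_normal((500, 5)) + 1.0; Y /= np.linalg.norm(Y, axis=1, keepdims=True)
print(spherical_sw1(X, Y))
```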
Data-adaptive weightings of projection directions—such as in Discriminative Spherical Sliced Wasserstein (DSSW)—use deterministic or neural energy functions applied to projected Wasserstein distances to enhance discrimination and task-specific sensitivity without substantial computational overhead (Zhang et al., 26 Dec 2024).
Tree-sliced Wasserstein (TSW) replaces one-dimensional projections with tree metric spaces, incorporating splitting mechanisms across branches and enabling nonlinear projections that capture richer topological and positional information, applicable in both Euclidean and spherical domains (Tran et al., 2 May 2025).
Hierarchical Sliced Wasserstein (HSW) employs a two-stage projection process, combining bottleneck and mixing projections to reduce computational and memory complexity in very high-dimensional or data-sparse settings (Nguyen et al., 2022).
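A compact sketch of the two-stage idea (names and constants are illustrative): project once through $k$ bottleneck directions, then form many slices by cheap mixing in $\mathbb{R}^k$, so generating and applying $L$ slices costs roughly $O(ndk + nkL)$ rather than $O(ndL)$ when $k \ll d$.

```python
# Two-stage (bottleneck + mixing) projections for hierarchical slicing.
import numpy as np

def hierarchical_sw2(X, Y, k=8, n_projections=256, rng=None):
    """Monte Carlo estimate of a squared SW_2 using hierarchical slices."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    B = rng.standard_normal((d, k))                 # bottleneck projections
    M = rng.standard_normal((k, n_projections))     # mixing projections
    scale = np.linalg.norm(B @ M, axis=0)           # normalize implied slices
    # One O(ndk) bottleneck pass, then O(nkL) mixing, instead of
    # projecting onto L full d-dimensional directions at O(ndL).
    Xp = np.sort((X @ B) @ M / scale, axis=0)
    Yp = np.sort((Y @ B) @ M / scale, axis=0)
    return np.mean((Xp - Yp) ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 512))
Y = rng.standard_normal((1000, 512)) * 1.2
print(hierarchical_sw2(X, Y))
```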
6. Applications: Learning, Modeling, and Inference
The SW distance is implemented as a core loss, similarity, or regularization term in tasks requiring probabilistic or geometric fidelity:
- Generative Modeling: SW (and its generalizations) is used in adversarial frameworks, minimum expected distance estimation, and autoencoders for matching generated and target distributions. Adaptive or energy-based projection selection reduces the number of projections needed and improves sample quality and FID scores (Kolouri et al., 2015, Nguyen et al., 2020, Nguyen et al., 2023, Tran et al., 2 May 2025).
- Clustering and Dimensionality Reduction: Positive definite kernels derived from SW can be used as similarity measures in kernel k-means and kernel PCA, providing more effective clustering and representations than standard Euclidean or RBF kernels (Kolouri et al., 2015).
- Regression and Domain Adaptation: Distribution-to-distribution regression frameworks leverage SW for regression of multivariate distributions given Euclidean predictors, employing slicing transforms as a theoretical foundation (Chen et al., 2023).
- Gradient Flows and Optimization: SW provides tractable and geometrically meaningful objective functionals for particle-based optimization and flow-based methods, with rigorous analysis of stability, critical point structure, and avoidance of lower-dimensional degeneracies (Vauthier et al., 10 Feb 2025); a particle-descent sketch follows this list.
- Statistical Tests: SW and its max variants give rise to two-sample test statistics with dimension-free asymptotics and bootstrapped significance evaluation (Xu et al., 2022).
- Streaming Analytics: Stream-SW enables one-pass, low-memory adaptability to streaming data in tasks ranging from point-cloud analysis to change point detection (Nguyen, 11 May 2025).
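To illustrate the gradient-flow item above, the sketch below (an illustrative particle scheme, not the cited papers' exact analysis) descends a Monte Carlo $SW_2^2$ objective: for fixed directions and sort orders the objective is quadratic in the particles, so its gradient is available in closed form.

```python
# Particle descent on a Monte Carlo SW_2^2 objective (illustrative).
import numpy as np

def sw2_grad(X, Y, theta):
    """Gradient of the MC estimate of SW_2^2(X, Y) w.r.t. particles X."""
    n, L = X.shape[0], theta.shape[0]
    grad = np.zeros_like(X)
    for th in theta:
        px = X @ th
        ix = np.argsort(px)                     # sort order of the slice
        resid = px[ix] - np.sort(Y @ th)        # matched quantile residuals
        grad[ix] += (2.0 / (n * L)) * resid[:, None] * th[None, :]
    return grad

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2)) * 0.1                    # source particles
Y = rng.standard_normal((500, 2)) + np.array([3.0, 0.0])   # target cloud
for step in range(200):
    theta = rng.standard_normal((32, 2))        # fresh random slices
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    X -= 25.0 * sw2_grad(X, Y, theta)           # explicit Euler step
print(np.mean(X, axis=0))   # particle mean approaches the target's (~[3, 0])
```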
7. Open Problems and Future Directions
Despite these advances, several fundamental questions remain active areas of research:
- Geometry of SW spaces: Detailed characterization of geodesics, convexity, and the analytic properties of SW length metrics, as well as connections to higher-order Sobolev norms and nonlocal partial differential equations describing SW gradient flows (Park et al., 2023, Vauthier et al., 10 Feb 2025).
- Adaptive and Learned Slicing Distributions: Data-dependent, discriminative, and energy-based projection schemes for SW and SSW are still being actively refined, with theoretical and empirical trade-offs between discrimination, generalization, and computational cost (Ohana et al., 2022, Zhang et al., 26 Dec 2024, Nguyen et al., 2023).
- Variance Reduction and Deterministic Approximation: Optimal design of control variate and deterministic approximation methods (including leveraging higher-order moments and spherical harmonics) for reduced estimator variance in large-scale and high-dimensional settings (Nguyen et al., 2023, Leluc et al., 2 Feb 2024).
- Nonlinear, Hierarchical, and Tree-Based Slicing: The impact and limitations of extending slicing to nonlinear or tree-based systems, ensuring metricity and stability, and further reducing computational complexity in structure-rich domains (Tran et al., 2 May 2025, Nguyen et al., 2022).
- Streaming and Online Methods: Extending streaming SW techniques to multivariate and manifold-valued data and integrating them with real-time inference and adaptive learning pipelines (Nguyen, 11 May 2025).
These directions underscore the ongoing integration of mathematical analysis, geometric measure theory, algorithmic scalability, and application-driven design in the evolving theory and use of the Sliced Wasserstein Distance.