Sliced Wasserstein Distance
- Sliced Wasserstein Distance (SWD) is a metric defined by integrating one-dimensional Wasserstein distances over random projections of high-dimensional probability measures.
- It offers computational efficiency through Monte Carlo approximations and sorting-based quantile estimation, achieving dimension-free statistical rates and robust performance.
- Advanced variants like Max-Sliced, Distributional, and Energy-Based SWD adapt the methodology for enhanced discrimination in applications such as generative modeling and geometric data analysis.
The Sliced Wasserstein Distance (SWD) is a computational optimal transport metric that leverages one-dimensional projections to define tractable discrepancies between probability measures on high-dimensional spaces. SWD enjoys genuine metric structure, efficient Monte Carlo approximations via quantile sorting, and dimension-free statistical rates. Its fundamental properties and extensions have made it a core method in statistical machine learning, generative modeling, geometric data analysis, and scalable optimal transport.
1. Formal Definition and Metric Structure
Let $\mu, \nu$ be two Borel probability measures with finite $p$-th moments on $\mathbb{R}^d$. For a unit vector $\theta \in \mathbb{S}^{d-1}$ (the unit sphere in $\mathbb{R}^d$), denote by $\pi_\theta(x) = \langle \theta, x \rangle$ the one-dimensional projection. The pushforward (marginal) measures are $\mu_\theta = (\pi_\theta)_\# \mu$, $\nu_\theta = (\pi_\theta)_\# \nu$. The one-dimensional $p$-Wasserstein distance has a closed form in terms of quantile functions,

$$W_p^p(\mu_\theta, \nu_\theta) = \int_0^1 \left| F_{\mu_\theta}^{-1}(t) - F_{\nu_\theta}^{-1}(t) \right|^p \, dt,$$

where $F^{-1}$ is the quantile function. The $p$-Sliced Wasserstein distance is defined as

$$SW_p(\mu, \nu) = \left( \int_{\mathbb{S}^{d-1}} W_p^p(\mu_\theta, \nu_\theta) \, d\sigma(\theta) \right)^{1/p},$$

with $\sigma$ the uniform measure over $\mathbb{S}^{d-1}$ (Vauthier et al., 10 Feb 2025).
SWD is a bona fide metric on $\mathcal{P}_p(\mathbb{R}^d)$ (the space of probability measures with finite $p$-th moment), satisfying non-negativity, symmetry, definiteness, and the triangle inequality. SWD metrizes the weak topology induced by $W_p$, and for compactly supported measures there exist constants $c, C > 0$ (depending on the compact support) such that $c\, SW_p(\mu, \nu) \le W_p(\mu, \nu) \le C\, SW_p(\mu, \nu)^{1/(d+1)}$ (Chen et al., 2023).
2. Computational and Statistical Properties
Computational advantages are central to the appeal of SWD. Each slice (projection) reduces the high-dimensional OT problem to a one-dimensional case; for discrete measures supported on $n$ points, the 1D $W_p$ requires $O(n \log n)$ (sorting-based) complexity per slice. Approximating the integral by averaging over $L$ random projections (Monte Carlo estimation) yields $O(L\, n \log n)$ overall complexity, independent of $d$ except for the $O(L\, n\, d)$ cost to project.
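To make the per-slice cost concrete, the following is a minimal NumPy sketch (not taken from any of the cited implementations) of the sorting-based 1D $W_p$ between two equally sized, uniformly weighted samples; the function name `wasserstein_1d` and the equal-size assumption are illustrative choices.

```python
import numpy as np

def wasserstein_1d(u: np.ndarray, v: np.ndarray, p: float = 2.0) -> float:
    """Sorting-based 1D p-Wasserstein distance between two equally sized,
    uniformly weighted empirical samples (O(n log n) per call)."""
    assert u.shape == v.shape, "sketch assumes equal sample sizes and uniform weights"
    u_sorted = np.sort(u)   # empirical quantile function of the first sample
    v_sorted = np.sort(v)   # empirical quantile function of the second sample
    return float(np.mean(np.abs(u_sorted - v_sorted) ** p) ** (1.0 / p))
```

For unequal sample sizes or non-uniform weights, the same idea applies by evaluating both empirical quantile functions on a common grid.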
Sample complexity of SWD is dimension-free: for log-concave distributions with covariance $\Sigma$, the empirical SWD converges at rate

$$\mathbb{E}\big[ SW_p(\hat{\mu}_n, \mu) \big] \;\lesssim\; C_{p,\Sigma}\, n^{-1/(2p)}$$

(with possible logarithmic factors), where the constant $C_{p,\Sigma}$ depends on the covariance but not on the ambient dimension $d$ (Nietert et al., 2022). This circumvents the curse of dimensionality inherent to classical $W_p$.
Furthermore, for empirical SWD estimators based on $L$ Monte Carlo projections, the expected error decays as $O(L^{-1/2})$, with constants also dimension-free or even improving in $d$ under matching means and identity covariance, an instance of the "blessing of dimensionality" (Nietert et al., 2022). Concentration inequalities and limit theorems (including Banach-space CLTs) have been established in the case $p = 1$ (Xu et al., 2022).
3. Generalizations and Variants
Several extensions and variants of classical SWD have been developed to focus on informative projections or adapt to non-Euclidean settings.
Max-Sliced Wasserstein (MSW):

$$\mathrm{MSW}_p(\mu, \nu) = \max_{\theta \in \mathbb{S}^{d-1}} W_p(\mu_\theta, \nu_\theta)$$

emphasizes the largest discrepancy direction. While affording greater discrimination power, estimation and optimization are typically nonconvex (Nietert et al., 2022, Xu et al., 2022).
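As an illustration of the nonconvex optimization involved, here is a minimal sketch (assumptions: equal-size uniformly weighted samples, $p = 2$, plain projected gradient ascent with a fixed step size; the function name `max_sliced_w2` is hypothetical) that ascends over the direction $\theta$ while holding the 1D sort orders fixed at each step.

```python
import numpy as np

def max_sliced_w2(X: np.ndarray, Y: np.ndarray, steps: int = 200,
                  lr: float = 0.1, seed: int = 0) -> float:
    """Projected gradient ascent for max-sliced W2 between two equal-size samples.
    At each step the 1D couplings (sort orders) are frozen, the quadratic objective
    is differentiated in theta, and theta is renormalized onto the sphere."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = rng.standard_normal(d)
    theta /= np.linalg.norm(theta)
    for _ in range(steps):
        px, py = X @ theta, Y @ theta
        ix, iy = np.argsort(px), np.argsort(py)          # optimal 1D matching by ranks
        diff_proj = px[ix] - py[iy]                      # signed 1D displacements
        grad = 2.0 / n * (X[ix] - Y[iy]).T @ diff_proj   # gradient of mean squared displacement
        theta = theta + lr * grad
        theta /= np.linalg.norm(theta)                   # project back onto the unit sphere
    px, py = X @ theta, Y @ theta
    return float(np.sqrt(np.mean((np.sort(px) - np.sort(py)) ** 2)))
```

Because the objective is nonconvex in $\theta$, restarts from several random initial directions are a common safeguard.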
Adaptive and Distributional Sliced Wasserstein: Rather than averaging over the uniform measure, adaptive schemes learn a slicing measure concentrated on discriminative directions. Distributional Sliced Wasserstein (DSW) finds a measure $\sigma$ on $\mathbb{S}^{d-1}$ maximizing

$$\mathrm{DSW}_p(\mu, \nu) = \sup_{\sigma \in \mathcal{M}_C} \left( \int_{\mathbb{S}^{d-1}} W_p^p(\mu_\theta, \nu_\theta) \, d\sigma(\theta) \right)^{1/p},$$

where the constraint set $\mathcal{M}_C$ penalizes directional alignment among the sampled slices. DSW interpolates between classic SW (uniform $\sigma$) and MSW (atomic $\sigma$) (Nguyen et al., 2020).
Energy-Based SW (EBSW): The slicing law is specified via an energy function of the projected Wasserstein cost, $\sigma_{\mu,\nu}(\theta) \propto f\big(W_p^p(\mu_\theta, \nu_\theta)\big)$ for monotone increasing $f$, concentrating samples on informative directions with no parametric optimization required. It is a semi-metric, satisfies $SW_p \le \mathrm{EBSW}_p \le \mathrm{MSW}_p$, and admits practical estimators with similar computational complexity to SW (Nguyen et al., 2023).
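As a rough illustration of the idea (a sketch in the spirit of energy-based slicing, not the estimator from the EBSW paper): directions are sampled uniformly, each slice is weighted by an energy of its projected cost, and a self-normalized weighted average is returned. The exponential energy $f(t) = e^{t}$ and the function name `energy_weighted_sw2` are assumptions of this sketch.

```python
import numpy as np

def energy_weighted_sw2(X: np.ndarray, Y: np.ndarray, n_proj: int = 128,
                        seed: int = 0) -> float:
    """Self-normalized, energy-weighted average of sliced W2^2 costs.
    Directions are drawn uniformly; slices with larger projected cost
    receive exponentially larger weight (importance-sampling flavor)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    thetas = rng.standard_normal((n_proj, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)  # uniform directions on the sphere
    costs = np.array([
        np.mean((np.sort(X @ t) - np.sort(Y @ t)) ** 2) for t in thetas
    ])
    weights = np.exp(costs - costs.max())                    # energy f(t) = exp(t), stabilized
    weights /= weights.sum()
    return float(np.sqrt(weights @ costs))
```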
Manifold and Spherical Extensions: SWD has been defined on Cartan–Hadamard manifolds via geodesic projections and on the sphere using the spherical Radon transform, yielding analogues like CH–Sliced Wasserstein and Spherical Sliced Wasserstein (SSW) for manifold-valued data (Bonet et al., 2024, Bonet et al., 2022).
Multi-marginal SW: Sliced multi-marginal Wasserstein (SMW) integrates multi-marginal 1D OT, providing a generalized metric with dimension-free sample complexity and connections to Wasserstein barycenters, used in multitask and reinforcement learning (Cohen et al., 2021).
4. Algorithmic Developments and Practical Estimation
Monte Carlo SWD: The expectation over the sphere is approximated by iid (or orthogonal) samples $\theta_1, \dots, \theta_L \sim \sigma$, with the SWD empirical estimator

$$\widehat{SW}_p^p(\mu, \nu) = \frac{1}{L} \sum_{l=1}^{L} W_p^p(\mu_{\theta_l}, \nu_{\theta_l}).$$
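A minimal, self-contained NumPy sketch of this Monte Carlo estimator for two equally sized, uniformly weighted point clouds (the function name `sliced_wasserstein` and the defaults are illustrative, not from any of the cited implementations):

```python
import numpy as np

def sliced_wasserstein(X: np.ndarray, Y: np.ndarray, n_proj: int = 100,
                       p: float = 2.0, seed: int = 0) -> float:
    """Monte Carlo sliced p-Wasserstein distance between two (n, d) samples.
    Cost: O(n_proj * (n*d + n*log n))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    thetas = rng.standard_normal((n_proj, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)  # iid uniform directions
    proj_x = X @ thetas.T                                    # (n, n_proj) projected samples
    proj_y = Y @ thetas.T
    proj_x.sort(axis=0)                                      # per-slice empirical quantiles
    proj_y.sort(axis=0)
    per_slice = np.mean(np.abs(proj_x - proj_y) ** p, axis=0)
    return float(np.mean(per_slice) ** (1.0 / p))
```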
Variance reduction via orthogonal projections or control variates (e.g., using Gaussian approximations to projected marginals) can yield significant acceleration, especially in large-scale or deep learning settings (Rowland et al., 2019, Nguyen et al., 2023).
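For the orthogonal-projection variant, one simple construction (a sketch; the helper name `orthogonal_slices` is an assumption) replaces the iid Gaussian directions above with the columns of random orthonormal matrices obtained by QR decomposition, drawing several blocks when more than $d$ slices are needed:

```python
import numpy as np

def orthogonal_slices(n_proj: int, d: int, seed: int = 0) -> np.ndarray:
    """Draw n_proj unit directions in blocks of at most d mutually orthogonal columns."""
    rng = np.random.default_rng(seed)
    blocks, remaining = [], n_proj
    while remaining > 0:
        k = min(d, remaining)
        Q, _ = np.linalg.qr(rng.standard_normal((d, k)))  # Q has k orthonormal columns
        blocks.append(Q.T)                                # one direction per row
        remaining -= k
    return np.vstack(blocks)
```

Within a block, the orthogonality of the slices reduces the variance of the Monte Carlo average relative to iid sampling.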
Streaming and Deterministic Approximations: Recent approaches support SWD computation on streaming data via compact quantile sketches for each projection, enabling constant-memory and single-pass computation with theoretical error guarantees (Nguyen, 11 May 2025). Under high-dimensional weak dependence, the SWD of two measures can be closely approximated by a deterministic expression derived from their means and second moments, with error sublinear in the dimension $d$ and dramatic speed gains (Nadjahi et al., 2021).
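As a toy stand-in for the quantile-sketch approach (this is not the data structure of (Nguyen, 11 May 2025); a bounded uniform reservoir per slice is used instead, and all class and method names are illustrative), a streaming estimator can keep a fixed-size summary per direction and compare summaries at the end:

```python
import numpy as np

class StreamingSlicedW2:
    """Single-pass, constant-memory sliced-W2 sketch: one bounded reservoir per slice."""

    def __init__(self, thetas: np.ndarray, capacity: int = 256, seed: int = 0):
        self.thetas = thetas                      # (L, d) fixed slicing directions
        self.capacity = capacity
        self.rng = np.random.default_rng(seed)
        self.buffers = [[] for _ in range(len(thetas))]
        self.count = 0

    def update(self, batch: np.ndarray) -> None:
        """Absorb a batch of points, keeping a uniform reservoir of projections per slice."""
        proj = batch @ self.thetas.T              # (batch_size, L)
        for x_proj in proj:
            self.count += 1
            for l, val in enumerate(x_proj):
                buf = self.buffers[l]
                if len(buf) < self.capacity:
                    buf.append(val)
                else:                             # reservoir sampling: keep each point w.p. capacity/count
                    j = self.rng.integers(self.count)
                    if j < self.capacity:
                        buf[j] = val

    def distance_to(self, other: "StreamingSlicedW2") -> float:
        """Approximate SW2 from two sketches built with the same thetas."""
        qs = np.linspace(0.0, 1.0, 101)
        costs = []
        for buf_a, buf_b in zip(self.buffers, other.buffers):
            qa = np.quantile(np.asarray(buf_a), qs)   # approximate per-slice quantile functions
            qb = np.quantile(np.asarray(buf_b), qs)
            costs.append(np.mean((qa - qb) ** 2))
        return float(np.sqrt(np.mean(costs)))
```

A proper implementation would replace the reservoirs with mergeable quantile sketches, which is what yields the theoretical error guarantees cited above.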
Learning Orthogonal Projections: For integration in neural nets, both AEs and GANs can be equipped with differentiable SWD blocks, which learn a small set of orthogonal projections aligned with data discrepancies. This approach dramatically reduces the number of required projections relative to random sampling, yielding efficient and end-to-end differentiable objectives with competitive or superior performance (Wu et al., 2017).
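A minimal PyTorch-style sketch of such a block, assuming a learnable projection matrix that is re-orthonormalized by QR in the forward pass (the class name `LearnedSlicedW2` and the default number of projections are illustrative, not the architecture of (Wu et al., 2017)):

```python
import torch
import torch.nn as nn

class LearnedSlicedW2(nn.Module):
    """Differentiable sliced-W2 loss over k learned, orthonormalized projections (k <= dim)."""

    def __init__(self, dim: int, n_proj: int = 16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, n_proj))  # raw directions, learned end-to-end

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        q, _ = torch.linalg.qr(self.weight)       # orthonormal columns (differentiable QR)
        proj_x, _ = torch.sort(x @ q, dim=0)      # per-slice sorted projections
        proj_y, _ = torch.sort(y @ q, dim=0)
        return ((proj_x - proj_y) ** 2).mean()    # sliced-W2^2 averaged over learned slices
```

In practice the projection parameters are typically trained to maximize the discrepancy while the generator or encoder minimizes it (an adversarial inner loop), which is omitted from this sketch.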
5. Gradient Flows, Stability, and Optimization
Minimizing the functional $\mathcal{F}(\mu) = \tfrac{1}{2}\, SW_2^2(\mu, \nu)$ via gradient flows defines a formal Wasserstein gradient flow

$$\partial_t \mu_t + \operatorname{div}(\mu_t\, v_t) = 0,$$

with explicit velocity field $v_t$ defined via barycentric projections of the 1D optimal plans across all directions. Existence, uniqueness, and convergence properties are governed by the semi-convexity of $\mathcal{F}$ along Wasserstein geodesics. In particular, measures that concentrate mass along line segments cannot be stable critical points; any such configuration is a saddle, not a local minimum. Discrete gradient flows admit particle approximations with provable monotonicity and collision-avoidance, and only the target measure can be a stable absolutely continuous critical point under mild regularity (Vauthier et al., 10 Feb 2025).
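A minimal particle-level sketch of one explicit Euler step of such a flow for empirical measures (assumptions: $p = 2$, equal numbers of source particles and target samples, Monte Carlo slices redrawn at each step; the function name `sw2_flow_step` is illustrative):

```python
import numpy as np

def sw2_flow_step(X: np.ndarray, Y: np.ndarray, n_proj: int = 64,
                  step: float = 1.0, rng=None) -> np.ndarray:
    """One explicit Euler step of a particle approximation of the SW2 gradient flow.
    Each slice matches particles to targets by rank and pushes each particle along
    the slice direction by its signed 1D displacement, averaged over slices."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    thetas = rng.standard_normal((n_proj, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    velocity = np.zeros_like(X)
    for theta in thetas:
        px, py = X @ theta, Y @ theta
        ix, iy = np.argsort(px), np.argsort(py)
        disp = np.empty(n)
        disp[ix] = py[iy] - px[ix]          # move each particle toward its rank-matched target
        velocity += np.outer(disp, theta)   # displacement applied along the slice direction
    velocity /= n_proj
    return X + step * velocity
```

Iterating this step transports the particle cloud `X` toward `Y`; up to dimension-dependent scaling constants, the averaged per-slice displacement plays the role of the barycentric velocity field described above.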
6. Robustness, Statistical Guarantees, and Applications
Robustness: SWD exhibits minimax-optimal, dimension-free robust estimation risk under contamination, and sliced 1-Wasserstein is tightly linked to robust mean estimation—algorithms and guarantees for the latter directly transfer (Nietert et al., 2022). This robustness is critical in high-dimensional inference and contaminated or heavy-tailed settings.
Statistical Estimation and Testing: Central limit theorems, empirical process theory, and concentration inequalities for both SWD and its max-sliced variant underpin applications in hypothesis testing and model comparison (Xu et al., 2022).
Applications: SWD is now central in generative modeling (SWAE, SWGAN variants), point-cloud analysis, density estimation on manifolds (SSW), representation learning, and regression with distributional responses. Its statistical and computational scalability enable image, video, and 3D shape synthesis at high resolutions, distribution regression, and multi-task structure transfer, among others (Wu et al., 2017, Chen et al., 2023, Cohen et al., 2021, Bonet et al., 2022).
7. Limitations, Open Problems, and Practitioner Guidance
Curse of uninformative projections: In high dimensions, most random projections are nearly orthogonal to data subspaces and thus “uninformative.” Theoretical analysis shows that under a $k$-dimensional subspace model, a global rescaling of SWD by a constant factor suffices to match low-dimensional ground truth, justifying simple learning rate adjustment in practice rather than more complex slicing-adaptation schemes (Tran et al., 2024).
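A small numerical check illustrating this effect (a sketch, not an experiment from (Tran et al., 2024)): for $p = 2$ and data supported exactly in a $k$-dimensional coordinate subspace of $\mathbb{R}^d$, the ambient $SW_2^2$ equals $k/d$ times the $SW_2^2$ computed directly in the subspace, because for a uniform direction $\theta$ the subspace component of $\theta$ has expected squared norm $k/d$ and a direction independent of that norm.

```python
import numpy as np

def sw2_squared(X, Y, thetas):
    """Monte Carlo estimate of SW2^2 for equal-size samples and given unit directions."""
    proj_x = np.sort(X @ thetas.T, axis=0)
    proj_y = np.sort(Y @ thetas.T, axis=0)
    return np.mean((proj_x - proj_y) ** 2)

rng = np.random.default_rng(0)
n, k, d, n_proj = 1000, 4, 256, 5000

# Data living exactly in the first k coordinates of R^d.
Xk, Yk = rng.standard_normal((n, k)), 2.0 + 0.5 * rng.standard_normal((n, k))
X = np.hstack([Xk, np.zeros((n, d - k))])
Y = np.hstack([Yk, np.zeros((n, d - k))])

def unit_dirs(m, dim):
    T = rng.standard_normal((m, dim))
    return T / np.linalg.norm(T, axis=1, keepdims=True)

ambient = sw2_squared(X, Y, unit_dirs(n_proj, d))     # SW2^2 in R^d
subspace = sw2_squared(Xk, Yk, unit_dirs(n_proj, k))  # SW2^2 in the k-dim subspace
print(ambient / subspace, k / d)                      # ratio is approximately k/d
```

The ratio printed by this simulation matches $k/d$ up to Monte Carlo error, consistent with the claim that a single global rescaling (or learning rate adjustment) compensates for uninformative projections.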
Slice distribution learning vs. complexity: Adapting or learning the slicing distribution (PAC-SW, DSW, EBSW) can enhance contrast and learning speed but at the cost of additional complexity, tuning, or instability. Many recent findings suggest that well-chosen, classical SWD metrics—properly scaled—can match or surpass these variants in common workflows.
Parameter selection: For gradient-based learning, recommended practice is to use enough projections per minibatch for gradient stability and to tune the learning rate over several orders of magnitude, with no need for explicit subspace estimation (Tran et al., 2024).
Summary Table: Core Algorithmic Components
| Component | Typical Complexity | Role |
|---|---|---|
| 1D $W_p$ (sorting) | $O(n \log n)$ | Per projection |
| Monte Carlo SWD | $O(L\, n\, (d + \log n))$ | Averages over $L$ projections |
| Streaming SWD | Constant memory, single pass | Quantile sketches on streaming data |
| Variance-reduced SWD | Comparable to Monte Carlo SWD | Leverages control variates or orthogonal slices |
| Deterministic SWD estimate | Mean / second-moment computation | For high $d$, fast CLT-based approximation |
Practitioners should consider the structure of their data (e.g., subspace concentration), computational resources, and need for discriminative slicing when choosing between classical SWD, its adaptive variants, or control-variates and streaming schemes.
References:
- (Vauthier et al., 10 Feb 2025) Properties of Wasserstein Gradient Flows for the Sliced-Wasserstein Distance
- (Wu et al., 2017) Sliced Wasserstein Generative Models
- (Nietert et al., 2022) Statistical, Robustness, and Computational Guarantees for Sliced Wasserstein Distances
- (Nguyen et al., 2023) Energy-Based Sliced Wasserstein Distance
- (Nguyen et al., 2020) Distributional Sliced-Wasserstein and Applications to Generative Modeling
- (Ohana et al., 2022) Shedding a PAC-Bayesian Light on Adaptive Sliced-Wasserstein Distances
- (Bonet et al., 2024) Sliced-Wasserstein Distances and Flows on Cartan-Hadamard Manifolds
- (Nguyen, 11 May 2025) Streaming Sliced Optimal Transport
- (Tran et al., 2024) Understanding Learning with Sliced-Wasserstein Requires Rethinking Informative Slices
- (Rowland et al., 2019) Orthogonal Estimation of Wasserstein Distances
- (Chen et al., 2023) Sliced Wasserstein Regression
- (Cohen et al., 2021) Sliced Multi-Marginal Optimal Transport
- (Xu et al., 2022) Central limit theorem for the Sliced 1-Wasserstein distance and the max-Sliced 1-Wasserstein distance
- (Nguyen et al., 2023) Sliced Wasserstein Estimation with Control Variates
- (Nadjahi et al., 2021) Fast Approximation of the Sliced-Wasserstein Distance Using Concentration of Random Projections
- (Bonet et al., 2022) Spherical Sliced-Wasserstein
- (Nguyen et al., 24 Sep 2025) Fast Estimation of Wasserstein Distances via Regression on Sliced Wasserstein Distances