
Sliced Wasserstein Distance

Updated 14 January 2026
  • Sliced Wasserstein Distance (SWD) is a metric defined by integrating one-dimensional Wasserstein distances over random projections of high-dimensional probability measures.
  • It offers computational efficiency through Monte Carlo approximations and sorting-based quantile estimation, achieving dimension-free statistical rates and robust performance.
  • Advanced variants like Max-Sliced, Distributional, and Energy-Based SWD adapt the methodology for enhanced discrimination in applications such as generative modeling and geometric data analysis.

The Sliced Wasserstein Distance (SWD) is a computational optimal transport metric that leverages one-dimensional projections to define tractable discrepancies between probability measures on high-dimensional spaces. SWD enjoys genuine metric structure, efficient Monte Carlo approximations via quantile sorting, and dimension-free statistical rates. Its fundamental properties and extensions have made it a core method in statistical machine learning, generative modeling, geometric data analysis, and scalable optimal transport.

1. Formal Definition and Metric Structure

Let $\mu, \nu$ be two Borel probability measures with finite $p$-th moments on $\mathbb{R}^d$. For a unit vector $\theta \in S^{d-1}$ (the unit sphere in $\mathbb{R}^d$), denote by $P_\theta : x \mapsto \langle \theta, x \rangle$ the one-dimensional projection. The pushforward (marginal) measures are $\mu_\theta = P_{\theta\#}\mu$ and $\nu_\theta = P_{\theta\#}\nu$. The one-dimensional $p$-Wasserstein distance has a closed form in terms of quantile functions,

W_p^p(\mu_\theta, \nu_\theta) = \int_0^1 |F_{\mu_\theta}^{-1}(t) - F_{\nu_\theta}^{-1}(t)|^p \, dt,

where $F_{\mu_\theta}^{-1}$ is the quantile function of $\mu_\theta$. The $p$-Sliced Wasserstein distance is defined as

SW_p(\mu, \nu) = \left( \int_{S^{d-1}} W_p^p(\mu_\theta, \nu_\theta)\, d\sigma(\theta) \right)^{1/p}

with $\sigma$ the uniform measure over $S^{d-1}$ (Vauthier et al., 10 Feb 2025).
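
As a concrete illustration of the quantile formula above (not taken from the cited papers), here is a minimal NumPy sketch of the 1D $W_p^p$ for empirical measures: with equally many, equally weighted samples the quantile integral reduces to matching sorted samples, and a midpoint quantile grid handles the general case approximately.

import numpy as np

def w_p_1d(x, y, p=2, n_grid=1000):
    # Closed-form 1D W_p^p: integrate |F_x^{-1}(t) - F_y^{-1}(t)|^p over t in (0, 1).
    # For equal-size, equally weighted samples this is exactly the mean p-th power gap
    # between sorted samples; otherwise approximate the integral on a quantile grid.
    if len(x) == len(y):
        return np.mean(np.abs(np.sort(x) - np.sort(y)) ** p)
    t = (np.arange(n_grid) + 0.5) / n_grid
    return np.mean(np.abs(np.quantile(x, t) - np.quantile(y, t)) ** p)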

SWD is a bona fide metric on $\mathcal{P}_p(\mathbb{R}^d)$ (the space of probability measures with finite $p$-th moment), satisfying non-negativity, symmetry, definiteness, and the triangle inequality. SWD metrizes the weak topology induced by $W_p$, and there exist constants $c_1, c_2 > 0$ (depending on the compact support) such that $c_1\, SW_p(\mu, \nu) \le W_p(\mu, \nu) \le c_2\, SW_p(\mu, \nu)$ (Chen et al., 2023).

2. Computational and Statistical Properties

Computational advantages are central to the appeal of SWD. Each slice (projection) reduces the high-dimensional OT problem to a one-dimensional one; for discrete measures supported on $N$ points, the 1D $W_p$ requires $O(N \log N)$ (sorting-based) complexity per slice. Approximating the integral by averaging over $L$ random projections (Monte Carlo estimation) yields overall $O(LN \log N)$ complexity, independent of $d$ except for the cost of the projections themselves.

The sample complexity of SWD is dimension-free: for log-concave distributions with covariance $\Sigma$, the empirical SWD converges at rate

\mathbb{E}[SW_p^{avg}(\widehat{\mu}_n, \mu)] \lesssim_p \|\Sigma\|_{op}^{1/2}\, n^{-1/(2 \vee p)}

(with possible logarithmic factors at $p = 2$), independent of the ambient dimension $d$ (Nietert et al., 2022). This circumvents the curse of dimensionality inherent to the classical $W_p$.

Furthermore, for empirical SWD estimators over $m$ projections, the expected error decays as $O(m^{-1/2})$, with constants that are also dimension-free or even improve with $d$ under matching means and identity covariance, an instance of the “blessing of dimensionality” (Nietert et al., 2022). Concentration inequalities and limit theorems (including Banach-space CLTs) have been established in the $p = 1$ case (Xu et al., 2022).

3. Generalizations and Variants

Several extensions and variants of classical SWD have been developed to focus on informative projections or adapt to non-Euclidean settings.

Max-Sliced Wasserstein (MSW):

MSW_p(\mu, \nu) = \max_{\theta \in S^{d-1}} W_p(\mu_\theta, \nu_\theta)

emphasizes the direction of largest discrepancy. While MSW affords greater discrimination power, its estimation and optimization are typically nonconvex (Nietert et al., 2022, Xu et al., 2022).
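
As an illustration of this nonconvex search (a hypothetical sketch, not the estimator used in the cited works), the max-sliced direction can be sought by projected gradient ascent on the sphere, assuming equal-size, equally weighted empirical measures:

import numpy as np

def max_sliced_w2(X, Y, n_iter=300, lr=0.05, rng=None):
    # Projected gradient ascent for Max-Sliced W_2: the sorted coupling is held fixed
    # when differentiating (a Danskin-style step), then theta is renormalized onto the
    # sphere. The objective is nonconvex, so this generally returns a local maximizer.
    rng = np.random.default_rng(rng)
    theta = rng.standard_normal(X.shape[1])
    theta /= np.linalg.norm(theta)
    for _ in range(n_iter):
        ix, iy = np.argsort(X @ theta), np.argsort(Y @ theta)
        gap = (X[ix] - Y[iy]) @ theta                    # sorted-coupling residuals
        grad = 2.0 * (X[ix] - Y[iy]).T @ gap / len(gap)  # gradient of mean(gap**2)
        theta = theta + lr * grad
        theta /= np.linalg.norm(theta)
    gap = np.sort(X @ theta) - np.sort(Y @ theta)
    return float(np.sqrt(np.mean(gap ** 2))), theta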

Adaptive and Distributional Sliced Wasserstein: Rather than averaging over the uniform measure, adaptive schemes learn a slicing measure concentrated on discriminative directions. Distributional Sliced Wasserstein (DSW) finds a measure $\psi$ on $S^{d-1}$ maximizing

DSW_\lambda(\mu, \nu) = \sup_{\psi \in \mathcal{P}(S^{d-1})} \mathbb{E}_{\theta \sim \psi}[W_2(\mu_\theta, \nu_\theta)] - \lambda R(\psi)

where $R(\psi)$ penalizes directional alignment. DSW interpolates between classical SW (uniform $\psi$) and MSW (atomic $\psi$) (Nguyen et al., 2020).

Energy-Based SW (EBSW): The slicing law is specified via an energy function of the projected Wasserstein cost, $p(v) \propto f(W_p^p(P_v, Q_v))$ for monotone $f$, concentrating samples on informative directions with no parametric optimization required. EBSW is a semi-metric, satisfies $SW \le EBSW \le MSW \le W_p$, and admits practical estimators with computational complexity similar to SW (Nguyen et al., 2023).
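
A minimal importance-sampling sketch of this idea (an illustrative stand-in for, not the exact estimator of, Nguyen et al., 2023): draw directions uniformly, then reweight them by $f$ of the projected cost with self-normalized weights.

import numpy as np

def ebsw(X, Y, n_proj=100, p=2, f=np.exp, rng=None):
    # Energy-based reweighting of uniformly drawn slices: directions with larger
    # projected W_p^p cost receive more weight (self-normalized importance sampling).
    # Assumes equal-size, equally weighted empirical measures X, Y of shape (n, d).
    rng = np.random.default_rng(rng)
    theta = rng.standard_normal((n_proj, X.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    costs = np.array([np.mean(np.abs(np.sort(X @ t) - np.sort(Y @ t)) ** p)
                      for t in theta])
    w = f(costs)
    w /= w.sum()
    return float((w @ costs) ** (1.0 / p))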

Manifold and Spherical Extensions: SWD has been defined on Cartan–Hadamard manifolds via geodesic projections and on the sphere using the spherical Radon transform, yielding analogues like CH–Sliced Wasserstein and Spherical Sliced Wasserstein (SSW) for manifold-valued data (Bonet et al., 2024, Bonet et al., 2022).

Multi-marginal SW: Sliced multi-marginal Wasserstein (SMW) integrates multi-marginal 1D OT, providing a generalized metric with dimension-free sample complexity and connections to Wasserstein barycenters, used in multitask and reinforcement learning (Cohen et al., 2021).

4. Algorithmic Developments and Practical Estimation

Monte Carlo SWD: The expectation over the sphere is approximated by $L$ i.i.d. (or orthogonal) samples $\theta_1, \ldots, \theta_L$, giving the empirical SWD estimator

\widehat{SW}_p^p(\mu, \nu) = \frac{1}{L} \sum_{\ell=1}^L W_p^p(\mu_{\theta_\ell}, \nu_{\theta_\ell}).
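
In code, this estimator amounts to a few lines of NumPy (a minimal sketch for equal-size, equally weighted empirical measures; function and parameter names are illustrative):

import numpy as np

def sliced_wasserstein(X, Y, n_proj=100, p=2, rng=None):
    # Monte Carlo SW_p: average the sorting-based 1D W_p^p over n_proj directions
    # drawn uniformly on the unit sphere, then take the p-th root.
    rng = np.random.default_rng(rng)
    theta = rng.standard_normal((n_proj, X.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)   # uniform on S^{d-1}
    cost = np.mean([np.mean(np.abs(np.sort(X @ t) - np.sort(Y @ t)) ** p)
                    for t in theta])
    return float(cost ** (1.0 / p))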

Variance reduction via orthogonal projections or control variates (e.g., using Gaussian approximations to projected marginals) can yield significant acceleration, especially in large-scale or deep learning settings (Rowland et al., 2019, Nguyen et al., 2023).
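
A simple form of the orthogonal-coupling idea can be sketched as follows (illustrative only; the helper name is hypothetical): directions are drawn in blocks that are orthonormalized by QR, so slices within a block are exactly orthogonal while random sign flips keep each direction symmetrically distributed on the sphere.

import numpy as np

def orthogonal_directions(d, n_proj, rng=None):
    # Draw directions in blocks of size <= d and orthonormalize each block with QR,
    # which decorrelates slices within a block and reduces Monte Carlo variance.
    rng = np.random.default_rng(rng)
    blocks, remaining = [], n_proj
    while remaining > 0:
        k = min(d, remaining)
        Q, _ = np.linalg.qr(rng.standard_normal((d, k)))   # (d, k) orthonormal columns
        signs = rng.choice([-1.0, 1.0], size=(k, 1))       # random sign flips per slice
        blocks.append(Q.T * signs)
        remaining -= k
    return np.vstack(blocks)   # shape (n_proj, d); rows can replace i.i.d. directions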

Streaming and Deterministic Approximations: Recent approaches support SWD computation on streaming data via compact quantile sketches for each projection, enabling constant-memory and single-pass computation with theoretical error guarantees (Nguyen, 11 May 2025). Under high-dimensional weak dependence, the SWD of two measures can be closely approximated by a deterministic expression derived from their means and second moments, with sublinear error in $d$ and dramatic speed gains (Nadjahi et al., 2021).
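
The streaming setting can be illustrated with a toy single-pass estimator (a hypothetical stand-in, not the cited construction): per measure, a fixed-size reservoir of projected points is maintained, and SW_p is read off the reservoir quantiles at the end of the stream.

import numpy as np

class StreamingSW:
    # Toy constant-memory, single-pass sketch based on reservoir sampling; a stand-in
    # for proper per-projection quantile sketches, for illustration only.
    def __init__(self, d, n_proj=50, reservoir=512, p=2, seed=0):
        self.rng = np.random.default_rng(seed)
        self.theta = self.rng.standard_normal((n_proj, d))
        self.theta /= np.linalg.norm(self.theta, axis=1, keepdims=True)
        self.p, self.k = p, reservoir
        self.res = [np.zeros((n_proj, reservoir)), np.zeros((n_proj, reservoir))]
        self.seen = [0, 0]

    def update(self, x, which):
        # Feed one point x in R^d belonging to measure 0 or 1; the same reservoir slot
        # is reused across projections, so the reservoir is a uniform subsample of points.
        proj, n = self.theta @ x, self.seen[which]
        if n < self.k:
            self.res[which][:, n] = proj
        else:
            j = self.rng.integers(0, n + 1)
            if j < self.k:
                self.res[which][:, j] = proj
        self.seen[which] = n + 1

    def estimate(self, n_grid=200):
        # Average the quantile-based 1D W_p^p over all projections, from the reservoirs.
        t = (np.arange(n_grid) + 0.5) / n_grid
        q0 = np.quantile(self.res[0][:, :min(self.seen[0], self.k)], t, axis=1)
        q1 = np.quantile(self.res[1][:, :min(self.seen[1], self.k)], t, axis=1)
        return float(np.mean(np.abs(q0 - q1) ** self.p) ** (1.0 / self.p))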

Learning Orthogonal Projections: For integration in neural nets, both AEs and GANs can be equipped with differentiable SWD blocks, which learn a small set of orthogonal projections aligned with data discrepancies. This approach dramatically reduces the number of required projections relative to random sampling, yielding efficient and end-to-end differentiable objectives with competitive or superior performance (Wu et al., 2017).
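
A minimal differentiable building block along these lines might look as follows (hypothetical module and parameter names, not the exact architecture of Wu et al., 2017; in that work the projections are trained to emphasize data discrepancies, e.g., via an adversarial step, whereas this sketch only shows the orthogonalized, end-to-end differentiable slicing):

import torch

class LearnedSlicedW2(torch.nn.Module):
    # Differentiable SW block with a small set of learned projections: the raw parameter
    # matrix is orthonormalized by QR in the forward pass, so slices stay orthogonal
    # while remaining trainable (sorting is differentiable with respect to its values).
    def __init__(self, dim, n_proj=16, p=2):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(dim, n_proj))
        self.p = p

    def forward(self, x, y):
        # x, y: equal-size batches of shape (n, dim), treated as equally weighted samples.
        q, _ = torch.linalg.qr(self.weight)      # (dim, n_proj), orthonormal columns
        proj_x, _ = torch.sort(x @ q, dim=0)
        proj_y, _ = torch.sort(y @ q, dim=0)
        return (proj_x - proj_y).abs().pow(self.p).mean()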

5. Gradient Flows, Stability, and Optimization

Minimizing $F(\mu) = \tfrac{1}{2}\, SW_2^2(\mu, \rho)$ via gradient flows defines a formal Wasserstein gradient flow

\partial_t \mu_t + \nabla \cdot (v_{\mu_t}\, \mu_t) = 0,

with an explicit velocity field $v_\mu$ defined via barycentric projections of the 1D optimal plans across all directions. Existence, uniqueness, and convergence properties are governed by the semi-convexity of $F$ along Wasserstein geodesics. In particular, measures that concentrate mass along line segments cannot be stable critical points; any such configuration is a saddle, not a local minimum. Discrete gradient flows admit particle approximations with provable monotonicity and collision avoidance, and under mild regularity only the target measure can be a stable absolutely continuous critical point (Vauthier et al., 10 Feb 2025).
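
A particle-level Euler step of this flow can be sketched as follows (a minimal illustration, assuming the target $\rho$ is represented by the same number of equally weighted samples as the particle system): each particle is pulled, per random direction, toward the target sample of the same projected rank, and the displacements are averaged over slices.

import numpy as np

def sw2_flow_step(X, Y, n_proj=64, step=1.0, rng=None):
    # One explicit-Euler step of a particle approximation to the SW_2^2 gradient flow:
    # pair particles and target samples by the rank of their projections, lift the 1D
    # mismatch back to R^d along each direction, and average over directions.
    rng = np.random.default_rng(rng)
    n, d = X.shape
    theta = rng.standard_normal((n_proj, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    V = np.zeros_like(X)
    for t in theta:
        px = X @ t
        rank = np.argsort(np.argsort(px))        # rank of each particle's projection
        target = np.sort(Y @ t)[rank]            # same-rank target projection
        V += np.outer(px - target, t)            # 1D mismatch lifted along direction t
    return X - step * V / n_proj                 # updated particle positions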

6. Robustness, Statistical Guarantees, and Applications

Robustness: SWD exhibits minimax-optimal, dimension-free robust estimation risk under contamination, and sliced 1-Wasserstein is tightly linked to robust mean estimation—algorithms and guarantees for the latter directly transfer (Nietert et al., 2022). This robustness is critical in high-dimensional inference and contaminated or heavy-tailed settings.

Statistical Estimation and Testing: Central limit theorems, empirical process theory, and concentration inequalities for both SWD and its max-sliced variant underpin applications in hypothesis testing and model comparison (Xu et al., 2022).

Applications: SWD is now central in generative modeling (SWAE and SWGAN variants), point-cloud analysis, density estimation on manifolds (SSW), representation learning, and regression with distributional responses. Its statistical and computational scalability enables image, video, and 3D shape synthesis at high resolution, distribution regression, and multi-task structure transfer, among other applications (Wu et al., 2017, Chen et al., 2023, Cohen et al., 2021, Bonet et al., 2022).

7. Limitations, Open Problems, and Practitioner Guidance

Curse of uninformative projections: In high dimensions, most random projections are nearly orthogonal to the data subspace and thus “uninformative.” Theoretical analysis shows that, under a $k$-dimensional subspace model, a global rescaling of SWD by a constant factor suffices to match the low-dimensional ground truth, justifying a simple learning-rate adjustment in practice rather than more complex slicing-adaptation schemes (Tran et al., 2024).

Slice distribution learning vs. complexity: Adapting or learning the slicing distribution (PAC-SW, DSW, EBSW) can enhance contrast and learning speed, but at the cost of additional complexity, tuning, or instability. Many recent findings suggest that a well-chosen classical SWD metric, properly scaled, can match or surpass these variants in common workflows.

Parameter selection: For gradient-based learning, recommended practice is $p = 2$ for gradient stability, $L = 50$–$100$ projections per minibatch, and tuning the learning rate $\eta$ over several orders of magnitude, with no need for explicit subspace estimation (Tran et al., 2024).

Summary Table: Core Algorithmic Components

Component | Typical Complexity | Role
1D $W_p$ (sorting) | $O(N \log N)$ | Per projection
Monte Carlo SWD | $O(LN \log N)$ | $L$ projections
Streaming SWD | $O(Ln/k + Lk + Ldn)$ | Quantile sketches, streaming
Variance-reduced SWD | $O(LN \log N)$ | Leverages control variates
Deterministic SWD est. | $O(dn)$ | Fast CLT-based approximation for high $d$

Practitioners should consider the structure of their data (e.g., subspace concentration), their computational resources, and the need for discriminative slicing when choosing between classical SWD, its adaptive variants, and control-variate or streaming schemes.

