
Generalized Sliced Wasserstein Distance

Updated 14 November 2025
  • Generalized Sliced Wasserstein (GSW) Distance is a metric that extends classical SW by using a family of nonlinear projections to capture complex geometric discrepancies in high-dimensional probability measures.
  • It leverages Monte Carlo integration over flexible slicers (linear, polynomial, neural) to maintain key metric properties like non-negativity, triangle inequality, and invariances, with scalable computational costs.
  • GSW enables deterministic approximations and Euclidean embeddings via the Cumulative Distribution Transform, offering practical benefits in generative modeling, set representation, and manifold-based data analysis.

The Generalized Sliced Wasserstein (GSW) distance extends the sliced Wasserstein (SW) metric to a broader class of nonlinear projections, allowing optimal transport-based distances between high-dimensional probability measures to capture more complex geometric discrepancies. GSW distances have found application in generative modeling, set representation learning, and the analysis of high-dimensional and manifold-structured data, providing both scalable computational properties and rich geometric expressivity (Kolouri et al., 2019, NaderiAlizadeh et al., 2021, Le et al., 2022, Chapel et al., 28 May 2025).

1. Formal Definition and Mathematical Foundations

Given two Borel probability measures $\mu, \nu$ on $\mathbb{R}^d$ with finite $p$-th moments, and a family of measurable “slicer” or “defining” functions $g_\theta: \mathbb{R}^d \to \mathbb{R}$ parameterized by $\theta \in \Omega_\theta \subset \mathbb{R}^{d_\theta}$, the GSW distance of order $p$ is defined as

$$\mathrm{GSW}_p(\mu,\nu) = \left(\int_{\Omega_\theta} W_p^p\!\left(g_{\theta\#}\mu,\; g_{\theta\#}\nu\right) d\theta\right)^{1/p},$$

where $g_{\theta\#}\mu$ is the push-forward of $\mu$ under $g_\theta$, and $W_p$ is the one-dimensional $p$-Wasserstein distance, which has a closed form via quantile functions:

$$W_p^p(\mu_t, \nu_t) = \int_0^1 \left|F_{\mu_t}^{-1}(u) - F_{\nu_t}^{-1}(u)\right|^p du.$$

For the linear case $g_\theta(x) = \langle x, \theta\rangle$ with $\theta \in S^{d-1}$, classical SW is recovered. GSW can also be maximized over $\theta$ to define the max-GSW distance:

$$\text{max-GSW}_p(\mu, \nu) = \sup_{\theta \in \Omega_\theta} W_p\left(g_{\theta\#}\mu,\, g_{\theta\#}\nu\right).$$

The defining function $g_\theta$ can be linear, polynomial, trigonometric, or even a neural network, providing a rich parameterization space.
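For empirical measures with equal sample counts and uniform weights, the quantile formula reduces to matching sorted samples. The following minimal sketch (illustrative code, not from the cited papers) computes the closed-form one-dimensional $W_p$ this way:

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """Closed-form W_p between two 1-D empirical measures with the same sample count."""
    x_sorted = np.sort(x)  # empirical quantile function of the first measure
    y_sorted = np.sort(y)  # empirical quantile function of the second measure
    return np.mean(np.abs(x_sorted - y_sorted) ** p) ** (1.0 / p)
```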

GSW is closely related to the generalized Radon transform, replacing integration over hyperplanes (in classical SW) with integration over a family of hypersurfaces determined by $g_\theta$ (Kolouri et al., 2019, NaderiAlizadeh et al., 2021, Le et al., 2022).

2. Metric Properties, Invariances, and Injectivity

The GSW distance satisfies the following metric properties under mild assumptions on $g_\theta$:

  • Nonnegativity and symmetry are inherited from $W_p$.
  • The triangle inequality holds due to the $L^p(\Omega_\theta)$ integration and Minkowski's inequality.
  • Identity of indiscernibles (i.e., $\mathrm{GSW}_p(\mu,\nu)=0$ implies $\mu=\nu$) holds if and only if the generalized Radon transform $\mu \mapsto \{g_{\theta\#}\mu\}_{\theta\in\Omega_\theta}$ is injective; suitable injectivity and regularity conditions on $g_\theta$ thus ensure GSW is a proper metric.
  • Translation invariance: translating both measures by the same $a\in\mathbb{R}^d$ leaves GSW unchanged.
  • Rotational invariance (in the linear case) follows from choosing a rotationally invariant measure over $\theta$.
  • Equivalence to the full Wasserstein distance requires $g_\theta$ to form a sufficiently rich or “complete” family; otherwise, GSW is a weaker but efficiently computable surrogate.

Examples of injective $g_\theta$ families include odd-degree homogeneous polynomials and certain spherical or trigonometric functions (Kolouri et al., 2019). This allows GSW and max-GSW distances to be valid metrics over probability measures under the stated assumptions.
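As a concrete illustration (hypothetical helper functions, written for $d = 2$), the linear slicer below recovers classical SW, while the cubic slicer instantiates the odd-degree homogeneous polynomial family mentioned above:

```python
import numpy as np

def g_linear(x, theta):
    """Linear slicer: g_theta(x) = <x, theta>, with theta on the unit circle."""
    return float(theta @ x)

def g_cubic(x, theta):
    """Odd-degree (cubic) homogeneous polynomial slicer: theta weights the
    four degree-3 monomials x1^3, x1^2*x2, x1*x2^2, x2^3."""
    monomials = np.array([x[0]**3, x[0]**2 * x[1], x[0] * x[1]**2, x[1]**3])
    return float(theta @ monomials)
```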

3. Algorithms for Efficient Computation and Approximation

Computation of GSW distances proceeds by Monte Carlo integration over a finite set of slicers $\{\theta_\ell\}_{\ell=1}^L$:

  1. For each $\theta_\ell$, project the data: $Y_\mu^\ell = [g_{\theta_\ell}(x_1), \ldots, g_{\theta_\ell}(x_{M_\mu})]$.
  2. Compute the one-dimensional Wasserstein distance $W_p(g_{\theta_\ell\#}\mu, g_{\theta_\ell\#}\nu)$ via sorting and quantile matching.
  3. Aggregate over $\ell$ to approximate the integral (a minimal sketch of this procedure follows the list).
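A minimal Monte Carlo estimator along these lines might look as follows; `slicer` is any callable mapping a sample array and a parameter $\theta$ to one-dimensional projections, and both samples are assumed to have equal size and uniform weights (illustrative code, not the papers' reference implementation):

```python
import numpy as np

def gsw_monte_carlo(X, Y, slicer, thetas, p=2):
    """Approximate GSW_p(mu, nu) from samples X, Y using L slicer parameters `thetas`."""
    costs = []
    for theta in thetas:
        x_proj = np.sort(slicer(X, theta))  # step 1: project, then sort
        y_proj = np.sort(slicer(Y, theta))  # sorting = empirical quantile matching
        costs.append(np.mean(np.abs(x_proj - y_proj) ** p))  # step 2: 1-D W_p^p
    return np.mean(costs) ** (1.0 / p)  # step 3: average over slices

# Usage sketch with random linear directions (recovering classical SW):
# thetas = np.random.randn(L, d); thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
# dist = gsw_monte_carlo(X, Y, lambda Z, t: Z @ t, thetas)
```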

The computational complexity for $N$ samples and $L$ slicers is $O(L\, N \log N)$ for SW/GSW (due to the sorting operations) (NaderiAlizadeh et al., 2021, Kolouri et al., 2019, Le et al., 2022). When $g_\theta$ is a neural network or a complex polynomial, the additional cost is $O(L\, N\, \mathrm{cost}(g))$ per evaluation.

For deterministic, closed-form approximations:

  • With polynomial $g_\theta$, GSW can be equivalently computed by lifting the data to monomial feature space and then applying SW, or even using Gaussian approximations in high dimensions, yielding cost $O(N\, d^m)$ for degree $m$ and error decaying as $O(d^{-m/8})$ (Le et al., 2022); a sketch of this lifting follows the list.
  • With neural-network-based $g_\theta$, the algorithm lifts data via random Gaussian matrices, similarly reducing GSW computation to an SW in the transformed space; the complexity is $O(N\, d^2 n)$ for $n$ layers, with error $O(3^{n/4} d^{-1/4})$.
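A hedged sketch of the polynomial-lifting idea, under the same equal-sample-size assumption as before: map the data to its homogeneous degree-$m$ monomial features (of size $O(d^m)$), then run ordinary linear SW in the lifted space. This mirrors the reduction described above but is not the authors' exact estimator.

```python
import numpy as np
from itertools import combinations_with_replacement

def monomial_lift(X, degree):
    """Map (N, d) data to its homogeneous degree-`degree` monomial features."""
    monomials = list(combinations_with_replacement(range(X.shape[1]), degree))
    return np.stack([np.prod(X[:, list(idx)], axis=1) for idx in monomials], axis=1)

def poly_gsw(X, Y, degree, n_slices=64, p=2, seed=0):
    rng = np.random.default_rng(seed)
    Xl, Yl = monomial_lift(X, degree), monomial_lift(Y, degree)
    thetas = rng.standard_normal((n_slices, Xl.shape[1]))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)  # random directions in the lifted space
    costs = [np.mean(np.abs(np.sort(Xl @ t) - np.sort(Yl @ t)) ** p) for t in thetas]
    return np.mean(costs) ** (1.0 / p)
```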

Max-GSW and variants relying on optimization over $\theta$ are more costly due to the optimization loop, but can capture the most discriminative projections with fewer slices.
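For the linear slicer, one simple way to approximate max-SW (and, by swapping in a nonlinear projection, max-GSW) is projected gradient ascent on $\theta$ with automatic differentiation, since the sort-based one-dimensional cost is piecewise differentiable in $\theta$. The sketch below is illustrative only; the step size, iteration count, and sphere re-projection are assumptions.

```python
import torch

def max_sw(X, Y, p=2, n_steps=200, lr=0.1):
    """Gradient-ascent approximation of max-SW_p between equal-size samples X, Y of shape (N, d)."""
    theta = torch.randn(X.shape[1], requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(n_steps):
        direction = theta / theta.norm()  # constrain the slice to the unit sphere
        cost = (torch.sort(X @ direction).values
                - torch.sort(Y @ direction).values).abs().pow(p).mean()
        opt.zero_grad()
        (-cost).backward()  # ascend on the sliced cost
        opt.step()
    with torch.no_grad():
        direction = theta / theta.norm()
        cost = (torch.sort(X @ direction).values
                - torch.sort(Y @ direction).values).abs().pow(p).mean()
    return cost.pow(1.0 / p)
```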

4. Euclidean Embedding via the Cumulative Distribution Transform

The GSW metric admits an exact isometric Euclidean embedding harnessing the Cumulative Distribution Transform (CDT):

  • For each slice $g_\theta$, compute the optimal Monge map $f_\mu^\theta(t) = F_{g_{\theta\#}\mu}^{-1}\big(F_{g_{\theta\#}\mu_0}(t)\big)$ pushing the projected reference distribution $g_{\theta\#}\mu_0$ onto the projected empirical distribution.
  • The transport residual $\hat{\nu}_\mu^\theta(t) = f_\mu^\theta(t) - t$ captures the transport cost in $L^p$ on the one-dimensional slice.
  • The global embedding $\Psi(\mu) = \{\hat{\nu}_\mu^\theta\}_{\theta\in\Omega_\theta}$ satisfies $\|\Psi(\mu) - \Psi(\nu)\|_{L^p(\Omega_\theta;\, L^p(\nu_0^\theta))} = \mathrm{GSW}_p(\mu, \nu)$. Thus, GSW defines a Euclidean (or $\ell^p$) metric space for probability measures, enabling standard statistical learning operations while preserving optimal transport geometry (NaderiAlizadeh et al., 2021).

Empirically, the embedding can be approximated by projecting, sorting, and quantile matching, then concatenating residuals across slices, yielding a permutation-invariant, geometry-aware embedding suitable for machine learning on set-structured data.
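A minimal sketch of that empirical embedding, assuming a fixed reference sample of the same size as the input set, uniform weights, and $p = 2$ (all names below are illustrative):

```python
import numpy as np

def gsw_embedding(X, X_ref, slicer, thetas):
    """Permutation-invariant embedding whose Euclidean distance between two sets
    (of equal size, embedded against the same reference) approximates GSW_2."""
    parts = []
    for theta in thetas:
        x_sorted = np.sort(slicer(X, theta))        # empirical quantiles of the projected set
        ref_sorted = np.sort(slicer(X_ref, theta))  # quantiles of the projected reference
        parts.append(x_sorted - ref_sorted)         # transport residual on this slice
    # normalize so that ||emb(X) - emb(Y)||_2 matches the Monte Carlo GSW_2 estimate
    return np.concatenate(parts) / np.sqrt(len(thetas) * X.shape[0])
```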

5. Applications and Comparative Performance

GSW distances have demonstrated practical advantages in various machine learning tasks:

  • Generative modeling: GSW-regularized flows and autoencoders show faster convergence and lower true Wasserstein loss compared to classical SW, particularly when the ground-truth distributions are complex or multimodal. Using max-GSW further improves performance, matching adversarial OT-based methods in some settings (Kolouri et al., 2019).
  • Set representation learning: Geometrically interpretable, permutation-invariant embeddings from GSW outperform state-of-the-art pooling (e.g., attention, mean) and metric learning approaches in tasks such as classification and matching, thanks to exact isometry and the ability to incorporate nonlinear structure (NaderiAlizadeh et al., 2021).
  • Scalability: With computational costs linear in $L$ and $N$ (up to log factors), GSW easily scales to high-dimensional and large-scale data. For very high-dimensional problems, deterministic/surrogate estimators leveraging Gaussian concentration are effective, with error decreasing as $d$ increases (Le et al., 2022).
  • Manifold and hyperbolic data: GSW plans extend to data on Riemannian manifolds by defining slice functions $\phi$ compatible with the manifold structure (e.g., horosphere projections for hyperbolic space) and employing Riemannian gradient-based optimization (Chapel et al., 28 May 2025).
  • Conditional flow matching and transport plan inference: Differentiable GSW plans yield efficient, accurately computed OT plans for gradient flows and neural generative models; empirical Fréchet Inception Distance (FID) benchmarks on CIFAR-10 show improved sample quality with fewer function evaluations using DGSWP (Chapel et al., 28 May 2025).

6. Advanced Extensions: Bilevel Optimization and Differentiable Plans

Recent developments cast the min-GSW problem as a bilevel optimization:

  • Lower level: Solve the 1D OT problem for a fixed projection $\theta$ (sort-and-match).
  • Upper level: Minimize the true cost in the original ground metric by optimizing over $\theta$. Because sorting is non-differentiable, smoothing via Gaussian perturbations enables gradient-based optimization (via Stein's lemma), facilitating end-to-end training even with neural-network slicers (Chapel et al., 28 May 2025); a sketch of this smoothed gradient appears after this list.
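A hedged sketch of that Gaussian-smoothing idea: the (possibly non-differentiable) upper-level objective $F(\theta)$ is replaced by $\mathbb{E}_Z[F(\theta + \sigma Z)]$ with $Z \sim \mathcal{N}(0, I)$, whose gradient admits the Monte Carlo form $\frac{1}{\sigma}\,\mathbb{E}[F(\theta + \sigma Z)\, Z]$. The function and parameter names below are placeholders, not the authors' implementation.

```python
import numpy as np

def smoothed_grad(F, theta, sigma=0.1, n_samples=64, rng=None):
    """Monte Carlo gradient of E_Z[F(theta + sigma*Z)] via (1/sigma) * E[F(theta + sigma*Z) * Z].
    `theta` is a flat numpy array; `F` maps such an array to a scalar cost."""
    rng = rng or np.random.default_rng(0)
    Z = rng.standard_normal((n_samples, theta.size))
    values = np.array([F(theta + sigma * z) for z in Z])
    values = values - values.mean()  # baseline subtraction: unbiased (E[Z] = 0) variance reduction
    return (values[:, None] * Z).mean(axis=0) / sigma
```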

The algorithmic complexity per iteration is $O\big(N \cdot (n+m)\log(n+m)\big)$ for $N$ Monte Carlo smoothing samples. Empirical performance shows that neural-slicer min-GSW plans can recover complex optimal matchings even in high dimensions and on manifolds.

A major advantage is the upper-bounding property: for any fixed $\theta$, the resulting plan is a feasible transport plan for the original problem, so its cost upper-bounds the true OT cost, with metric consistency under mild conditions.

7. Limitations, Open Problems, and Practical Guidelines

Despite its flexibility, several practical and theoretical aspects shape GSW research:

  • Slice complexity: In worst-case geometry, the number of slices $L$ needed for accurate approximation can grow with the data dimension; learning or optimizing slicers mitigates but does not remove this scaling (NaderiAlizadeh et al., 2021).
  • Defining function design: Polynomial and neural-network slicers yield the most control, but the former’s cost quickly grows with degree and dimension, while the latter’s bias/variance tradeoff depends on network depth and width (Le et al., 2022).
  • Nonlinear projections: Certain choices, such as trigonometric/circular slices, lack deterministic, closed-form fast estimators; such cases currently rely on Monte Carlo methods.
  • Manifold extension: While general, the quality of the learned or chosen manifold projections (e.g., horospheres) crucially impacts metric properties and plan accuracy.
  • Practical recipes: For moderate $d$, polynomial GSW offers high precision at polynomial cost. For large $d$, neural GSW gives good accuracy at quadratic cost. Combining these with random feature compression (e.g., PCA) is effective for $d \gg 10^3$.

Summary table of computational and statistical properties:

| Method | Complexity | Error in high $d$ |
|---|---|---|
| Monte Carlo GSW | $O(L\, N \log N)$ | Decreases with $L$ |
| Poly-GSW (degree $m$) | $O(N\, d^m)$ | $O(d^{-m/8})$ |
| Neural-GSW ($n$ layers) | $O(N\, d^2 n)$ | $O(3^{n/4} d^{-1/4})$ |

In conclusion, the Generalized Sliced Wasserstein family of distances and embeddings offers a flexible, scalable, and geometry-sensitive framework for optimal transport-based learning and inference, subsuming classical SW and enabling new directions in modern high-dimensional machine learning (Kolouri et al., 2019, NaderiAlizadeh et al., 2021, Le et al., 2022, Chapel et al., 28 May 2025).
