
Generalized Sliced Wasserstein Distance

Updated 14 November 2025
  • Generalized Sliced Wasserstein (GSW) Distance is a metric that extends classical SW by using a family of nonlinear projections to capture complex geometric discrepancies in high-dimensional probability measures.
  • It leverages Monte Carlo integration over flexible slicers (linear, polynomial, neural) to maintain key metric properties like non-negativity, triangle inequality, and invariances, with scalable computational costs.
  • GSW enables deterministic approximations and Euclidean embeddings via the Cumulative Distribution Transform, offering practical benefits in generative modeling, set representation, and manifold-based data analysis.

The Generalized Sliced Wasserstein (GSW) distance extends the sliced Wasserstein (SW) metric to a broader class of nonlinear projections, allowing optimal transport-based distances between high-dimensional probability measures to capture more complex geometric discrepancies. GSW distances have found application in generative modeling, set representation learning, and the analysis of high-dimensional and manifold-structured data, providing both scalable computational properties and rich geometric expressivity (Kolouri et al., 2019, NaderiAlizadeh et al., 2021, Le et al., 2022, Chapel et al., 28 May 2025).

1. Formal Definition and Mathematical Foundations

Given two Borel probability measures $\mu, \nu$ on $\mathbb{R}^d$ with finite $p$-th moments, and a family of measurable “slicer” or “defining” functions $g_\theta: \mathbb{R}^d \to \mathbb{R}$ parameterized by $\theta \in \Omega_\theta \subset \mathbb{R}^{d_\theta}$, the GSW distance of order $p$ is defined as

$$\mathrm{GSW}_p(\mu,\nu) = \left(\int_{\Omega_\theta} W_p^p\!\left(g_{\theta\#}\mu,\; g_{\theta\#}\nu\right) d\theta\right)^{1/p},$$

where $g_{\theta\#}\mu$ is the push-forward of $\mu$ under $g_\theta$, and $W_p$ is the one-dimensional $p$-Wasserstein distance, which has a closed form via quantile functions:

$$W_p^p(\mu_t, \nu_t) = \int_0^1 \left|F_{\mu_t}^{-1}(u) - F_{\nu_t}^{-1}(u)\right|^p du.$$

For the linear case $g_\theta(x) = \langle x, \theta\rangle$ with $\theta \in S^{d-1}$, classical SW is recovered. GSW can also be maximized over $\theta$ to define the max-GSW distance:

$$\text{max-GSW}_p(\mu, \nu) = \sup_{\theta \in \Omega_\theta} W_p\left(g_{\theta\#}\mu,\, g_{\theta\#}\nu\right).$$

The defining function $g_\theta$ can be linear, polynomial, trigonometric, or even a neural network, providing a rich parameterization space.
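For empirical measures with equal sample counts and uniform weights, the quantile formula reduces to matching sorted samples. The following minimal sketch (illustrative code, not from the cited papers) computes the closed-form one-dimensional $W_p$ this way:

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """Closed-form W_p between two 1-D empirical measures with the same sample count."""
    x_sorted = np.sort(x)  # empirical quantile function of the first measure
    y_sorted = np.sort(y)  # empirical quantile function of the second measure
    return np.mean(np.abs(x_sorted - y_sorted) ** p) ** (1.0 / p)
```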

GSW is closely related to the generalized Radon transform, replacing integration over hyperplanes (in classical SW) with integration over a family of hypersurfaces determined by $g_\theta$ (Kolouri et al., 2019, NaderiAlizadeh et al., 2021, Le et al., 2022).

2. Metric Properties, Invariances, and Injectivity

The GSW distance satisfies the following metric properties under mild assumptions on $g_\theta$:

  • Nonnegativity and symmetry are inherited from $W_p$.
  • The triangle inequality holds due to the $L^p(\Omega_\theta)$ integration and Minkowski's inequality.
  • Identity of indiscernibles (i.e., $\mathrm{GSW}_p(\mu,\nu)=0$ implies $\mu=\nu$) holds if and only if the generalized Radon transform $\mu \mapsto \{g_{\theta\#}\mu\}_{\theta\in\Omega_\theta}$ is injective; suitable injectivity and regularity conditions on $g_\theta$ thus ensure GSW is a proper metric.
  • Translation invariance: translating both measures by the same $a\in\mathbb{R}^d$ leaves GSW unchanged.
  • Rotational invariance (in the linear case) follows from choosing a rotationally invariant measure over $\theta$.
  • Equivalence to the full Wasserstein distance requires $g_\theta$ to form a sufficiently rich or “complete” family; otherwise, GSW is a weaker but efficiently computable surrogate.

Examples of injective $g_\theta$ families include odd-degree homogeneous polynomials and certain spherical or trigonometric functions (Kolouri et al., 2019). This allows GSW and max-GSW distances to be valid metrics over probability measures under the stated assumptions.
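As a concrete illustration (hypothetical helper functions, written for $d = 2$), the linear slicer below recovers classical SW, while the cubic slicer instantiates the odd-degree homogeneous polynomial family mentioned above:

```python
import numpy as np

def g_linear(x, theta):
    """Linear slicer: g_theta(x) = <x, theta>, with theta on the unit circle."""
    return float(theta @ x)

def g_cubic(x, theta):
    """Odd-degree (cubic) homogeneous polynomial slicer: theta weights the
    four degree-3 monomials x1^3, x1^2*x2, x1*x2^2, x2^3."""
    monomials = np.array([x[0]**3, x[0]**2 * x[1], x[0] * x[1]**2, x[1]**3])
    return float(theta @ monomials)
```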

3. Algorithms for Efficient Computation and Approximation

Computation of GSW distances proceeds by Monte Carlo integration over a finite set of slicers $\{\theta_\ell\}_{\ell=1}^L$:

  1. For each $\theta_\ell$, project the data: $Y_\mu^\ell = [g_{\theta_\ell}(x_1), \ldots, g_{\theta_\ell}(x_{M_\mu})]$.
  2. Compute the one-dimensional Wasserstein distance $W_p(g_{\theta_\ell\#}\mu, g_{\theta_\ell\#}\nu)$ via sorting and quantile matching.
  3. Aggregate over $\ell$ to approximate the integral (a minimal sketch of this procedure follows the list).
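A minimal Monte Carlo estimator along these lines might look as follows; `slicer` is any callable mapping a sample array and a parameter $\theta$ to one-dimensional projections, and both samples are assumed to have equal size and uniform weights (illustrative code, not the papers' reference implementation):

```python
import numpy as np

def gsw_monte_carlo(X, Y, slicer, thetas, p=2):
    """Approximate GSW_p(mu, nu) from samples X, Y using L slicer parameters `thetas`."""
    costs = []
    for theta in thetas:
        x_proj = np.sort(slicer(X, theta))  # step 1: project, then sort
        y_proj = np.sort(slicer(Y, theta))  # sorting = empirical quantile matching
        costs.append(np.mean(np.abs(x_proj - y_proj) ** p))  # step 2: 1-D W_p^p
    return np.mean(costs) ** (1.0 / p)  # step 3: average over slices

# Usage sketch with random linear directions (recovering classical SW):
# thetas = np.random.randn(L, d); thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
# dist = gsw_monte_carlo(X, Y, lambda Z, t: Z @ t, thetas)
```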

The computational complexity for $N$ samples and $L$ slicers is $O(L\, N \log N)$ for SW/GSW (due to the sorting operations) (NaderiAlizadeh et al., 2021, Kolouri et al., 2019, Le et al., 2022). When $g_\theta$ is a neural network or a complex polynomial, the additional cost is $O(L\, N\, \mathrm{cost}(g))$ per evaluation.

For deterministic, closed-form approximations:

  • With polynomial $g_\theta$, GSW can be equivalently computed by lifting the data to monomial feature space and then applying SW, or even using Gaussian approximations in high dimensions, yielding cost $O(N\, d^m)$ for degree $m$ and error decaying as $O(d^{-m/8})$ (Le et al., 2022); a sketch of this lifting follows the list.
  • With neural-network-based $g_\theta$, the algorithm lifts data via random Gaussian matrices, similarly reducing GSW computation to an SW in the transformed space; the complexity is $O(N\, d^2 n)$ for $n$ layers, with error $O(3^{n/4} d^{-1/4})$.
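A hedged sketch of the polynomial-lifting idea, under the same equal-sample-size assumption as before: map the data to its homogeneous degree-$m$ monomial features (of size $O(d^m)$), then run ordinary linear SW in the lifted space. This mirrors the reduction described above but is not the authors' exact estimator.

```python
import numpy as np
from itertools import combinations_with_replacement

def monomial_lift(X, degree):
    """Map (N, d) data to its homogeneous degree-`degree` monomial features."""
    monomials = list(combinations_with_replacement(range(X.shape[1]), degree))
    return np.stack([np.prod(X[:, list(idx)], axis=1) for idx in monomials], axis=1)

def poly_gsw(X, Y, degree, n_slices=64, p=2, seed=0):
    rng = np.random.default_rng(seed)
    Xl, Yl = monomial_lift(X, degree), monomial_lift(Y, degree)
    thetas = rng.standard_normal((n_slices, Xl.shape[1]))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)  # random directions in the lifted space
    costs = [np.mean(np.abs(np.sort(Xl @ t) - np.sort(Yl @ t)) ** p) for t in thetas]
    return np.mean(costs) ** (1.0 / p)
```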

Max-GSW and variants relying on optimization over $\theta$ are more costly due to the optimization loop, but can capture the most discriminative projections with fewer slices.
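For the linear slicer, one simple way to approximate max-SW (and, by swapping in a nonlinear projection, max-GSW) is projected gradient ascent on $\theta$ with automatic differentiation, since the sort-based one-dimensional cost is piecewise differentiable in $\theta$. The sketch below is illustrative only; the step size, iteration count, and sphere re-projection are assumptions.

```python
import torch

def max_sw(X, Y, p=2, n_steps=200, lr=0.1):
    """Gradient-ascent approximation of max-SW_p between equal-size samples X, Y of shape (N, d)."""
    theta = torch.randn(X.shape[1], requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(n_steps):
        direction = theta / theta.norm()  # constrain the slice to the unit sphere
        cost = (torch.sort(X @ direction).values
                - torch.sort(Y @ direction).values).abs().pow(p).mean()
        opt.zero_grad()
        (-cost).backward()  # ascend on the sliced cost
        opt.step()
    with torch.no_grad():
        direction = theta / theta.norm()
        cost = (torch.sort(X @ direction).values
                - torch.sort(Y @ direction).values).abs().pow(p).mean()
    return cost.pow(1.0 / p)
```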

4. Euclidean Embedding via the Cumulative Distribution Transform

The GSW metric admits an exact isometric Euclidean embedding harnessing the Cumulative Distribution Transform (CDT):

  • For each slice $g_\theta$, compute the optimal Monge map $f_\mu^\theta(t) = F_{g_{\theta\#}\mu}^{-1}\big(F_{g_{\theta\#}\mu_0}(t)\big)$ pushing the projected reference distribution $g_{\theta\#}\mu_0$ onto the projected empirical distribution.
  • The transport residual $\hat{\nu}_\mu^\theta(t) = f_\mu^\theta(t) - t$ captures the transport cost in $L^p$ on the one-dimensional slice.
  • The global embedding $\Psi(\mu) = \{\hat{\nu}_\mu^\theta\}_{\theta\in\Omega_\theta}$ satisfies $\|\Psi(\mu) - \Psi(\nu)\|_{L^p(\Omega_\theta;\, L^p(\nu_0^\theta))} = \mathrm{GSW}_p(\mu, \nu)$. Thus, GSW defines a Euclidean (or $\ell^p$) metric space for probability measures, enabling standard statistical learning operations while preserving optimal transport geometry (NaderiAlizadeh et al., 2021).

Empirically, the embedding can be approximated by projecting, sorting, and quantile matching, then concatenating residuals across slices, yielding a permutation-invariant, geometry-aware embedding suitable for machine learning on set-structured data.
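A minimal sketch of that empirical embedding, assuming a fixed reference sample of the same size as the input set, uniform weights, and $p = 2$ (all names below are illustrative):

```python
import numpy as np

def gsw_embedding(X, X_ref, slicer, thetas):
    """Permutation-invariant embedding whose Euclidean distance between two sets
    (of equal size, embedded against the same reference) approximates GSW_2."""
    parts = []
    for theta in thetas:
        x_sorted = np.sort(slicer(X, theta))        # empirical quantiles of the projected set
        ref_sorted = np.sort(slicer(X_ref, theta))  # quantiles of the projected reference
        parts.append(x_sorted - ref_sorted)         # transport residual on this slice
    # normalize so that ||emb(X) - emb(Y)||_2 matches the Monte Carlo GSW_2 estimate
    return np.concatenate(parts) / np.sqrt(len(thetas) * X.shape[0])
```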

5. Applications and Comparative Performance

GSW distances have demonstrated practical advantages in various machine learning tasks:

  • Generative modeling: GSW-regularized flows and autoencoders show faster convergence and lower true Wasserstein loss compared to classical SW, particularly when the ground-truth distributions are complex or multimodal. Using max-GSW further improves performance, matching adversarial OT-based methods in some settings (Kolouri et al., 2019).
  • Set representation learning: Geometrically interpretable, permutation-invariant embeddings from GSW outperform state-of-the-art pooling (e.g., attention, mean) and metric learning approaches in tasks such as classification and matching, thanks to exact isometry and the ability to incorporate nonlinear structure (NaderiAlizadeh et al., 2021).
  • Scalability: With computational costs linear in $L$ and $N$ (up to log factors), GSW easily scales to high-dimensional and large-scale data. For very high-dimensional problems, deterministic/surrogate estimators leveraging Gaussian concentration are effective, with error decreasing as $d$ increases (Le et al., 2022).
  • Manifold and hyperbolic data: GSW plans extend to data on Riemannian manifolds by defining slice functions $\phi$ compatible with the manifold structure (e.g., horosphere projections for hyperbolic space) and employing Riemannian gradient-based optimization (Chapel et al., 28 May 2025).
  • Conditional flow matching and transport plan inference: Differentiable GSW plans yield efficient, accurately computed OT plans for gradient flows and neural generative models; empirical Fréchet Inception Distance (FID) benchmarks on CIFAR-10 show improved sample quality with fewer function evaluations using DGSWP (Chapel et al., 28 May 2025).

6. Advanced Extensions: Bilevel Optimization and Differentiable Plans

Recent developments cast the min-GSW problem as a bilevel optimization:

  • Lower level: Solve the 1D OT problem for a fixed projection $\theta$ (sort-and-match).
  • Upper level: Minimize the true cost in the original ground metric by optimizing over $\theta$. Because sorting is non-differentiable, smoothing via Gaussian perturbations enables gradient-based optimization (via Stein's lemma), facilitating end-to-end training even with neural-network slicers (Chapel et al., 28 May 2025); a sketch of this smoothed gradient appears after this list.
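A hedged sketch of that Gaussian-smoothing idea: the (possibly non-differentiable) upper-level objective $F(\theta)$ is replaced by $\mathbb{E}_Z[F(\theta + \sigma Z)]$ with $Z \sim \mathcal{N}(0, I)$, whose gradient admits the Monte Carlo form $\frac{1}{\sigma}\,\mathbb{E}[F(\theta + \sigma Z)\, Z]$. The function and parameter names below are placeholders, not the authors' implementation.

```python
import numpy as np

def smoothed_grad(F, theta, sigma=0.1, n_samples=64, rng=None):
    """Monte Carlo gradient of E_Z[F(theta + sigma*Z)] via (1/sigma) * E[F(theta + sigma*Z) * Z].
    `theta` is a flat numpy array; `F` maps such an array to a scalar cost."""
    rng = rng or np.random.default_rng(0)
    Z = rng.standard_normal((n_samples, theta.size))
    values = np.array([F(theta + sigma * z) for z in Z])
    values = values - values.mean()  # baseline subtraction: unbiased (E[Z] = 0) variance reduction
    return (values[:, None] * Z).mean(axis=0) / sigma
```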

The algorithmic complexity per iteration is $O\big(N \cdot (n+m)\log(n+m)\big)$ for $N$ Monte Carlo smoothing samples. Empirical performance shows that neural-slicer min-GSW plans can recover complex optimal matchings even in high dimensions and on manifolds.

A major advantage is the upper-bounding property: for any fixed $\theta$, the resulting plan is a feasible transport plan for the original problem, so its cost upper-bounds the true OT cost, with metric consistency under mild conditions.

7. Limitations, Open Problems, and Practical Guidelines

Despite its flexibility, several practical and theoretical aspects shape GSW research:

  • Slice complexity: In worst-case geometry, the number of slices $L$ needed for accurate approximation can grow with the data dimension; learning or optimizing slicers mitigates but does not remove this scaling (NaderiAlizadeh et al., 2021).
  • Defining function design: Polynomial and neural-network slicers yield the most control, but the former’s cost quickly grows with degree and dimension, while the latter’s bias/variance tradeoff depends on network depth and width (Le et al., 2022).
  • Nonlinear projections: Certain choices, such as trigonometric/circular slices, lack deterministic, closed-form fast estimators; such cases currently rely on Monte Carlo methods.
  • Manifold extension: While general, the quality of the learned or chosen manifold projections (e.g., horospheres) crucially impacts metric properties and plan accuracy.
  • Practical recipes: For moderate $d$, polynomial GSW offers high precision at polynomial cost. For large $d$, neural GSW gives good accuracy at quadratic cost. Combining these with random feature compression (e.g., PCA) is effective for $d \gg 10^3$.

Summary table of computational and statistical properties:

| Method | Complexity | Error in high $d$ |
|---|---|---|
| Monte Carlo GSW | $O(L\, N \log N)$ | Decreases with $L$ |
| Poly-GSW (degree $m$) | $O(N\, d^m)$ | $O(d^{-m/8})$ |
| Neural-GSW ($n$ layers) | $O(N\, d^2 n)$ | $O(3^{n/4} d^{-1/4})$ |

In conclusion, the Generalized Sliced Wasserstein family of distances and embeddings offers a flexible, scalable, and geometry-sensitive framework for optimal transport-based learning and inference, subsuming classical SW and enabling new directions in modern high-dimensional machine learning (Kolouri et al., 2019, NaderiAlizadeh et al., 2021, Le et al., 2022, Chapel et al., 28 May 2025).
