
Generalized Sliced Wasserstein Distance

Updated 11 November 2025
  • Generalized Sliced Wasserstein (GSW) is a metric that extends classical SW by employing nonlinear slicing functions to capture complex data structures.
  • It leverages a generalized Radon transform to project high-dimensional measures into tractable one-dimensional optimal transport problems with efficient algorithms.
  • GSW is applied in generative modeling, set representation, and manifold-valued data analysis, offering improved convergence and enhanced embedding quality.

The Generalized Sliced Wasserstein (GSW) distance is a metric on probability measures that extends the classical Sliced Wasserstein (SW) distance by leveraging nonlinear "slicing" functions, substantially enhancing representational power and computational flexibility while retaining the efficient reduction to one-dimensional optimal transport problems. GSW employs a generalized Radon transform to project measures into one-dimensional spaces via a user-chosen family of functions (such as polynomials, nonlinear neural networks, or geometric transforms), capturing complex data structures missed by linear projections. It is widely used in generative modeling, manifold-supported learning, and representation pooling for set-structured data.

1. Mathematical Definition and Generalization

Let $\mu, \nu$ be two probability measures on $\mathbb{R}^d$. Classical SW computes the $p$-Wasserstein distance between their one-dimensional projections, averaged over random directions $\theta$ on the unit sphere:

$$\mathrm{SW}_p(\mu, \nu) = \left( \int_{S^{d-1}} W_p^p\big(\langle \cdot, \theta \rangle_\# \mu,\ \langle \cdot, \theta \rangle_\# \nu\big)\, d\theta \right)^{1/p}$$

GSW extends this by replacing the inner product with a general defining function $g : \mathbb{R}^d \times \Omega \to \mathbb{R}$; the projection becomes $x \mapsto g(x, \theta)$ for a parameter $\theta \in \Omega$ (e.g., a sphere of polynomial coefficients, neural-network weights, or geometric descriptors). The GSW distance is

$$\mathrm{GSW}_{p, \mathcal H}(\mu,\nu) = \left( \int_\Omega W_p^p\big(g(\cdot,\theta)_\# \mu,\ g(\cdot,\theta)_\# \nu\big)\, d\pi(\theta) \right)^{1/p}$$

where $\pi$ is a probability measure on $\Omega$. Key choices for $g$ include homogeneous polynomials, $g(x,\theta) = \sum_{|\alpha|=m} \theta_\alpha x^\alpha$ with $m$ odd, and circular slices, $g(x, \theta) = \|x - r\theta\|_2$ for some radius $r$.
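As a concrete illustration, both defining functions can be implemented in a few lines; note that the homogeneous-polynomial slice is simply a linear slice applied to the vector of degree-$m$ monomials of $x$. A minimal NumPy sketch (the function names are illustrative, not from the cited papers):

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_slice(X, theta, m):
    """Homogeneous polynomial slice g(x, theta) = sum_{|alpha|=m} theta_alpha x^alpha,
    i.e., a linear slice of the degree-m monomial lift of each sample."""
    monomials = list(combinations_with_replacement(range(X.shape[1]), m))
    lifted = np.stack([X[:, list(idx)].prod(axis=1) for idx in monomials], axis=1)
    return lifted @ theta  # theta lives on the unit sphere in R^{#monomials}

def circular_slice(X, theta, r=1.0):
    """Circular slice g(x, theta) = ||x - r*theta||_2."""
    return np.linalg.norm(X - r * theta, axis=1)
```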

The max-GSW variant selects the single slice $\theta_*$ giving maximal separation:

$$\text{max-GSW}_{p, \mathcal H}(\mu,\nu) = \sup_{\theta \in \Omega} W_p\big(g(\cdot, \theta)_\# \mu,\ g(\cdot, \theta)_\# \nu\big)$$

GSW recovers classical SW for $g(x, \theta) = \langle x, \theta \rangle$ and $\Omega = S^{d-1}$.

2. Metric Properties, Topology, and Embedding

If the family $\{g(\cdot, \theta)\}_{\theta \in \Omega}$ is injective in the sense of the generalized Radon transform, GSW satisfies non-negativity, symmetry, the triangle inequality, and identity of indiscernibles, and is therefore a metric on the space of measures with finite $p$-th moments (Kolouri et al., 2019). In the absence of injectivity, it is only a pseudo-metric.

Key theoretical properties:

  • Topology: GSW preserves the weak topology and $p$-th moment convergence of $W_p$, provided $\mathcal H$ is "rich", i.e., sufficiently measure-separating.
  • Isometric Embedding: Given a reference measure $\nu_0$, the cumulative distribution transform (CDT):

$$\hat{\nu}_i^\theta = F_{\nu_i^\theta}^{-1} \circ F_{\nu_0^\theta} - \mathrm{id}$$

is an isometric embedding of $(\mathcal P_p, \mathrm{GSW}_p)$ into $L^p(\Omega_\theta; L^p(\nu_0^\theta))$ (NaderiAlizadeh et al., 2021). Pairwise $L^p$ distances between these embedding vectors match the true GSW distances.
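For one-dimensional empirical measures (a single fixed slice $\theta$), this embedding can be approximated with quantile functions; the following NumPy sketch treats one slice and assumes a continuous reference measure (the function name is illustrative):

```python
import numpy as np

def cdt_embed(samples, ref):
    """Empirical CDT of a 1-D measure w.r.t. a reference: evaluates
    F_nu^{-1} o F_nu0 - id at the sorted reference samples."""
    t = np.sort(ref)
    levels = (np.arange(len(t)) + 0.5) / len(t)   # approximate F_nu0 at t
    return np.quantile(samples, levels) - t

# Pairwise L^p distances between embeddings approximate W_p between the slices:
# (np.mean(np.abs(cdt_embed(a, ref) - cdt_embed(b, ref)) ** p)) ** (1 / p)
```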

3. Computational Algorithms and Fast Approximations

The fundamental advantage of GSW is computational: each projection yields a one-dimensional OT problem, solvable in $O(N \log N)$ by sorting (for $N$ samples).

Standard algorithms proceed via Monte Carlo over $\theta$:

```python
import numpy as np

def gsw(X, Y, g, sample_theta, L=100, p=2):
    """Monte Carlo GSW_p between empirical measures (assumes equal sample counts)."""
    acc = 0.0
    for _ in range(L):
        theta = sample_theta()              # sample theta_l from pi
        u = np.sort(g(X, theta))            # project and sort
        v = np.sort(g(Y, theta))
        acc += np.mean(np.abs(u - v) ** p)  # 1-D W_p^p via sorted samples
    return (acc / L) ** (1.0 / p)           # [average]^{1/p}
```
For the max-GSW variant, the Monte Carlo loop is replaced by gradient ascent over $\theta$ (see the sketch below).
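A minimal PyTorch sketch of that inner loop for linear slices, using projected gradient ascent on the unit sphere (illustrative only; assumes equal sample counts and is not a reference implementation):

```python
import torch

def max_sw(X, Y, p=2, steps=200, lr=0.05):
    """max-(G)SW with linear slices: ascend the sliced W_p over theta on S^{d-1}."""
    theta = torch.randn(X.shape[1], dtype=X.dtype)
    theta /= theta.norm()
    theta.requires_grad_(True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        u, v = torch.sort(X @ theta).values, torch.sort(Y @ theta).values
        loss = -(u - v).abs().pow(p).mean()   # negate: maximize the sliced W_p^p
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            theta /= theta.norm()             # project back onto the sphere
    with torch.no_grad():
        u, v = torch.sort(X @ theta).values, torch.sort(Y @ theta).values
        return (u - v).abs().pow(p).mean().pow(1.0 / p)
```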

Deterministic Approximations. When $g$ is polynomial or neural-network-based, deterministic approximations (Le et al., 2022) exploit high-dimensional concentration of measure to replace random projections with closed-form moment computations. A conditional CLT for Gaussian projections bounds the error:

  • Polynomial case: error $O(d^{-m/8})$ for degree-$m$ polynomials
  • Neural case: error $O(3^{n/4} d^{-1/4} + d^{-1/8})$ for $n$-layer networks

The computation reduces to the empirical mean and variance of lifted random variables ($O(Nd^m)$ for polynomials), yielding deterministic GSW estimates whose accuracy improves as $d$ increases.
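The reduction to moments is cheap because the one-dimensional Gaussian surrogates suggested by the CLT admit a closed-form 2-Wasserstein distance; for reference (a standard identity, not the exact estimator of Le et al., 2022):

$$W_2^2\big(\mathcal N(m_1, s_1^2),\ \mathcal N(m_2, s_2^2)\big) = (m_1 - m_2)^2 + (s_1 - s_2)^2$$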

4. Bilevel Optimization and Stein Smoothing

Recent advances embed GSW within bilevel optimization frameworks, notably min-GSW and min-SWGG. The inner problem minimizes the one-dimensional transport cost over couplings $\pi$, while the outer problem seeks the projection $\theta$ minimizing the full Wasserstein cost in the original space (Chapel et al., 28 May 2025):

$$\min_{\theta \in \mathbb R^q} \langle C_{\mu\nu}, \pi^\theta \rangle \quad \text{s.t.} \quad \pi^\theta \in \arg\min_{\pi \in U(a,b)} \langle C^{\phi}_{\mu\nu}(\theta), \pi \rangle$$

Nonlinear projections $\phi(x, \theta)$, such as MLPs or polynomials, increase expressivity and offer tighter couplings for high-dimensional or manifold-structured data (see the sketch below).
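A minimal sketch of the outer objective in the min-SWGG spirit with linear projections: the coupling $\pi^\theta$ is the permutation obtained by sorting the one-dimensional projections, evaluated on the full-dimensional squared-Euclidean cost (assumes equal sample counts and uniform weights; names are illustrative):

```python
import torch

def swgg_objective(X, Y, theta):
    """Outer cost <C_{mu,nu}, pi^theta>: pair samples in the order of their
    1-D projections, then average the full-dimensional squared distances."""
    sigma = torch.argsort(X @ theta)
    tau = torch.argsort(Y @ theta)
    return ((X[sigma] - Y[tau]) ** 2).sum(dim=1).mean()
```

Because `argsort` is piecewise constant in $\theta$, this outer objective is non-smooth, which is precisely what motivates the Stein smoothing below.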

Stein Smoothing. Since the outer objective is non-smooth, Stein's lemma yields unbiased gradient estimators for use in first-order optimization. Perturbing $\theta$ with Gaussian noise and averaging provides smooth surrogates $h_\varepsilon(\theta)$ for robust optimization. For manifold-parameterized slices (e.g., spheres), the perturbation distribution adapts accordingly (e.g., von Mises–Fisher).
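A minimal sketch of such a Gaussian-smoothing (Stein) gradient estimator in the Euclidean case; the baseline subtraction is a standard variance-reduction device, and the function names are illustrative:

```python
import torch

def stein_grad(h, theta, eps=0.1, n_samples=64):
    """Estimate grad h_eps(theta), where h_eps(theta) = E[h(theta + eps*Z)], Z ~ N(0, I).
    Stein's lemma gives grad h_eps(theta) = E[(h(theta + eps*Z) - h(theta)) * Z] / eps."""
    Z = torch.randn(n_samples, theta.numel(), dtype=theta.dtype)
    base = h(theta)
    vals = torch.stack([h(theta + eps * z) for z in Z])
    return ((vals - base).unsqueeze(1) * Z).mean(dim=0) / eps
```

The resulting estimate can drive a standard first-order update of $\theta$, e.g., on the outer objective sketched above.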

5. Extensions to Manifold-Valued Data

GSW naturally accommodates data supported on Riemannian manifolds by redefining slices as intrinsic one-dimensional submanifolds (e.g., geodesics, horospheres). In the Poincaré ball model of hyperbolic geometry, horospherical slices are indexed by $\theta \in S^{d-1}$:

$$\phi(x, \theta) = \log \|x - \theta\|^2 - \log(1 - \|x\|^2)$$

with the pushforward giving a one-dimensional measure. The optimization framework remains intact, with gradient flows and Stein smoothing adapted to the manifold setting.
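A minimal NumPy sketch of this horospherical (Busemann-type) projection for points in the open unit ball, assuming $\|x\| < 1$ and $\|\theta\| = 1$ (the function name is illustrative):

```python
import numpy as np

def horospherical_slice(X, theta):
    """phi(x, theta) = log ||x - theta||^2 - log(1 - ||x||^2) on the Poincare ball."""
    num = np.sum((X - theta) ** 2, axis=1)
    den = 1.0 - np.sum(X ** 2, axis=1)
    return np.log(num) - np.log(den)
```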

6. Applications in Generative Modeling and Set Representation

GSW has found empirical success in a range of applications:

  • Generative Modeling: Used for gradient flows matching synthetic mixtures and image datasets, GSW with higher-degree polynomials or learned neural slices improves mode matching and accelerates convergence relative to SW (Kolouri et al., 2019, Chapel et al., 28 May 2025).
  • Auto-Encoding: GSWAE and max-GSWAE incorporate GSW regularization to align encoded latent distributions with priors, outperforming SWAE and matching adversarial approaches (WAE-GAN) in latent/decoded Wasserstein distances.
  • Conditional Flow Matching: DGSWP-based CFM achieves lower FID (≈ 3.56) than standard OT-CFM (≈ 4.82), with fewer function evaluations on CIFAR-10 (Chapel et al., 28 May 2025).
  • Set Representation Learning: GSWE layers use the empirical embedding to pool set-structured data, yielding competitive or superior performance on classification and retrieval benchmarks relative to transformer pooling modules (NaderiAlizadeh et al., 2021); a simplified sketch of this pooling idea follows the list.
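A simplified sketch of that pooling idea: project the set elements along each slice, summarize the sorted projections with a fixed number of quantiles, and concatenate the results into a permutation-invariant, fixed-size vector (the quantile summarization and names here are illustrative simplifications of the GSWE layer):

```python
import numpy as np

def sliced_pool(X, thetas, k=16):
    """Permutation-invariant pooling of a set X (n x d) via sliced quantiles."""
    levels = (np.arange(k) + 0.5) / k
    return np.concatenate([np.quantile(X @ theta, levels) for theta in thetas])
```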

Empirical convergence rates and embedding stability have been characterized, with GSW showing improved qualitative transport plans and more geometrically faithful representations.

7. Limitations, Tradeoffs, and Future Directions

While GSW broadens applicability and improves computational tractability, several limitations remain:

  • If the GRT is not injective, GSW is only a pseudo-metric and may conflate distinct measures.
  • The Monte Carlo approximation incurs variance scaling $\sim 1/\sqrt{L}$ with $L$ slices, with tradeoffs between slice richness and computational cost.
  • max-GSW optimization is non-convex and sensitive to initialization.
  • Fast deterministic approximations do not extend to all slice families (e.g., circular functions remain open (Le et al., 2022)).
  • Gradient flows with linear or algebraic slices can stall in high-dimensional settings; nonlinear projections ameliorate these issues.

Promising directions include learning parametrized slice families via neural networks, extending manifold embedding techniques, and further theoretical work on sample complexity and convergence rates as $d \to \infty$. The utility of GSW in few-step generative modeling and fast mini-batch OT continues to motivate research in scalable OT-based learning pipelines.
