
Generalized Sliced Wasserstein Distance

Updated 11 November 2025
  • Generalized Sliced Wasserstein (GSW) is a metric that extends classical SW by employing nonlinear slicing functions to capture complex data structures.
  • It leverages a generalized Radon transform to project high-dimensional measures into tractable one-dimensional optimal transport problems with efficient algorithms.
  • GSW is applied in generative modeling, set representation, and manifold-valued data analysis, offering improved convergence and enhanced embedding quality.

The Generalized Sliced Wasserstein (GSW) distance is a metric on probability measures that extends the classical Sliced Wasserstein (SW) distance by leveraging nonlinear "slicing" functions, substantially enhancing representational power and computational flexibility while retaining the efficient reduction to one-dimensional optimal transport problems. GSW employs a generalized Radon transform to project measures into one-dimensional spaces via a user-chosen family of functions (such as polynomials, nonlinear neural networks, or geometric transforms), capturing complex data structures missed by linear projections. It is widely used in generative modeling, manifold-supported learning, and representation pooling for set-structured data.

1. Mathematical Definition and Generalization

Let $\mu, \nu$ be two probability measures on $\mathbb{R}^d$. Classical SW computes the $p$-Wasserstein distance between their one-dimensional projections, averaged over random directions $\theta$ on the unit sphere:

$$\mathrm{SW}_p(\mu, \nu) = \left( \int_{S^{d-1}} W_p^p\big(\langle \cdot, \theta \rangle_\# \mu,\ \langle \cdot, \theta \rangle_\# \nu\big)\, d\theta \right)^{1/p}$$

GSW extends this by replacing the inner product with a general defining function $g : \mathbb{R}^d \times \Omega \to \mathbb{R}$; the projection becomes $x \mapsto g(x, \theta)$ for a parameter $\theta \in \Omega$ (e.g., a sphere of polynomial coefficients, neural-network weights, or geometric descriptors). The GSW distance is

$$\mathrm{GSW}_{p, \mathcal H}(\mu,\nu) = \left( \int_\Omega W_p^p\big(g(\cdot,\theta)_\# \mu,\ g(\cdot,\theta)_\# \nu\big)\, d\pi(\theta) \right)^{1/p}$$

where $\pi$ is a probability measure on $\Omega$. Key choices for $g$ include homogeneous polynomials, $g(x,\theta) = \sum_{|\alpha|=m} \theta_\alpha x^\alpha$ with $m$ odd, and circular slices, $g(x, \theta) = \|x - r\theta\|_2$ for some radius $r$.
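As a concrete illustration, both defining functions can be implemented in a few lines; note that the homogeneous-polynomial slice is simply a linear slice applied to the vector of degree-$m$ monomials of $x$. A minimal NumPy sketch (the function names are illustrative, not from the cited papers):

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_slice(X, theta, m):
    """Homogeneous polynomial slice g(x, theta) = sum_{|alpha|=m} theta_alpha x^alpha,
    i.e., a linear slice of the degree-m monomial lift of each sample."""
    monomials = list(combinations_with_replacement(range(X.shape[1]), m))
    lifted = np.stack([X[:, list(idx)].prod(axis=1) for idx in monomials], axis=1)
    return lifted @ theta  # theta lives on the unit sphere in R^{#monomials}

def circular_slice(X, theta, r=1.0):
    """Circular slice g(x, theta) = ||x - r*theta||_2."""
    return np.linalg.norm(X - r * theta, axis=1)
```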

The max-GSW variant selects the single slice $\theta_*$ giving maximal separation:

$$\text{max-GSW}_{p, \mathcal H}(\mu,\nu) = \sup_{\theta \in \Omega} W_p\big(g(\cdot, \theta)_\# \mu,\ g(\cdot, \theta)_\# \nu\big)$$

GSW recovers classical SW for $g(x, \theta) = \langle x, \theta \rangle$ and $\Omega = S^{d-1}$.

2. Metric Properties, Topology, and Embedding

If the family $\{g(\cdot, \theta)\}_{\theta \in \Omega}$ is injective in the sense of the generalized Radon transform, GSW satisfies non-negativity, symmetry, the triangle inequality, and identity of indiscernibles, and is therefore a metric on the space of measures with finite $p$-th moments (Kolouri et al., 2019). In the absence of injectivity, it is only a pseudo-metric.

Key theoretical properties:

  • Topology: GSW preserves the weak topology and $p$-th moment convergence of $W_p$, provided $\mathcal H$ is "rich", i.e., sufficiently measure-separating.
  • Isometric Embedding: Given a reference measure $\nu_0$, the cumulative distribution transform (CDT):

$$\hat{\nu}_i^\theta = F_{\nu_i^\theta}^{-1} \circ F_{\nu_0^\theta} - \mathrm{id}$$

is an isometric embedding of $(\mathcal P_p, \mathrm{GSW}_p)$ into $L^p(\Omega_\theta; L^p(\nu_0^\theta))$ (NaderiAlizadeh et al., 2021). Pairwise $L^p$ distances between these embedding vectors match the true GSW distances.
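For one-dimensional empirical measures (a single fixed slice $\theta$), this embedding can be approximated with quantile functions; the following NumPy sketch treats one slice and assumes a continuous reference measure (the function name is illustrative):

```python
import numpy as np

def cdt_embed(samples, ref):
    """Empirical CDT of a 1-D measure w.r.t. a reference: evaluates
    F_nu^{-1} o F_nu0 - id at the sorted reference samples."""
    t = np.sort(ref)
    levels = (np.arange(len(t)) + 0.5) / len(t)   # approximate F_nu0 at t
    return np.quantile(samples, levels) - t

# Pairwise L^p distances between embeddings approximate W_p between the slices:
# (np.mean(np.abs(cdt_embed(a, ref) - cdt_embed(b, ref)) ** p)) ** (1 / p)
```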

3. Computational Algorithms and Fast Approximations

The fundamental advantage of GSW is computational: each projection yields a one-dimensional OT problem, solvable in $O(N \log N)$ by sorting (for $N$ samples).

Standard algorithms proceed via Monte Carlo over $\theta$:

```python
import numpy as np

def gsw(X, Y, g, sample_theta, L=100, p=2):
    """Monte Carlo GSW_p between empirical measures (assumes equal sample counts)."""
    acc = 0.0
    for _ in range(L):
        theta = sample_theta()              # sample theta_l from pi
        u = np.sort(g(X, theta))            # project and sort
        v = np.sort(g(Y, theta))
        acc += np.mean(np.abs(u - v) ** p)  # 1-D W_p^p via sorted samples
    return (acc / L) ** (1.0 / p)           # [average]^{1/p}
```
For the max-GSW variant, the Monte Carlo loop is replaced by gradient ascent over $\theta$ (see the sketch below).
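A minimal PyTorch sketch of that inner loop for linear slices, using projected gradient ascent on the unit sphere (illustrative only; assumes equal sample counts and is not a reference implementation):

```python
import torch

def max_sw(X, Y, p=2, steps=200, lr=0.05):
    """max-(G)SW with linear slices: ascend the sliced W_p over theta on S^{d-1}."""
    theta = torch.randn(X.shape[1], dtype=X.dtype)
    theta /= theta.norm()
    theta.requires_grad_(True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        u, v = torch.sort(X @ theta).values, torch.sort(Y @ theta).values
        loss = -(u - v).abs().pow(p).mean()   # negate: maximize the sliced W_p^p
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            theta /= theta.norm()             # project back onto the sphere
    with torch.no_grad():
        u, v = torch.sort(X @ theta).values, torch.sort(Y @ theta).values
        return (u - v).abs().pow(p).mean().pow(1.0 / p)
```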

Deterministic Approximations. When $g$ is polynomial or neural-network-based, deterministic approximations (Le et al., 2022) exploit high-dimensional concentration of measure to replace random projections with closed-form moment computations. A conditional CLT for Gaussian projections bounds the error:

  • Polynomial case: error $O(d^{-m/8})$ for degree-$m$ polynomials
  • Neural case: error $O(3^{n/4} d^{-1/4} + d^{-1/8})$ for $n$-layer networks

The computation reduces to the empirical mean and variance of lifted random variables ($O(Nd^m)$ for polynomials), yielding deterministic GSW estimates whose accuracy improves as $d$ increases.
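The reduction to moments is cheap because the one-dimensional Gaussian surrogates suggested by the CLT admit a closed-form 2-Wasserstein distance; for reference (a standard identity, not the exact estimator of Le et al., 2022):

$$W_2^2\big(\mathcal N(m_1, s_1^2),\ \mathcal N(m_2, s_2^2)\big) = (m_1 - m_2)^2 + (s_1 - s_2)^2$$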

4. Bilevel Optimization and Stein Smoothing

Recent advances embed GSW within bilevel optimization frameworks, notably min-GSW and min-SWGG. The inner problem minimizes the one-dimensional transport cost over couplings $\pi$, while the outer problem seeks the projection $\theta$ minimizing the full Wasserstein cost in the original space (Chapel et al., 28 May 2025):

$$\min_{\theta \in \mathbb R^q} \langle C_{\mu\nu}, \pi^\theta \rangle \quad \text{s.t.} \quad \pi^\theta \in \arg\min_{\pi \in U(a,b)} \langle C^{\phi}_{\mu\nu}(\theta), \pi \rangle$$

Nonlinear projections $\phi(x, \theta)$, such as MLPs or polynomials, increase expressivity and offer tighter couplings for high-dimensional or manifold-structured data (see the sketch below).
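A minimal sketch of the outer objective in the min-SWGG spirit with linear projections: the coupling $\pi^\theta$ is the permutation obtained by sorting the one-dimensional projections, evaluated on the full-dimensional squared-Euclidean cost (assumes equal sample counts and uniform weights; names are illustrative):

```python
import torch

def swgg_objective(X, Y, theta):
    """Outer cost <C_{mu,nu}, pi^theta>: pair samples in the order of their
    1-D projections, then average the full-dimensional squared distances."""
    sigma = torch.argsort(X @ theta)
    tau = torch.argsort(Y @ theta)
    return ((X[sigma] - Y[tau]) ** 2).sum(dim=1).mean()
```

Because `argsort` is piecewise constant in $\theta$, this outer objective is non-smooth, which is precisely what motivates the Stein smoothing below.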

Stein Smoothing. Since the outer objective is non-smooth, Stein's lemma yields unbiased gradient estimators for use in first-order optimization. Perturbing $\theta$ with Gaussian noise and averaging provides smooth surrogates $h_\varepsilon(\theta)$ for robust optimization. For manifold-parameterized slices (e.g., spheres), the perturbation distribution adapts accordingly (e.g., von Mises–Fisher).
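A minimal sketch of such a Gaussian-smoothing (Stein) gradient estimator in the Euclidean case; the baseline subtraction is a standard variance-reduction device, and the function names are illustrative:

```python
import torch

def stein_grad(h, theta, eps=0.1, n_samples=64):
    """Estimate grad h_eps(theta), where h_eps(theta) = E[h(theta + eps*Z)], Z ~ N(0, I).
    Stein's lemma gives grad h_eps(theta) = E[(h(theta + eps*Z) - h(theta)) * Z] / eps."""
    Z = torch.randn(n_samples, theta.numel(), dtype=theta.dtype)
    base = h(theta)
    vals = torch.stack([h(theta + eps * z) for z in Z])
    return ((vals - base).unsqueeze(1) * Z).mean(dim=0) / eps
```

The resulting estimate can drive a standard first-order update of $\theta$, e.g., on the outer objective sketched above.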

5. Extensions to Manifold-Valued Data

GSW naturally accommodates data supported on Riemannian manifolds by redefining slices as intrinsic one-dimensional submanifolds (e.g., geodesics, horospheres). In the Poincaré ball model of hyperbolic geometry, horospherical slices are indexed by $\theta \in S^{d-1}$:

$$\phi(x, \theta) = \log \|x - \theta\|^2 - \log(1 - \|x\|^2)$$

with the pushforward giving a one-dimensional measure. The optimization framework remains intact, with gradient flows and Stein smoothing adapted to the manifold setting.
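A minimal NumPy sketch of this horospherical (Busemann-type) projection for points in the open unit ball, assuming $\|x\| < 1$ and $\|\theta\| = 1$ (the function name is illustrative):

```python
import numpy as np

def horospherical_slice(X, theta):
    """phi(x, theta) = log ||x - theta||^2 - log(1 - ||x||^2) on the Poincare ball."""
    num = np.sum((X - theta) ** 2, axis=1)
    den = 1.0 - np.sum(X ** 2, axis=1)
    return np.log(num) - np.log(den)
```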

6. Applications in Generative Modeling and Set Representation

GSW has found empirical success in a range of applications:

  • Generative Modeling: Used for gradient flows matching synthetic mixtures and image datasets, GSW with higher-degree polynomials or learned neural slices improves mode matching and accelerates convergence relative to SW (Kolouri et al., 2019, Chapel et al., 28 May 2025).
  • Auto-Encoding: GSWAE and max-GSWAE incorporate GSW regularization to align encoded latent distributions with priors, outperforming SWAE and matching adversarial approaches (WAE-GAN) in latent/decoded Wasserstein distances.
  • Conditional Flow Matching: DGSWP-based CFM achieves lower FID (≈ 3.56) than standard OT-CFM (≈ 4.82), with fewer function evaluations on CIFAR-10 (Chapel et al., 28 May 2025).
  • Set Representation Learning: GSWE layers use the empirical embedding to pool set-structured data, yielding competitive or superior performance on classification and retrieval benchmarks relative to transformer pooling modules (NaderiAlizadeh et al., 2021); a simplified sketch of this pooling idea follows the list.
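A simplified sketch of that pooling idea: project the set elements along each slice, summarize the sorted projections with a fixed number of quantiles, and concatenate the results into a permutation-invariant, fixed-size vector (the quantile summarization and names here are illustrative simplifications of the GSWE layer):

```python
import numpy as np

def sliced_pool(X, thetas, k=16):
    """Permutation-invariant pooling of a set X (n x d) via sliced quantiles."""
    levels = (np.arange(k) + 0.5) / k
    return np.concatenate([np.quantile(X @ theta, levels) for theta in thetas])
```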

Empirical convergence rates and embedding stability have been characterized, with GSW showing improved qualitative transport plans and more geometrically faithful representations.

7. Limitations, Tradeoffs, and Future Directions

While GSW broadens applicability and improves computational tractability, several limitations remain:

  • If the GRT is not injective, GSW is only a pseudo-metric and may conflate distinct measures.
  • The Monte Carlo approximation incurs variance scaling $\sim 1/\sqrt{L}$ with $L$ slices, with tradeoffs between slice richness and computational cost.
  • max-GSW optimization is non-convex and sensitive to initialization.
  • Fast deterministic approximations do not extend to all slice families (e.g., circular functions remain open (Le et al., 2022)).
  • Gradient flows with linear or algebraic slices can stall in high-dimensional settings; nonlinear projections ameliorate these issues.

Promising directions include learning parametrized slice families via neural networks, extending manifold embedding techniques, and further theoretical work on sample complexity and convergence rates as $d \to \infty$. The utility of GSW in few-step generative modeling and fast mini-batch OT continues to motivate research in scalable OT-based learning pipelines.
