Mixed Sliced Wasserstein (Mix-SW)
- Mix-SW is an optimal transport distance that extends the classical sliced Wasserstein by using adaptive, data-driven slice distributions for improved discrimination.
- It interpolates between uniform slicing and maximal slicing, with rigorous theoretical bounds and formulations for both general probability spaces and Gaussian mixture models.
- Efficient Monte Carlo computation and PAC-Bayesian analysis demonstrate Mix-SW’s practical impact in clustering, generative modeling, and perceptual evaluation.
Mixed Sliced Wasserstein (Mix-SW) describes a family of optimal transport distances extending the Sliced Wasserstein (SW) distance by introducing adaptive or data-driven slice distributions. These distances interpolate between the uniform slicing of traditional SW, the maximal slicing of max-SW, and arbitrary mixed or learnable distributions, improving both discriminative power and computational efficiency. Notably, Mix-SW has been formulated and analyzed in general probability spaces (Ohana et al., 2022) and for Gaussian Mixture Models (GMMs) in high-dimensional machine learning settings (Piening et al., 11 Apr 2025).
1. Formal Definition and Key Variants
The Mix-SW distance operates by projecting high-dimensional probability distributions onto one-dimensional subspaces and aggregating the Wasserstein distances of these projections according to a slice distribution. Given two probability measures $\mu, \nu$ on $\mathbb{R}^d$ and a probability measure (slice distribution) $\sigma$ on the unit sphere $\mathbb{S}^{d-1}$,

$$\mathrm{SW}_p(\mu, \nu; \sigma) = \left( \int_{\mathbb{S}^{d-1}} W_p^p\big(\theta_\#\mu, \theta_\#\nu\big)\, \mathrm{d}\sigma(\theta) \right)^{1/p},$$

where $\theta_\#\mu$ is the push-forward of $\mu$ by the projection $x \mapsto \langle \theta, x \rangle$ and $W_p$ is the $p$-Wasserstein distance in one dimension. The classical sliced Wasserstein is recovered for $\sigma = \mathrm{Unif}(\mathbb{S}^{d-1})$, while max-SW corresponds to optimizing $\sigma$ over Dirac masses.
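This definition can be sketched numerically. The following is a minimal NumPy illustration (function names are illustrative, not from the cited papers), contrasting a uniform slice distribution with a Dirac slice aligned with the direction that actually separates two measures:

```python
import numpy as np

def w_p_1d(u, v, p=2):
    """p-Wasserstein distance between two equal-size 1-D empirical
    measures with uniform weights: pair sorted order statistics."""
    u, v = np.sort(u), np.sort(v)
    return np.mean(np.abs(u - v) ** p) ** (1.0 / p)

def sliced_w(X, Y, thetas, p=2):
    """Monte Carlo sliced Wasserstein: average the projected W_p^p
    over the given directions (rows of `thetas`, assumed unit-norm)."""
    proj_x = X @ thetas.T          # (n, P) projections <theta, x_i>
    proj_y = Y @ thetas.T
    vals = [w_p_1d(proj_x[:, j], proj_y[:, j], p) ** p
            for j in range(thetas.shape[0])]
    return np.mean(vals) ** (1.0 / p)

rng = np.random.default_rng(0)
d, n, P = 5, 200, 64
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, d)) + np.r_[3.0, np.zeros(d - 1)]  # shift along e_1

# Uniform slice distribution (classical SW): normalized Gaussian directions.
thetas = rng.normal(size=(P, d))
thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
sw_uniform = sliced_w(X, Y, thetas)

# Dirac slice distribution on e_1 (max-SW style): sees the full shift.
e1 = np.zeros((1, d)); e1[0, 0] = 1.0
sw_dirac = sliced_w(X, Y, e1)

print(sw_uniform, sw_dirac)
```

Because the Dirac slice concentrates on the discriminative direction, it reports a larger distance than uniform averaging; this is precisely the trade-off that Mix-SW interpolates.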
In the context of Gaussian mixtures, with mixtures $\mu = \sum_{k=1}^{K} \alpha_k\, \mathcal{N}(m_k, \Sigma_k)$ and $\nu = \sum_{l=1}^{K} \beta_l\, \mathcal{N}(\tilde{m}_l, \tilde{\Sigma}_l)$,

$$\mathrm{MixSW}(\mu, \nu) = \sup_{\sigma \in \mathcal{P}(\mathbb{S}^{d-1})} \left( \int_{\mathbb{S}^{d-1}} \mathrm{MW}^2\big(\theta_\#\mu, \theta_\#\nu\big)\, \mathrm{d}\sigma(\theta) \right)^{1/2},$$

where $\theta_\#\mu$ denotes the push-forward mixture in one dimension (itself a one-dimensional GMM) and $\mathcal{P}(\mathbb{S}^{d-1})$ denotes the set of probability measures on the sphere (Piening et al., 11 Apr 2025).
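A useful fact behind this formulation is that the one-dimensional push-forward of a GMM is itself a GMM, with the same weights, means $\langle \theta, m_k \rangle$, and variances $\theta^{\top} \Sigma_k \theta$. A small NumPy sketch (the helper name and example parameters are hypothetical):

```python
import numpy as np

def project_gmm(weights, means, covs, theta):
    """Push a d-dimensional GMM forward along unit direction `theta`:
    the result is a 1-D GMM with the same weights, means <theta, m_k>,
    and variances theta^T Sigma_k theta."""
    means_1d = means @ theta                               # (K,)
    vars_1d = np.einsum('i,kij,j->k', theta, covs, theta)  # (K,)
    return weights, means_1d, vars_1d

# Hypothetical 2-component GMM in d = 3, projected onto e_1.
weights = np.array([0.4, 0.6])
means = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, 0.0]])
covs = np.stack([np.eye(3), 2.0 * np.eye(3)])
theta = np.array([1.0, 0.0, 0.0])

w, m1, v1 = project_gmm(weights, means, covs, theta)
print(m1, v1)  # means [0., 2.], variances [1., 2.]
```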
2. Theoretical Properties and Relationships
Mix-SW distances interpolate between SW and the Mixture-Wasserstein (MW) distance. The following inequalities hold:

$$\mathrm{SW}(\mu, \nu) \;\le\; \mathrm{MixSW}(\mu, \nu) \;\le\; \mathrm{MW}(\mu, \nu),$$

with SW the average over uniform directions and MW involving a discrete coupling of the GMM components via the quadratic Gaussian cost (Piening et al., 11 Apr 2025).
Notably, all of the distances mentioned metrize the same topology on compactly supported probability measures: Mix-SW induces the same weak topology as SW and MW. On compact sets of GMM parameters, strong (norm) equivalence can be shown between SW and Mix-SW, and thus between Mix-SW and MW, up to constants.
In the adaptive SW case, PAC-Bayesian generalization bounds of the following schematic form are established (Ohana et al., 2022): with high probability $1 - \delta$ over the empirical samples, simultaneously for all slice distributions $\rho$,

$$\mathrm{SW}(\mu, \nu; \rho) \;\ge\; \widehat{\mathrm{SW}}_n(\mu, \nu; \rho) \;-\; \sqrt{\frac{V\left(\mathrm{KL}(\rho \,\|\, \pi) + \ln \tfrac{1}{\delta}\right)}{n}} \;-\; B_n,$$

where $\widehat{\mathrm{SW}}_n$ is the empirical estimate, $V$ reflects a variance term, $B_n$ a bias term, and $\pi$ a slice-prior.
3. Computation and Algorithmic Frameworks
Efficient computation of Mix-SW leverages the closed form of $W_p$ in one dimension and the tractability of projecting data and mixtures. Given empirical measures $\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{x_i}$ and $\hat{\nu}_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{y_i}$, the one-dimensional Wasserstein distance between the projected empirical measures admits the expression

$$W_p^p\big(\theta_\#\hat{\mu}_n, \theta_\#\hat{\nu}_n\big) = \frac{1}{n} \sum_{i=1}^{n} \big| \langle \theta, x_{(i)} \rangle - \langle \theta, y_{(i)} \rangle \big|^p,$$

with the projections sorted in increasing order, so each projection costs $O(n \log n)$ (Ohana et al., 2022).
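The sorted expression is exact for equal-weight empirical measures because monotone rearrangement is the optimal coupling in one dimension; this can be checked against a brute-force search over all permutation couplings (a didactic sketch, feasible only for tiny $n$; function names are illustrative):

```python
import numpy as np
from itertools import permutations

def w2_sorted(u, v):
    """O(n log n) one-dimensional 2-Wasserstein via sorted order statistics."""
    return np.sqrt(np.mean((np.sort(u) - np.sort(v)) ** 2))

def w2_bruteforce(u, v):
    """Exact W2 between equal-weight empiricals by trying every
    permutation coupling (only feasible for very small n)."""
    n = len(u)
    best = min(sum((u[i] - v[p[i]]) ** 2 for i in range(n))
               for p in permutations(range(n)))
    return np.sqrt(best / n)

rng = np.random.default_rng(1)
u, v = rng.normal(size=6), rng.normal(size=6)
print(w2_sorted(u, v), w2_bruteforce(u, v))  # identical values
```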
For GMMs, the supremum in Mix-SW is typically approximated by Monte Carlo sampling of directions $\theta_1, \dots, \theta_P \in \mathbb{S}^{d-1}$,

$$\mathrm{MixSW}(\mu, \nu) \;\approx\; \max_{j = 1, \dots, P} \mathrm{MW}\big(\theta_{j\#}\mu, \theta_{j\#}\nu\big),$$

where $\mathrm{MW}$ between the projected one-dimensional mixtures is computed via sorting over the $K$ components per mixture, leading to a total complexity that is near-linear in both $P$ and $K$ beyond the cost of the projections themselves (Piening et al., 11 Apr 2025).
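A minimal sketch of this Monte Carlo scheme follows, under simplifying assumptions: equal numbers of components with uniform weights, so the discrete transport over components reduces to an assignment, solved here by brute force for small $K$ (for general weights one would call a linear-programming solver such as POT's `ot.emd2`). All function names are illustrative, not from the cited paper.

```python
import numpy as np
from itertools import permutations

def gauss_w2_sq_1d(m1, s1, m2, s2):
    """Closed-form squared W2 between 1-D Gaussians:
    (m1 - m2)^2 + (s1 - s2)^2, with s the standard deviation."""
    return (m1 - m2) ** 2 + (s1 - s2) ** 2

def mw2_1d_equal_weights(ma, sa, mb, sb):
    """MW between two 1-D GMMs with equal uniform weights: discrete
    transport over components, brute-forced over permutations."""
    K = len(ma)
    cost = min(sum(gauss_w2_sq_1d(ma[i], sa[i], mb[p[i]], sb[p[i]])
                   for i in range(K))
               for p in permutations(range(K)))
    return np.sqrt(cost / K)

def mix_sw_mc(means_a, covs_a, means_b, covs_b, P, rng):
    """Approximate the supremum over slice directions by keeping the
    largest 1-D MW value over P random directions."""
    d = means_a.shape[1]
    best = 0.0
    for _ in range(P):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)
        sa = np.sqrt(np.einsum('i,kij,j->k', theta, covs_a, theta))
        sb = np.sqrt(np.einsum('i,kij,j->k', theta, covs_b, theta))
        best = max(best, mw2_1d_equal_weights(means_a @ theta, sa,
                                              means_b @ theta, sb))
    return best

rng = np.random.default_rng(2)
means_a = np.array([[0., 0.], [4., 0.]]); covs_a = np.stack([np.eye(2)] * 2)
means_b = np.array([[0., 3.], [4., 3.]]); covs_b = np.stack([np.eye(2)] * 2)
val = mix_sw_mc(means_a, covs_a, means_b, covs_b, P=200, rng=rng)
print(val)  # close to 3: the best sampled direction is near the shift e_2
```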
In adaptive SW settings, the slice distribution can be parameterized (e.g., as a von Mises–Fisher distribution) and learned via stochastic gradient ascent. This enables optimization of slice selection under a PAC-Bayesian regularized objective of the schematic form

$$\max_{\rho} \; \widehat{\mathrm{SW}}_n(\mu, \nu; \rho) - \lambda\, \mathrm{KL}(\rho \,\|\, \pi).$$
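As a simplified sketch of learned slicing, the following optimizes a single direction (a Dirac slice, i.e., the fully concentrated limit of a von Mises–Fisher distribution) by projected gradient ascent on the projected squared Wasserstein distance; the cited papers instead learn a full slice density with a KL regularizer, and the helper below is illustrative only:

```python
import numpy as np

def sw2_sq_and_grad(X, Y, theta):
    """Projected squared 2-Wasserstein along `theta` and its gradient.
    With the sorted pairing held fixed, W2^2 = mean <theta, z_i>^2 with
    z_i = x_(i) - y_(i), whose gradient is 2 * mean(<theta, z_i> z_i)."""
    px, py = X @ theta, Y @ theta
    Z = X[np.argsort(px)] - Y[np.argsort(py)]
    t = Z @ theta
    return np.mean(t ** 2), 2.0 * (Z.T @ t) / len(t)

rng = np.random.default_rng(3)
d, n = 8, 400
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, d)); Y[:, 0] += 2.5   # distributions differ along e_1

theta = rng.normal(size=d); theta /= np.linalg.norm(theta)
for _ in range(100):
    val, grad = sw2_sq_and_grad(X, Y, theta)
    theta = theta + 0.05 * grad          # ascent step on the objective
    theta /= np.linalg.norm(theta)       # project back onto the sphere

print(np.abs(theta[0]))  # the learned slice aligns with the shift direction
```

In practice this ascent behaves like a power iteration on the matched-difference covariance, which is why a concentrated slice recovers the discriminative direction quickly.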
4. Statistical Generalization and PAC-Bayesian Analysis
PAC-Bayesian theory provides finite-sample generalization guarantees for adaptive SW, capturing the impact of slice-distribution learning on statistical performance (Ohana et al., 2022). The theorems link the generalization gap between empirical and population Mix-SW to a combination of the empirical fit, a slice-variance term, the KL-divergence between the learned and prior slice distributions, and a sample-complexity bias term (typically vanishing at rate $O(n^{-1/2})$ under compact support).
Explicit choices for the variance and complexity parameters can be made; e.g., for support contained in a ball of radius $R$, the variance proxy can be taken proportional to $R^2$ and the bias term to decay on the order of $R/\sqrt{n}$.
Optimizing the PAC-Bayesian lower bound with respect to the slice distribution $\rho$ leads to more discriminative, data-driven distinctions between distributions.
5. Empirical Performance and Applications
Mix-SW and adaptive SW approaches have been empirically validated in synthetic and real-world scenarios spanning clustering, generative modeling, and perceptual metrics:
- Gaussian separation: On synthetic Gaussian distributions, adaptive Mix-SW variants (PAC-SW, DSW) achieve higher discrimination and better generalization than uniform SW, while max-SW is unstable (Ohana et al., 2022).
- Fashion-MNIST: In class differentiation tasks, PAC-SW yields larger test SW, indicating enhanced discrimination (Ohana et al., 2022).
- Generative models: As a loss for GANs or VAEs, using learned slice distributions accelerates generator training compared to uniform SW or max-SW, even when the slice distribution is updated infrequently (Ohana et al., 2022).
- Clustering and cluster detection: MixSW displays a sharp drop at the true number of clusters in GMMs, supporting cluster number recovery (Piening et al., 11 Apr 2025).
- Perceptual image metrics (WaM): Replacing MW with MixSW yields nearly identical perceptual evaluations while reducing computation time by orders of magnitude (Piening et al., 11 Apr 2025).
- GMM minimization and barycenters: MixSW allows efficient quantization and barycenter computation in settings where MW is computationally prohibitive (Piening et al., 11 Apr 2025).
A summary of reported empirical advantages is as follows:
| Task | Uniform SW | Max-SW | DSW/PAC-SW/MixSW |
|---|---|---|---|
| Gaussian separation | Slow, weak | Unstable | High discrimination |
| Fashion-MNIST discrimination | Low | Unstable | Higher test distance |
| GAN/VAE training | Slower | Unstable | Faster, robust convergence |
| Clustering (GMM) | – | – | Sharp clusters detected |
| Perceptual metrics (WaM) | – | – | Fast, matches MW curves |
6. Computational Complexity and Scalability
The main practical advantage of Mix-SW lies in computational scalability. For $K$-component GMMs and $P$ projections:
- MixSW: near-linear cost in both $P$ and $K$, since each direction only requires projecting the components and solving a one-dimensional transport problem (Piening et al., 11 Apr 2025)
- MW: substantially higher cost (includes linear programming over the mixture weights and a spectrally decomposed Gaussian cost, which scales cubically in the dimension $d$)
Empirical results show that a few hundred projections ($P \le 500$) suffice for high-precision approximation, and Mix-SW reduces runtime from several minutes to a few seconds relative to MW in perceptual pipelines. The approach is highly amenable to vectorized and parallel implementation in numerical software (e.g., Python with POT, NumPy, or PyTorch) (Piening et al., 11 Apr 2025).
7. Outlook and Extensions
The Mix-SW paradigm encompasses both random-slice (Monte Carlo) and learned or adaptive slice distributions, with extensions allowing for more expressive classes (e.g., neural-network push-forwards for the slice distribution), provided that the necessary variational objectives and empirical approximations remain tractable (Ohana et al., 2022). Its flexible formulation, theoretical guarantees, and significant empirical acceleration in high-dimensional problems indicate broad applicability across clustering, distribution comparison, image processing, and generative models.
A plausible implication is that further exploration of richer slice families and integration with modern optimal transport solvers could yield additional improvements in both statistical power and computational flexibility. The distinctions between uniform, maximum, and mixed slicing highlight key trade-offs between stability, generalization, and computational burden in Wasserstein-based distances.