Subspace Robust Wasserstein Distance
- The subspace robust Wasserstein distance is a metric that measures optimal transport cost over worst-case projections of probability distributions onto lower-dimensional subspaces, enhancing robustness against noise and the curse of dimensionality.
- It employs convex relaxations and eigenvalue formulations to replace complex non-convex optimization with efficient saddle-point and gradient-based methods.
- Practical applications include generative modeling, domain adaptation, and two-sample testing, offering dimension-free statistical rates and improved computational tractability.
The subspace robust Wasserstein distance is a family of optimal transport–based metrics designed to robustify Wasserstein distances with respect to noise and the curse of dimensionality, especially in high- or infinite-dimensional settings. The core idea is to measure the transportation cost between two probability distributions after projecting onto lower-dimensional subspaces, but in a worst-case sense, i.e., by optimizing the subspace itself adversarially or over a set of admissible directions. There are two distinct formalizations: the "projection-robust" Wasserstein distance (PRW), a max–min problem given by a supremum over subspaces of Wasserstein distances between projected measures, and the "subspace robust" Wasserstein distance (SRW), which relaxes the order of min and max, leading to a min–max or partial-trace (sum of top eigenvalues) cost. These distances interpolate between the full-dimensional Wasserstein distance when the subspace dimension $k$ equals the ambient dimension and the more statistically stable (sliced or randomized) variants when $k = 1$, and they have sharp statistical, geometric, and computational properties.
1. Formal Definitions and Metric Structure
Let $\mathcal{H}$ denote a separable real Hilbert space with norm $\|\cdot\|$, and let $\mu, \nu$ be Borel probability measures on $\mathcal{H}$ (for most finite-dimensional treatments, $\mathcal{H} = \mathbb{R}^d$). The set of couplings $\Pi(\mu,\nu)$ consists of all joint probability measures on $\mathcal{H} \times \mathcal{H}$ with respective marginals $\mu$ and $\nu$.
Given an integer $k \ge 1$, define $\mathcal{G}_k$ as the Grassmannian of all $k$-dimensional linear subspaces of $\mathcal{H}$. For each $E \in \mathcal{G}_k$, $P_E$ denotes the orthogonal projection onto $E$.
The $k$-dimensional subspace robust Wasserstein distance of order $p$ is
$$
S_k(\mu,\nu) \;=\; \Big( \inf_{\pi \in \Pi(\mu,\nu)} \ \sup_{E \in \mathcal{G}_k} \int \|P_E(x-y)\|^p \, d\pi(x,y) \Big)^{1/p}.
$$
Equivalently, $S_k^p(\mu,\nu) = \inf_{\pi \in \Pi(\mu,\nu)} \sup_{E \in \mathcal{G}_k} \mathbb{E}_{\pi}\big[\|P_E(X-Y)\|^p\big]$ (Vasan, 4 Dec 2025, Paty et al., 2019). For $p = 2$, the plain notation $S_k$ is conventional.
A closely related object is the projection-robust Wasserstein distance (PRW, also denoted $P_k$), defined as
$$
P_k(\mu,\nu) \;=\; \sup_{E \in \mathcal{G}_k} W_p\big(P_{E\#}\mu,\, P_{E\#}\nu\big),
$$
where $W_p$ is the classical $p$-Wasserstein distance and $P_{E\#}\mu$ denotes the pushforward of $\mu$ under $P_E$. In discrete formulations, both can be written as max–min or min–max problems over the subspace and coupling variables (Paty et al., 2019, Jiang et al., 2022, Lin et al., 2020).
Paty–Cuturi (2019) establish that $S_k$ is a bona fide metric on measures with finite $p$-th moments (Paty et al., 2019).
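To make the definitions concrete for discrete data, the following sketch (illustrative only, not code from the cited papers; helper names such as `prw1_lower_bound` are ad hoc) computes the exact $W_2$ between two uniform empirical measures of equal size via optimal assignment, together with a Monte-Carlo lower bound on $P_1$ obtained by maximizing the projected one-dimensional $W_2$ over random directions.

```python
# Illustrative sketch: empirical W_2 via optimal assignment, and a crude
# random-direction lower bound on the k = 1 projection-robust distance P_1,
# for two uniform discrete measures with the same number of atoms.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def wasserstein2_uniform(X, Y):
    """Exact W_2 between uniform measures on the rows of X and Y (equal sizes)."""
    C = cdist(X, Y, "sqeuclidean")           # squared-distance cost matrix
    r, c = linear_sum_assignment(C)           # optimal coupling is a permutation here
    return np.sqrt(C[r, c].mean())

def w2_1d(a, b):
    """Exact one-dimensional W_2 between uniform measures: pair sorted samples."""
    return np.sqrt(np.mean((np.sort(a) - np.sort(b)) ** 2))

def prw1_lower_bound(X, Y, n_dirs=500, seed=0):
    """Monte-Carlo lower bound on P_1: maximize projected W_2 over random unit directions."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_dirs):
        u = rng.standard_normal(X.shape[1])
        u /= np.linalg.norm(u)
        best = max(best, w2_1d(X @ u, Y @ u))
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    shift = np.zeros(20); shift[0] = 2.0      # the two clouds differ along one direction
    X = rng.standard_normal((200, 20))
    Y = rng.standard_normal((200, 20)) + shift
    print("W_2   =", wasserstein2_uniform(X, Y))
    print("P_1  >=", prw1_lower_bound(X, Y))
```

Since projections are contractions and max–min is dominated by min–max, the reported numbers respect the ordering $P_1 \le S_1 \le W_2$; the random-direction search only certifies a lower bound on $P_1$.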
2. Convex Relaxation and Eigenvalue Formulation
The min–max order in $S_k$ admits a convex relaxation based on the partial trace. For a coupling $\pi$, define the second-moment displacement matrix
$$
V_\pi \;=\; \int (x-y)(x-y)^{\top} \, d\pi(x,y).
$$
Fan's maximum principle gives
$$
\sup_{E \in \mathcal{G}_k} \int \|P_E(x-y)\|^2 \, d\pi(x,y) \;=\; \sum_{i=1}^{k} \lambda_i(V_\pi),
$$
where $\lambda_1 \ge \lambda_2 \ge \cdots$ are the ordered eigenvalues of $V_\pi$.
Thus, the SRW distance admits the eigenvalue (partial-trace) formulation
$$
S_k^2(\mu,\nu) \;=\; \min_{\pi \in \Pi(\mu,\nu)} \ \sum_{i=1}^{k} \lambda_i(V_\pi).
$$
This is equivalent to a convex–concave saddle-point problem
$$
S_k^2(\mu,\nu) \;=\; \min_{\pi \in \Pi(\mu,\nu)} \ \max_{\Omega \in \mathcal{R}_k} \int (x-y)^{\top} \Omega\, (x-y) \, d\pi(x,y),
$$
where $\mathcal{R}_k = \{\Omega : 0 \preceq \Omega \preceq I,\ \operatorname{tr}(\Omega) = k\}$ (Paty et al., 2019). This order of optimization makes the objective convex (indeed linear) in the coupling $\pi$ and concave (linear) in $\Omega$ over the set $\mathcal{R}_k$ of Mahalanobis weights.
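As a numerical sanity check of Fan's principle for a fixed coupling, here is a minimal sketch (assuming uniform discrete measures and taking the assignment coupling; it does not perform the outer minimization over couplings that defines $S_k$) comparing the sum of the top-$k$ eigenvalues of $V_\pi$ with the best projected cost found over random $k$-dimensional subspaces.

```python
# Partial-trace (Fan) formula for a fixed coupling: the supremum over k-dim
# subspaces of the projected quadratic cost equals the sum of the k largest
# eigenvalues of the displacement matrix V_pi.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n, d, k = 100, 10, 3
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, d)) @ np.diag(np.linspace(0.2, 2.0, d))

# A fixed coupling: the optimal assignment for the full quadratic cost.
r, c = linear_sum_assignment(cdist(X, Y, "sqeuclidean"))
D = X[r] - Y[c]                       # displacements x_i - y_{sigma(i)}
V = (D.T @ D) / n                     # V_pi = E_pi[(x - y)(x - y)^T]

top_k = np.sort(np.linalg.eigvalsh(V))[::-1][:k].sum()

# Random k-dim subspaces never exceed the eigenvalue value and approach it.
best_proj = 0.0
for _ in range(2000):
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))   # random orthonormal k-frame
    best_proj = max(best_proj, np.trace(U.T @ V @ U))  # projected second moment
print("sum of top-k eigenvalues:", top_k)
print("best random projection  :", best_proj)
```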
For PRW, the non-convex max–min formulation is inherited:
$$
P_k^p(\mu,\nu) \;=\; \max_{U \in \mathrm{St}(d,k)} \ \min_{\pi \in \Pi(\mu,\nu)} \int \|U^{\top}(x-y)\|^p \, d\pi(x,y),
$$
with $U$ ranging over the Stiefel manifold $\mathrm{St}(d,k) = \{U \in \mathbb{R}^{d \times k} : U^{\top}U = I_k\}$ (Lin et al., 2020, Huang et al., 2020, Jiang et al., 2022).
3. Statistical Properties: Sample Complexity and Convergence
For empirical estimation, let $X_1,\dots,X_n$ be i.i.d. from $\mu$, with empirical measure $\mu_n = \tfrac{1}{n}\sum_{i=1}^n \delta_{X_i}$. In an infinite-dimensional Hilbert space, the main result bounds $\mathbb{E}\,S_k(\mu_n,\mu)$ by a rate that is free of the ambient dimension, with universal constants (Vasan, 4 Dec 2025); an analogous bound holds for general order $p$.
The proof proceeds via a decomposition on well-chosen finite-dimensional projections and operator-norm bounds.
The classical Wasserstein rate in $\mathbb{R}^d$ is of order $n^{-1/d}$, which degenerates rapidly for large $d$. The SRW and PRW distances, in contrast, enjoy dimension-free rates, with at most logarithmic dependence (SRW, Hilbert setting) or polynomial dependence on $k$ (PRW, finite dimension) (Lin et al., 2020), up to technical variations depending on tail assumptions.
The accompanying lower bound shows that this rate is unimprovable up to a multiplicative factor (Vasan, 4 Dec 2025).
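A small simulation sketch (assuming i.i.d. standard Gaussian data and a crude random-direction search for the $k = 1$ projection; not drawn from the cited papers) illustrates the contrast between the two-sample empirical behavior of the full $W_2$ and a max-sliced surrogate as $n$ grows in a moderately high-dimensional ambient space.

```python
# Two independent samples from the same Gaussian: the full empirical W_2 decays
# very slowly in high dimension, while the max-sliced (k = 1) surrogate decays
# much faster. No claims beyond what the script itself prints.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def w2_uniform(X, Y):
    C = cdist(X, Y, "sqeuclidean")
    r, c = linear_sum_assignment(C)
    return np.sqrt(C[r, c].mean())

def max_sliced_w2(X, Y, n_dirs=200, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    best = 0.0
    for _ in range(n_dirs):
        u = rng.standard_normal(X.shape[1])
        u /= np.linalg.norm(u)
        best = max(best, np.sqrt(np.mean((np.sort(X @ u) - np.sort(Y @ u)) ** 2)))
    return best

rng = np.random.default_rng(0)
d = 30
for n in (50, 200, 800):
    X, Y = rng.standard_normal((n, d)), rng.standard_normal((n, d))
    print(f"n={n:4d}  W_2={w2_uniform(X, Y):.3f}  max-sliced~{max_sliced_w2(X, Y, rng=rng):.3f}")
```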
4. Geometric, Robustness, and Metric Properties
The SRW and PRW distances inherit key properties from optimal transport geometry (Paty et al., 2019):
- Both are metrics for suitable moment conditions.
- They satisfy the sandwich bounds $P_k(\mu,\nu) \le S_k(\mu,\nu) \le W_p(\mu,\nu)$, and for $p = 2$ additionally $\tfrac{k}{d}\,W_2^2(\mu,\nu) \le S_k^2(\mu,\nu)$ (tight).
- The map $k \mapsto S_k^2(\mu,\nu)$ is nondecreasing with nonincreasing increments (concave in $k$): increasing $k$ improves discrimination at diminishing returns.
- Dirac consistency: $S_k(\delta_x, \delta_y) = \|x - y\|$ for every $k \ge 1$.
- Geodesic properties: displacement interpolants along optimal transport plans are geodesics under $S_k$.
- Stability: trimming away noise directions (smallest eigenmodes) yields robustness to high-frequency (isotropic) perturbations.
Empirical studies on synthetic "fragmented hypercube" models and real datasets (e.g., word-embedding distributions from film scripts) show that $S_k$ reflects intrinsic low-dimensional structure and clusters semantically similar objects, with greater stability to noise and outliers (Paty et al., 2019, Lin et al., 2020).
5. Computational Methods and Algorithms
Computation of SRW and PRW is challenging because of the maximization over subspaces and, for PRW, the non-convexity of the max–min problem. For SRW, the convex relaxation and entropic regularization via Sinkhorn's algorithm are central, with two practical algorithms (Paty et al., 2019):
- Projected non-smooth supergradient ascent: maximizes the concave outer objective over $\Omega \in \mathcal{R}_k$ using the supergradient $V_{\pi^\star(\Omega)}$, the displacement matrix of an optimal coupling for the Mahalanobis cost induced by $\Omega$, followed by projection back onto $\mathcal{R}_k$.
- Frank–Wolfe with entropic regularization: entropic smoothing of the inner OT problem makes the outer objective differentiable; each iteration combines a regularized OT (Sinkhorn) solve with a top-$k$ eigendecomposition of $V_\pi$ (a rough sketch of this scheme appears below).
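The sketch below gives a rough, hypothetical Frank–Wolfe-style implementation for uniform discrete measures; unlike the algorithm of Paty et al. (2019) it solves the inner OT step exactly by assignment rather than through entropic regularization, and the name `srw_frank_wolfe` is ad hoc.

```python
# Frank-Wolfe-style heuristic for the SRW saddle point on uniform discrete
# measures: alternate an exact OT solve for the Mahalanobis cost induced by
# Omega with a Frank-Wolfe step toward the top-k eigenprojector of V_pi.
import numpy as np
from scipy.optimize import linear_sum_assignment

def srw_frank_wolfe(X, Y, k, n_iter=30):
    n, d = X.shape
    Omega = (k / d) * np.eye(d)               # feasible start: 0 <= Omega <= I, tr(Omega) = k
    diff = X[:, None, :] - Y[None, :, :]      # (n, n, d) pairwise displacements
    for t in range(n_iter):
        # Inner minimization over couplings: OT with cost (x - y)^T Omega (x - y).
        C = np.einsum("ijd,de,ije->ij", diff, Omega, diff)
        r, c = linear_sum_assignment(C)
        D = X[r] - Y[c]
        V = (D.T @ D) / n                     # V_pi: supergradient of the concave outer objective
        # Linear maximization oracle over {0 <= Omega <= I, tr = k}: top-k eigenprojector of V.
        w, U = np.linalg.eigh(V)
        P = U[:, -k:] @ U[:, -k:].T
        gamma = 2.0 / (t + 2.0)
        Omega = (1 - gamma) * Omega + gamma * P   # Frank-Wolfe update stays feasible
    # Sum of the k largest eigenvalues of V_pi for the last coupling: an upper bound on S_k^2.
    return float(np.sqrt(w[-k:].sum())), Omega

# Hypothetical usage: two clouds related by an anisotropic stretch.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
Y = X @ np.diag(np.linspace(0.5, 1.5, 10))
val, Omega = srw_frank_wolfe(X, Y, k=2)
print("approximate S_2:", val)
```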
For PRW, manifold optimization is employed. The following table summarizes primary algorithms and their properties:
| Algorithm | Problem Formulation | Complexity / Reference |
|---|---|---|
| RBCD | Nonconvex max–min over the Stiefel manifold and OT coupling | Iteration-complexity guarantees to an approximate stationary point (Huang et al., 2020) |
| iRBBS (ReALM) | Manifold-constrained augmented Lagrangian | Convergence and complexity analysis (Jiang et al., 2022) |
| RGAS / RAGAS / RSGAN | Riemannian ascent with entropic or exact OT subproblems | Complexity bounds for RGAS and RSGAN (Lin et al., 2020) |
All methods alternate between subspace updates (Riemannian gradient/ascent/retraction steps on the Stiefel manifold) and cost/coupling updates (via Sinkhorn or network simplex OT solvers). Retractions implemented via QR, polar, Cayley, or exponential maps preserve orthonormality constraints.
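For illustration, here is a minimal sketch of one Riemannian gradient ascent step with a QR retraction for the PRW subspace variable $U$, assuming the displacement matrix $V_\pi$ of the current coupling is given; the function `stiefel_ascent_step` is hypothetical, and the cited algorithms differ in step-size rules and in how they interleave OT coupling updates.

```python
# One projected-gradient ascent step for f(U) = tr(U^T V_pi U) on the Stiefel
# manifold, with a QR retraction; repeated steps drive the trace toward the
# sum of the top-k eigenvalues of V_pi.
import numpy as np

def stiefel_ascent_step(U, V_pi, step=0.05):
    G = 2.0 * V_pi @ U                                  # Euclidean gradient
    sym = 0.5 * (U.T @ G + G.T @ U)
    rgrad = G - U @ sym                                 # projection onto the tangent space at U
    Q, _ = np.linalg.qr(U + step * rgrad)               # QR retraction back onto the manifold
    return Q

d, k = 10, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d))
V_pi = A @ A.T / d                                      # a PSD stand-in for the displacement matrix
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
for _ in range(300):
    U = stiefel_ascent_step(U, V_pi)
print("trace after ascent   :", np.trace(U.T @ V_pi @ U))
print("top-k eigenvalue sum :", np.sort(np.linalg.eigvalsh(V_pi))[-k:].sum())
```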
Empirical performance (CPU time, convergence) on high-dimensional embedding and image data strongly favors RBCD and ReALM/iRBBS in practical regimes, with substantial reported speedups over earlier Riemannian methods (Jiang et al., 2022, Huang et al., 2020).
6. Statistical Inference, Applications, and Practical Recommendations
Subspace robust Wasserstein distances have broad applications in generative modeling, domain adaptation, two-sample testing, and minimum-distance parametric inference—especially when data have low intrinsic dimension in high-dimensional ambient spaces (Vasan, 4 Dec 2025, Lin et al., 2020).
Key findings:
- Minimum PRW estimators are consistent under weak conditions, even under model misspecification. Central limit theorems are established in the max-sliced ($k = 1$) case (Lin et al., 2020).
- The choice of $k$ should match the intrinsic data dimension; $k = 1$ ("sliced") attains near-parametric rates of order $n^{-1/2}$, while larger $k$ improves geometric fidelity at the cost of higher sample and computational complexity.
- SRW is more computationally tractable via convex relaxation; PRW is statistically more powerful but non-convex.
- Empirical performance demonstrates robustness to noise and improved clustering for word embedding and high-dimensional data tasks.
Averaged variants, such as the Integral PRW (IPRW, based on integration rather than a supremum over subspaces), offer smoother, easier-to-estimate alternatives but are less discriminative (Lin et al., 2020).
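A minimal sketch of such an averaged variant for uniform discrete measures, assuming Monte-Carlo sampling of random $k$-frames in place of exact integration over the Grassmannian (helper names are ad hoc, not from the cited work), is:

```python
# Monte-Carlo estimate of an integral (averaged) projection-robust quantity:
# average the squared projected W_2 over random k-dimensional subspaces
# instead of taking a supremum.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def projected_w2(X, Y, U):
    """Exact W_2 between the uniform measures projected onto span(U)."""
    C = cdist(X @ U, Y @ U, "sqeuclidean")
    r, c = linear_sum_assignment(C)
    return np.sqrt(C[r, c].mean())

def iprw_estimate(X, Y, k, n_proj=100, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    vals = []
    for _ in range(n_proj):
        U, _ = np.linalg.qr(rng.standard_normal((d, k)))  # random orthonormal k-frame
        vals.append(projected_w2(X, Y, U) ** 2)
    return float(np.sqrt(np.mean(vals)))                  # (average of W_2^2)^(1/2)
```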
7. Open Problems and Future Directions
Significant open questions and directions include (Vasan, 4 Dec 2025, Jiang et al., 2022, Lin et al., 2020):
- Tightening statistical rates (removal of extraneous factors in the bounds, high-probability versions of the in-expectation results).
- Extension to general order $p$, requiring Schatten-$p$ norm estimates.
- Minimax optimality under weaker moment or heavy-tailed assumptions.
- Theory and algorithms for continuous (non-discrete) measures.
- Acceleration via Riemannian trust-region or momentum techniques.
- Tuning and automation of entropic regularization parameters.
- Extensions to barycenter computation, deep generative models, and distributed or streaming data scenarios.
- Understanding global optimality and landscape of the PRW optimization problem.
These lines of inquiry are central for further development of high-dimensional robust optimal transport and its application to modern statistical and machine learning tasks.
References:
- (Vasan, 4 Dec 2025): Vasan (2025), "Convergence rate of empirical measures in the subspace robust Wasserstein distance."
- (Paty et al., 2019): Paty and Cuturi (2019), "Subspace Robust Wasserstein Distances."
- (Jiang et al., 2022): Jiang et al. (2022), ReALM/iRBBS: Riemannian exponential augmented Lagrangian method for PRW computation.
- (Huang et al., 2020): Huang et al. (2020), Riemannian block coordinate descent (RBCD) algorithm for PRW.
- (Lin et al., 2020): Lin et al. (2020), Riemannian optimization and computational theory for PRW.
- (Lin et al., 2020): Lin and Ho (2020), "On Projection Robust Optimal Transport: Sample Complexity and Model Misspecification."