
Subspace Robust Wasserstein Distance

Updated 11 December 2025
  • Subspace robust Wasserstein distance is a metric that projects probability distributions onto lower-dimensional subspaces to enhance robustness against noise and the curse of dimensionality.
  • It employs convex relaxations and eigenvalue formulations to replace complex non-convex optimization with efficient saddle-point and gradient-based methods.
  • Practical applications include generative modeling, domain adaptation, and two-sample testing, offering dimension-free statistical rates and improved computational tractability.

The subspace robust Wasserstein distance is a family of optimal transport–based metrics designed to robustify Wasserstein distances with respect to noise and the curse of dimensionality, especially in high- or infinite-dimensional settings. The core idea is to measure the transportation cost between two probability distributions after projecting onto lower-dimensional subspaces, but in a worst-case sense, i.e., by optimizing the subspace itself adversarially or over a set of admissible directions. There are two distinct formalizations: the "projection-robust" Wasserstein distance (PRW), which is a max–min or supremum-over-subspaces of Wasserstein distances between projected measures, and the "subspace robust" Wasserstein distance (SRW), which relaxes the order of min and max, leading to a min–max or partial-trace (sum of top eigenvalues) cost. These distances interpolate between the full-dimensional Wasserstein distance when $k=d$ and the more statistically stable (sliced or randomized) variants when $k\ll d$, and have sharp statistical, geometric, and computational properties.

1. Formal Definitions and Metric Structure

Let $\mathcal H$ denote a separable real Hilbert space with norm $\|\cdot\|$, and let $\mu,\nu$ be Borel probability measures on $\mathcal H$ (for most finite-dimensional treatments, $\mathcal H=\mathbb R^d$). The set of couplings $\Pi(\mu,\nu)$ consists of all joint measures on $\mathcal H\times\mathcal H$ with respective marginals $\mu$ and $\nu$.

Given an integer $1\leq k\leq \dim\mathcal H$, define $\mathcal G_k$ as the Grassmannian of all $k$-dimensional linear subspaces of $\mathcal H$. For each $E\in\mathcal G_k$, $P_E$ denotes the orthogonal projection onto $E$.

The $k$-dimensional subspace robust Wasserstein distance of order $p$ is

$$S_{k,p}(\mu, \nu) := \inf_{\pi\in\Pi(\mu,\nu)}\;\sup_{E\in\mathcal G_k} \left( \int_{\mathcal H\times\mathcal H} \|P_E(x-y)\|^p \, d\pi(x,y) \right)^{1/p}.$$

Equivalently, $S_{k,p}(\mu, \nu)^p = \inf_{\pi\in\Pi(\mu,\nu)}\;\sup_{E\in\mathcal G_k} \int \|P_E(x-y)\|^p \, d\pi(x,y)$ (Vasan, 4 Dec 2025, Paty et al., 2019). For $p=2$, the notation $S_k(\mu,\nu)$ is conventional.

A closely related object is the projection-robust Wasserstein distance (PRW, also denoted $\mathrm{PRW}_k$), defined as

$$\mathrm{PRW}_{k,p}(\mu,\nu) = \sup_{E\in\mathcal G_k} W_p\bigl(P_{E\#}\mu,\, P_{E\#}\nu\bigr),$$

where $W_p$ is the classical $p$-Wasserstein distance. In discrete formulations, both can be written as max–min or min–max problems over the subspace and coupling variables (Paty et al., 2019, Jiang et al., 2022, Lin et al., 2020).

Paty–Cuturi (2019) establish that $S_{k,p}$ is a bona fide metric for measures with finite $p$th moments (Paty et al., 2019).
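For intuition, the following numpy sketch (illustrative only, not code from the cited papers) evaluates the quantity inside the PRW supremum for $k=1$, $p=2$, and equal-size empirical measures: for a fixed unit direction, the Wasserstein distance between the projected samples reduces to matching sorted projections, and the supremum over directions must then be approximated or optimized separately.

```python
import numpy as np

def projected_w2(theta, X, Y):
    """W_2 between the 1-D projections of two equal-size empirical measures
    onto the direction theta; the optimal 1-D coupling matches sorted values."""
    theta = theta / np.linalg.norm(theta)
    x_proj, y_proj = np.sort(X @ theta), np.sort(Y @ theta)
    return np.sqrt(np.mean((x_proj - y_proj) ** 2))

rng = np.random.default_rng(0)
d, n = 50, 200
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, d)) + 2.0 * np.eye(d)[0]   # shift along the first coordinate

# Crude lower bound on PRW_{1,2}: take the best of many random directions.
directions = rng.normal(size=(1000, d))
print(max(projected_w2(theta, X, Y) for theta in directions))
```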

2. Convex Relaxation and Eigenvalue Formulation

The min–max order in $S_{k,2}$ admits a convex relaxation based on the partial trace. For a coupling $\pi$, define the second-moment displacement matrix $V_\pi = \int (x-y)(x-y)^{\top}\,d\pi(x,y)$. Fan's maximum principle gives

$$\sup_{E\in\mathcal G_k} \int \|P_E(x-y)\|^2\,d\pi(x,y) = \sum_{i=1}^k \lambda_i(V_\pi),$$

where $\lambda_1 \geq \cdots \geq \lambda_d \geq 0$ are the ordered eigenvalues of $V_\pi$.

Thus, the SRW distance admits the representation

$$S_k^2(\mu, \nu) = \min_{\pi \in \Pi(\mu,\nu)} \sum_{i=1}^k \lambda_i(V_\pi).$$
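As a concrete illustration of this step (a minimal numpy sketch under the assumption of discrete measures and a given coupling matrix; the function name is illustrative), the inner supremum for a fixed coupling reduces to an eigendecomposition of the displacement second-moment matrix:

```python
import numpy as np

def srw_objective_for_coupling(pi, X, Y, k):
    """Sum of the top-k eigenvalues of V_pi = sum_ij pi_ij (x_i - y_j)(x_i - y_j)^T,
    i.e. the inner sup over k-dimensional subspaces in the SRW definition (p = 2)."""
    d = X.shape[1]
    V = np.zeros((d, d))
    for i in range(X.shape[0]):
        diffs = X[i] - Y                       # (m, d) displacements x_i - y_j
        V += (pi[i, :, None] * diffs).T @ diffs
    eigvals = np.linalg.eigvalsh(V)            # ascending order
    return eigvals[-k:].sum()

# Example with the independent coupling pi_ij = 1/(n*m) (feasible, but not optimal).
rng = np.random.default_rng(1)
X, Y = rng.normal(size=(30, 10)), rng.normal(size=(40, 10))
pi = np.full((30, 40), 1.0 / (30 * 40))
print(srw_objective_for_coupling(pi, X, Y, k=2))
```

Minimizing this quantity over couplings $\pi$, rather than fixing one, yields $S_k^2(\mu,\nu)$; the algorithms of Section 5 perform exactly that outer minimization.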

This is equivalent to a convex–concave saddle point problem

$$S_k^2(\mu, \nu) = \max_{\Omega \in \mathcal R_k}\ \min_{\pi \in \Pi(\mu,\nu)} \int (x-y)^\top \Omega\,(x-y)\, d\pi(x,y),$$

where $\mathcal R_k = \{\Omega \in \mathbb R^{d\times d} : 0 \preceq \Omega \preceq I_d,\ \operatorname{Tr}\Omega = k\}$ (Paty et al., 2019). This ordering yields a convex–concave saddle point: the objective is linear (hence convex) in the coupling and concave over the set of Mahalanobis weights $\Omega$.

For PRW, the non-convex max–min formulation is inherited: $\mathrm{PRW}_k^2(\mu, \nu) = \max_{U \in \operatorname{St}(d,k)} \min_{\pi \in \Pi(\mu,\nu)} \sum_{i,j} \pi_{ij}\,\|U^\top x_i - U^\top y_j\|^2$, with $U$ on the Stiefel manifold $\operatorname{St}(d,k)$ (Lin et al., 2020, Huang et al., 2020, Jiang et al., 2022).

3. Statistical Properties: Sample Complexity and Convergence

For empirical estimation, let $\{X_i\}_{i=1}^n$ be i.i.d. from $\mu$, with empirical measure $\mu_n = \frac1n\sum_{i=1}^n \delta_{X_i}$. In an infinite-dimensional Hilbert space, the main result shows

$$\frac{c}{\sqrt{\log n}} \leq \sup_{\mu}\left(\mathbb{E}\, S_1(\mu, \mu_n)^2\right)^{1/2} \leq \frac{C\sqrt{\log\log n}}{\sqrt{\log n}}$$

for universal constants $c,C>0$ (Vasan, 4 Dec 2025). For general $k$,

$$\sup_{\mu}\left(\mathbb{E}\, S_k(\mu, \mu_n)^2\right)^{1/2} \leq \sqrt{k}\,\frac{C\sqrt{\log\log n}}{\sqrt{\log n}}.$$

The proof proceeds via a decomposition on well-chosen finite-dimensional projections and operator-norm bounds.

The classical Wasserstein rate in $\mathbb R^d$ is $O(n^{-1/d})$, which degenerates rapidly for large $d$. The SRW and PRW distances, in contrast, have dimension-free rates that depend only logarithmically on $n$ (SRW, Hilbert setting) or polynomially in $k$ (PRW, finite dimension) (Lin et al., 2020): $$\mathbb{E}\,\bigl[\mathrm{PRW}_{p,k}(\hat\mu_n, \mu_\star)\bigr] \lesssim n^{-1/\max\{2p,\,k\}}(\log n)^{\zeta},$$ with technical variations depending on tail assumptions.
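As a rough illustration of the gap between the two regimes (the numbers below are illustrative and ignore constants and logarithmic factors), compare the sample sizes needed to reach accuracy $\varepsilon$:

$$n^{-1/d} \leq \varepsilon \iff n \gtrsim \varepsilon^{-d}, \qquad n^{-1/\max\{2p,k\}} \leq \varepsilon \iff n \gtrsim \varepsilon^{-\max\{2p,k\}}.$$

For example, with $\varepsilon = 0.1$, $p = 2$, $k = 2$, and ambient dimension $d = 30$, the classical rate requires on the order of $10^{30}$ samples, whereas the PRW rate requires on the order of $10^{4}$.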

The lower bound for $S_1$ is unimprovable up to a $\sqrt{\log\log n}$ factor (Vasan, 4 Dec 2025).

4. Geometric, Robustness, and Metric Properties

The SRW and PRW distances inherit key properties from optimal transport geometry (Paty et al., 2019):

  • Both are metrics under suitable moment conditions.
  • They satisfy the tight bounds $\sqrt{k/d}\,W_p(\mu,\nu) \leq S_k(\mu,\nu) \leq W_p(\mu,\nu)$ (see the derivation sketch after this list).
  • Increments in $k$ are concave: increasing $k$ improves discrimination with diminishing returns.
  • Dirac consistency: $S_k(\delta_x,\delta_y) = \|x-y\|$.
  • Geodesic properties: interpolants along OT plans are geodesics under $S_k$.
  • Stability: trimming away noise directions (the smallest $d-k$ eigenmodes) yields robustness to high-frequency (isotropic) perturbations.
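A short derivation sketch for the bound in the second bullet, in the case $p=2$ where it follows directly from the eigenvalue formulation of Section 2: for any coupling $\pi$, the sum of the top $k$ eigenvalues of $V_\pi$ is at least a $k/d$ fraction of its trace (the largest $k$ eigenvalues average at least the overall average) and at most the full trace, so

$$\frac{k}{d}\int \|x-y\|^2\, d\pi(x,y) \;=\; \frac{k}{d}\operatorname{Tr}(V_\pi) \;\leq\; \sum_{i=1}^k \lambda_i(V_\pi) \;\leq\; \operatorname{Tr}(V_\pi) \;=\; \int \|x-y\|^2\, d\pi(x,y).$$

Minimizing each term over $\pi \in \Pi(\mu,\nu)$ gives $\frac{k}{d}\,W_2^2(\mu,\nu) \leq S_k^2(\mu,\nu) \leq W_2^2(\mu,\nu)$, and taking square roots yields the stated inequality.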

Empirical studies on synthetic “fragmented hypercube” models and real datasets (e.g., word-embedding distributions from film scripts) show that $S_k$ reflects intrinsic low-dimensional structure and clusters semantically similar objects with greater stability to noise and outliers (Paty et al., 2019, Lin et al., 2020).

5. Computational Methods and Algorithms

Computation of SRW and PRW is challenging due to non-convexity (for PRW) and projection-over-subspace maximization. For SRW, convex relaxations and entropic regularization via Sinkhorn's algorithm are central, with two practical algorithms (Paty et al., 2019):

  • Projected non-smooth supergradient ascent: maximizes over $\Omega \in \mathcal R_k$ using the supergradient $V_{\pi^*(\Omega)}$.
  • Frank–Wolfe with entropic regularization: adds smoothness, with per-iteration complexity $O(nm + d^3)$ (see the sketch below).
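The following is a minimal Python sketch of the Frank–Wolfe approach (an illustrative implementation, not the authors' released code): for a fixed $\Omega$, the entropically regularized inner OT problem is solved by Sinkhorn iterations; the (super)gradient in $\Omega$ is the displacement matrix $V_\pi$ of the resulting coupling; and the linear maximization of $\langle V_\pi, \Omega\rangle$ over $\mathcal R_k$ is attained at the projector onto the top-$k$ eigenvectors of $V_\pi$. Regularization strength, step sizes, and iteration counts below are placeholders.

```python
import numpy as np

def sinkhorn_plan(C, a, b, reg=0.1, iters=200):
    """Entropy-regularized OT plan for cost matrix C and marginals a, b."""
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def srw2_frank_wolfe(X, Y, k, n_iter=30, reg=0.1):
    """Frank-Wolfe sketch for S_k^2: maximize min_pi <V_pi, Omega> over Omega in R_k."""
    n, m, d = X.shape[0], Y.shape[0], X.shape[1]
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    Omega = (k / d) * np.eye(d)                                # feasible start: trace k, 0 <= Omega <= I
    diffs = X[:, None, :] - Y[None, :, :]                      # (n, m, d) displacements
    for t in range(n_iter):
        C = np.einsum('ijd,de,ije->ij', diffs, Omega, diffs)   # Mahalanobis costs (x_i-y_j)^T Omega (x_i-y_j)
        pi = sinkhorn_plan(C, a, b, reg)
        V = np.einsum('ij,ijd,ije->de', pi, diffs, diffs)      # displacement matrix V_pi
        w, U = np.linalg.eigh(V)                               # ascending eigenvalues
        Omega_fw = U[:, -k:] @ U[:, -k:].T                     # linear-maximization vertex: top-k projector
        Omega = (1 - 2.0 / (t + 2)) * Omega + (2.0 / (t + 2)) * Omega_fw
    return np.sum(pi * C)                                      # entropic surrogate of S_k^2

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 20))
Y = rng.normal(size=(60, 20)) + 1.0
print(srw2_frank_wolfe(X, Y, k=3))
```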

For PRW, manifold optimization is employed. The following table summarizes primary algorithms and their properties:

Algorithm | Problem Formulation | Complexity Bound
RBCD | Nonconvex max–min (Stiefel, OT) | $O(\epsilon^{-3})$ (Huang et al., 2020)
iRBBS (ReALM) | Manifold-constrained ALM | $O(\epsilon^{-3})$ (Jiang et al., 2022)
RGAS/RAGAS/RSGAN | Riemannian (entropic/exact OT) | $O(\epsilon^{-12})$ (RGAS), $O(\epsilon^{-4})$ (RSGAN) (Lin et al., 2020)

All methods alternate between subspace updates (Riemannian gradient/ascent/retraction steps on the Stiefel manifold) and cost/coupling updates (via Sinkhorn or network simplex OT solvers). Retractions implemented via QR, polar, Cayley, or exponential maps preserve orthonormality constraints.
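A condensed sketch of this alternation for PRW with $p=2$ (illustrative only; the cited algorithms add careful step-size rules, regularization schedules, and convergence safeguards): given the current $U$, solve an entropic OT problem on the projected costs, form the Riemannian gradient of $\langle V_\pi, UU^\top\rangle$ on the Stiefel manifold, take an ascent step, and retract via a QR factorization.

```python
import numpy as np

def sinkhorn_plan(C, a, b, reg=0.1, iters=200):
    """Entropy-regularized OT plan (same routine as in the SRW sketch above)."""
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def prw2_riemannian_ascent(X, Y, k, n_iter=50, step=0.05, reg=0.1):
    """Max-min sketch for PRW_k^2: Riemannian gradient ascent over U in St(d, k),
    alternating with entropic OT on the projected costs."""
    n, m, d = X.shape[0], Y.shape[0], X.shape[1]
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    U = np.linalg.qr(np.random.default_rng(3).normal(size=(d, k)))[0]
    diffs = X[:, None, :] - Y[None, :, :]                      # (n, m, d)
    for _ in range(n_iter):
        C = np.sum((diffs @ U) ** 2, axis=2)                   # ||U^T (x_i - y_j)||^2
        pi = sinkhorn_plan(C, a, b, reg)
        V = np.einsum('ij,ijd,ije->de', pi, diffs, diffs)      # displacement matrix V_pi
        G = 2.0 * V @ U                                        # Euclidean gradient of <V_pi, U U^T>
        A = U.T @ G
        rgrad = G - U @ (A + A.T) / 2.0                        # project G onto the tangent space at U
        U = np.linalg.qr(U + step * rgrad)[0]                  # QR retraction back onto St(d, k)
    return np.sum(pi * C)

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 20))
Y = rng.normal(size=(60, 20)) + 1.0
print(prw2_riemannian_ascent(X, Y, k=3))
```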

Empirical performance (CPU time, convergence) on high-dimensional embedding and image data strongly favors RBCD and ReALM/iRBBS in practical regimes, with speedups of $5\times$ to $30\times$ over older Riemannian methods (Jiang et al., 2022, Huang et al., 2020).

6. Statistical Inference, Applications, and Practical Recommendations

Subspace robust Wasserstein distances have broad applications in generative modeling, domain adaptation, two-sample testing, and minimum-distance parametric inference—especially when data have low intrinsic dimension in high-dimensional ambient spaces (Vasan, 4 Dec 2025, Lin et al., 2020).

Key findings:

  • Minimum PRW estimators are consistent under weak conditions, even under model misspecification. Central limit theorems are established for $k=1$ (max-sliced) (Lin et al., 2020).
  • The optimal $k$ should match the intrinsic data dimension; $k=1$ (“sliced”) gives optimal rates $O(n^{-1/2})$, while higher $k$ improves geometric fidelity but increases sample and computational complexity.
  • SRW is more computationally tractable via convex relaxation; PRW is statistically more powerful but non-convex.
  • Empirical performance demonstrates robustness to noise and improved clustering for word embedding and high-dimensional data tasks.

Averaged variants, such as Integral PRW (IPRW, based on integration instead of supremum over subspaces), offer smoother, easier-to-estimate models but are less discriminative (Lin et al., 2020).
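For the two-sample testing use case, here is a hedged sketch of one possible protocol (not a procedure prescribed in the cited papers): use a projected distance as the test statistic and calibrate it by permutation. The statistic below is a crude approximation of the $k=1$ (max-sliced) PRW, obtained by maximizing the sorted-projection $W_2$ of Section 1's sketch over a fixed pool of random directions.

```python
import numpy as np

def max_sliced_stat(X, Y, directions):
    """Approximate PRW_{1,2} by maximizing the projected W_2 over a fixed pool
    of candidate directions (equal-size samples; optimal 1-D coupling = sorting)."""
    best = 0.0
    for theta in directions:
        theta = theta / np.linalg.norm(theta)
        gap = np.sort(X @ theta) - np.sort(Y @ theta)
        best = max(best, np.sqrt(np.mean(gap ** 2)))
    return best

def permutation_pvalue(X, Y, n_perm=200, n_dirs=100, seed=0):
    """Two-sample permutation test using the max-sliced statistic."""
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(n_dirs, X.shape[1]))
    observed = max_sliced_stat(X, Y, directions)
    pooled, n = np.vstack([X, Y]), X.shape[0]
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled.shape[0])
        count += max_sliced_stat(pooled[perm[:n]], pooled[perm[n:]], directions) >= observed
    return (1 + count) / (1 + n_perm)

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 30))
Y = rng.normal(size=(100, 30)) + 0.5       # mean shift: expect a small p-value
print(permutation_pvalue(X, Y))
```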

7. Open Problems and Future Directions

Significant open questions and directions include (Vasan, 4 Dec 2025, Jiang et al., 2022, Lin et al., 2020):

  • Tightening statistical rates (removal of $\sqrt{\log\log n}$ factors, high-probability bounds).
  • Extension to general order $p > 2$, requiring Schatten-$p$ norm estimates.
  • Minimax optimality under weaker moment or heavy-tailed assumptions.
  • Theory and algorithms for continuous (non-discrete) measures.
  • Acceleration via Riemannian trust-region or momentum techniques.
  • Tuning and automation of entropic regularization parameters.
  • Extensions to barycenter computation, deep generative models, and distributed or streaming data scenarios.
  • Understanding global optimality and landscape of the PRW optimization problem.

These lines of inquiry are central for further development of high-dimensional robust optimal transport and its application to modern statistical and machine learning tasks.


References:

  • (Vasan, 4 Dec 2025): Vasan (2024), "Convergence rate of empirical measures in the subspace robust Wasserstein distance."
  • (Paty et al., 2019): Paty–Cuturi (2019), "Subspace Robust Wasserstein Distances."
  • (Jiang et al., 2022): ReALM, iRBBS: Riemannian exponential augmented Lagrangian method for PRW computation.
  • (Huang et al., 2020): Huang et al., RBCD algorithm.
  • (Lin et al., 2020): Riemannian optimization and computational theory for PRW.
  • (Lin et al., 2020): Lin–Ho, "On Projection Robust Optimal Transport: Sample Complexity and Model Misspecification."
