Subspace Robust Wasserstein Distance
- The subspace robust Wasserstein distance is a metric that measures optimal transport cost over worst-case projections of probability distributions onto lower-dimensional subspaces, enhancing robustness against noise and the curse of dimensionality.
- It employs convex relaxations and eigenvalue formulations to replace complex non-convex optimization with efficient saddle-point and gradient-based methods.
- Practical applications include generative modeling, domain adaptation, and two-sample testing, offering dimension-free statistical rates and improved computational tractability.
The subspace robust Wasserstein distance is a family of optimal transport–based metrics designed to robustify Wasserstein distances with respect to noise and the curse of dimensionality, especially in high- or infinite-dimensional settings. The core idea is to measure the transportation cost between two probability distributions after projecting onto lower-dimensional subspaces, but in a worst-case sense, i.e., by optimizing the subspace itself adversarially or over a set of admissible directions. There are two distinct formalizations: the "projection-robust" Wasserstein distance (PRW), a max–min problem given by a supremum over subspaces of Wasserstein distances between projected measures, and the "subspace robust" Wasserstein distance (SRW), which relaxes the order of min and max, leading to a min–max or partial-trace (sum of top eigenvalues) cost. These distances interpolate between the full-dimensional Wasserstein distance when the subspace dimension $k$ equals the ambient dimension and the more statistically stable (sliced or randomized) variants when $k = 1$, and they have sharp statistical, geometric, and computational properties.
1. Formal Definitions and Metric Structure
Let $\mathcal{H}$ denote a separable real Hilbert space with norm $\|\cdot\|$, and let $\mu, \nu$ be Borel probability measures on $\mathcal{H}$ (for most finite-dimensional treatments, $\mathcal{H} = \mathbb{R}^d$). The set of couplings $\Pi(\mu,\nu)$ consists of all joint probability measures on $\mathcal{H} \times \mathcal{H}$ with respective marginals $\mu$ and $\nu$.
Given an integer $k \ge 1$, define $\mathcal{G}_k$ as the Grassmannian of all $k$-dimensional linear subspaces of $\mathcal{H}$. For each $E \in \mathcal{G}_k$, $P_E$ denotes the orthogonal projection onto $E$.
The $k$-dimensional subspace robust Wasserstein distance of order $p$ is
$$
S_k(\mu,\nu) \;=\; \Big( \inf_{\pi \in \Pi(\mu,\nu)} \ \sup_{E \in \mathcal{G}_k} \int \|P_E(x-y)\|^p \, d\pi(x,y) \Big)^{1/p}.
$$
Equivalently, $S_k^p(\mu,\nu) = \inf_{\pi \in \Pi(\mu,\nu)} \sup_{E \in \mathcal{G}_k} \mathbb{E}_{\pi}\big[\|P_E(X-Y)\|^p\big]$ (Vasan, 4 Dec 2025, Paty et al., 2019). For $p = 2$, the plain notation $S_k$ is conventional.
A closely related object is the projection-robust Wasserstein distance (PRW, also denoted $P_k$), defined as
$$
P_k(\mu,\nu) \;=\; \sup_{E \in \mathcal{G}_k} W_p\big(P_{E\#}\mu,\, P_{E\#}\nu\big),
$$
where $W_p$ is the classical $p$-Wasserstein distance and $P_{E\#}\mu$ denotes the pushforward of $\mu$ under $P_E$. In discrete formulations, both can be written as max–min or min–max problems over the subspace and coupling variables (Paty et al., 2019, Jiang et al., 2022, Lin et al., 2020).
Paty–Cuturi (2019) establish that $S_k$ is a bona fide metric on measures with finite $p$-th moments (Paty et al., 2019).
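To make the definitions concrete for discrete data, the following sketch (illustrative only, not code from the cited papers; helper names such as `prw1_lower_bound` are ad hoc) computes the exact $W_2$ between two uniform empirical measures of equal size via optimal assignment, together with a Monte-Carlo lower bound on $P_1$ obtained by maximizing the projected one-dimensional $W_2$ over random directions.

```python
# Illustrative sketch: empirical W_2 via optimal assignment, and a crude
# random-direction lower bound on the k = 1 projection-robust distance P_1,
# for two uniform discrete measures with the same number of atoms.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def wasserstein2_uniform(X, Y):
    """Exact W_2 between uniform measures on the rows of X and Y (equal sizes)."""
    C = cdist(X, Y, "sqeuclidean")           # squared-distance cost matrix
    r, c = linear_sum_assignment(C)           # optimal coupling is a permutation here
    return np.sqrt(C[r, c].mean())

def w2_1d(a, b):
    """Exact one-dimensional W_2 between uniform measures: pair sorted samples."""
    return np.sqrt(np.mean((np.sort(a) - np.sort(b)) ** 2))

def prw1_lower_bound(X, Y, n_dirs=500, seed=0):
    """Monte-Carlo lower bound on P_1: maximize projected W_2 over random unit directions."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_dirs):
        u = rng.standard_normal(X.shape[1])
        u /= np.linalg.norm(u)
        best = max(best, w2_1d(X @ u, Y @ u))
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    shift = np.zeros(20); shift[0] = 2.0      # the two clouds differ along one direction
    X = rng.standard_normal((200, 20))
    Y = rng.standard_normal((200, 20)) + shift
    print("W_2   =", wasserstein2_uniform(X, Y))
    print("P_1  >=", prw1_lower_bound(X, Y))
```

Since projections are contractions and max–min is dominated by min–max, the reported numbers respect the ordering $P_1 \le S_1 \le W_2$; the random-direction search only certifies a lower bound on $P_1$.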
2. Convex Relaxation and Eigenvalue Formulation
The min–max order in $S_k$ admits a convex relaxation based on the partial trace. For a coupling $\pi$, define the second-moment displacement matrix
$$
V_\pi \;=\; \int (x-y)(x-y)^{\top} \, d\pi(x,y).
$$
Fan's maximum principle gives
$$
\sup_{E \in \mathcal{G}_k} \int \|P_E(x-y)\|^2 \, d\pi(x,y) \;=\; \sum_{i=1}^{k} \lambda_i(V_\pi),
$$
where $\lambda_1 \ge \lambda_2 \ge \cdots$ are the ordered eigenvalues of $V_\pi$.
Thus, the SRW distance admits the eigenvalue (partial-trace) formulation
$$
S_k^2(\mu,\nu) \;=\; \min_{\pi \in \Pi(\mu,\nu)} \ \sum_{i=1}^{k} \lambda_i(V_\pi).
$$
This is equivalent to a convex–concave saddle-point problem
$$
S_k^2(\mu,\nu) \;=\; \min_{\pi \in \Pi(\mu,\nu)} \ \max_{\Omega \in \mathcal{R}_k} \int (x-y)^{\top} \Omega\, (x-y) \, d\pi(x,y),
$$
where $\mathcal{R}_k = \{\Omega : 0 \preceq \Omega \preceq I,\ \operatorname{tr}(\Omega) = k\}$ (Paty et al., 2019). This order of optimization makes the objective convex (indeed linear) in the coupling $\pi$ and concave (linear) in $\Omega$ over the set $\mathcal{R}_k$ of Mahalanobis weights.
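As a numerical sanity check of Fan's principle for a fixed coupling, here is a minimal sketch (assuming uniform discrete measures and taking the assignment coupling; it does not perform the outer minimization over couplings that defines $S_k$) comparing the sum of the top-$k$ eigenvalues of $V_\pi$ with the best projected cost found over random $k$-dimensional subspaces.

```python
# Partial-trace (Fan) formula for a fixed coupling: the supremum over k-dim
# subspaces of the projected quadratic cost equals the sum of the k largest
# eigenvalues of the displacement matrix V_pi.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n, d, k = 100, 10, 3
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, d)) @ np.diag(np.linspace(0.2, 2.0, d))

# A fixed coupling: the optimal assignment for the full quadratic cost.
r, c = linear_sum_assignment(cdist(X, Y, "sqeuclidean"))
D = X[r] - Y[c]                       # displacements x_i - y_{sigma(i)}
V = (D.T @ D) / n                     # V_pi = E_pi[(x - y)(x - y)^T]

top_k = np.sort(np.linalg.eigvalsh(V))[::-1][:k].sum()

# Random k-dim subspaces never exceed the eigenvalue value and approach it.
best_proj = 0.0
for _ in range(2000):
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))   # random orthonormal k-frame
    best_proj = max(best_proj, np.trace(U.T @ V @ U))  # projected second moment
print("sum of top-k eigenvalues:", top_k)
print("best random projection  :", best_proj)
```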
For PRW, the non-convex max–min formulation is inherited:
$$
P_k^p(\mu,\nu) \;=\; \max_{U \in \mathrm{St}(d,k)} \ \min_{\pi \in \Pi(\mu,\nu)} \int \|U^{\top}(x-y)\|^p \, d\pi(x,y),
$$
with $U$ ranging over the Stiefel manifold $\mathrm{St}(d,k) = \{U \in \mathbb{R}^{d \times k} : U^{\top}U = I_k\}$ (Lin et al., 2020, Huang et al., 2020, Jiang et al., 2022).
3. Statistical Properties: Sample Complexity and Convergence
For empirical estimation, let $X_1,\dots,X_n$ be i.i.d. from $\mu$, with empirical measure $\mu_n = \tfrac{1}{n}\sum_{i=1}^n \delta_{X_i}$. In an infinite-dimensional Hilbert space, the main result bounds $\mathbb{E}\,S_k(\mu_n,\mu)$ by a rate that is free of the ambient dimension, with universal constants (Vasan, 4 Dec 2025); an analogous bound holds for general order $p$.
The proof proceeds via a decomposition on well-chosen finite-dimensional projections and operator-norm bounds.
The classical Wasserstein rate in $\mathbb{R}^d$ is of order $n^{-1/d}$, which degenerates rapidly for large $d$. The SRW and PRW distances, in contrast, enjoy dimension-free rates, with at most logarithmic dependence (SRW, Hilbert setting) or polynomial dependence on $k$ (PRW, finite dimension) (Lin et al., 2020), up to technical variations depending on tail assumptions.
The accompanying lower bound shows that this rate is unimprovable up to a multiplicative factor (Vasan, 4 Dec 2025).
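A small simulation sketch (assuming i.i.d. standard Gaussian data and a crude random-direction search for the $k = 1$ projection; not drawn from the cited papers) illustrates the contrast between the two-sample empirical behavior of the full $W_2$ and a max-sliced surrogate as $n$ grows in a moderately high-dimensional ambient space.

```python
# Two independent samples from the same Gaussian: the full empirical W_2 decays
# very slowly in high dimension, while the max-sliced (k = 1) surrogate decays
# much faster. No claims beyond what the script itself prints.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def w2_uniform(X, Y):
    C = cdist(X, Y, "sqeuclidean")
    r, c = linear_sum_assignment(C)
    return np.sqrt(C[r, c].mean())

def max_sliced_w2(X, Y, n_dirs=200, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    best = 0.0
    for _ in range(n_dirs):
        u = rng.standard_normal(X.shape[1])
        u /= np.linalg.norm(u)
        best = max(best, np.sqrt(np.mean((np.sort(X @ u) - np.sort(Y @ u)) ** 2)))
    return best

rng = np.random.default_rng(0)
d = 30
for n in (50, 200, 800):
    X, Y = rng.standard_normal((n, d)), rng.standard_normal((n, d))
    print(f"n={n:4d}  W_2={w2_uniform(X, Y):.3f}  max-sliced~{max_sliced_w2(X, Y, rng=rng):.3f}")
```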
4. Geometric, Robustness, and Metric Properties
The SRW and PRW distances inherit key properties from optimal transport geometry (Paty et al., 2019):
- Both are metrics for suitable moment conditions.
- They satisfy the sandwich bounds $P_k(\mu,\nu) \le S_k(\mu,\nu) \le W_p(\mu,\nu)$, and for $p = 2$ additionally $\tfrac{k}{d}\,W_2^2(\mu,\nu) \le S_k^2(\mu,\nu)$ (tight).
- The map $k \mapsto S_k^2(\mu,\nu)$ is nondecreasing with nonincreasing increments (concave in $k$): increasing $k$ improves discrimination at diminishing returns.
- Dirac consistency: $S_k(\delta_x, \delta_y) = \|x - y\|$ for every $k \ge 1$.
- Geodesic properties: displacement interpolants along optimal transport plans are geodesics under $S_k$.
- Stability: trimming away noise directions (smallest eigenmodes) yields robustness to high-frequency (isotropic) perturbations.
Empirical studies on synthetic "fragmented hypercube" models and real datasets (e.g., word-embedding distributions from film scripts) show that $S_k$ reflects intrinsic low-dimensional structure and clusters semantically similar objects, with greater stability to noise and outliers (Paty et al., 2019, Lin et al., 2020).
5. Computational Methods and Algorithms
Computation of SRW and PRW is challenging because of the maximization over subspaces and, for PRW, the non-convexity of the max–min problem. For SRW, the convex relaxation and entropic regularization via Sinkhorn's algorithm are central, with two practical algorithms (Paty et al., 2019):
- Projected non-smooth supergradient ascent: maximizes the concave outer objective over $\Omega \in \mathcal{R}_k$ using the supergradient $V_{\pi^\star(\Omega)}$, the displacement matrix of an optimal coupling for the Mahalanobis cost induced by $\Omega$, followed by projection back onto $\mathcal{R}_k$.
- Frank–Wolfe with entropic regularization: entropic smoothing of the inner OT problem makes the outer objective differentiable; each iteration combines a regularized OT (Sinkhorn) solve with a top-$k$ eigendecomposition of $V_\pi$ (a rough sketch of this scheme appears below).
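The sketch below gives a rough, hypothetical Frank–Wolfe-style implementation for uniform discrete measures; unlike the algorithm of Paty et al. (2019) it solves the inner OT step exactly by assignment rather than through entropic regularization, and the name `srw_frank_wolfe` is ad hoc.

```python
# Frank-Wolfe-style heuristic for the SRW saddle point on uniform discrete
# measures: alternate an exact OT solve for the Mahalanobis cost induced by
# Omega with a Frank-Wolfe step toward the top-k eigenprojector of V_pi.
import numpy as np
from scipy.optimize import linear_sum_assignment

def srw_frank_wolfe(X, Y, k, n_iter=30):
    n, d = X.shape
    Omega = (k / d) * np.eye(d)               # feasible start: 0 <= Omega <= I, tr(Omega) = k
    diff = X[:, None, :] - Y[None, :, :]      # (n, n, d) pairwise displacements
    for t in range(n_iter):
        # Inner minimization over couplings: OT with cost (x - y)^T Omega (x - y).
        C = np.einsum("ijd,de,ije->ij", diff, Omega, diff)
        r, c = linear_sum_assignment(C)
        D = X[r] - Y[c]
        V = (D.T @ D) / n                     # V_pi: supergradient of the concave outer objective
        # Linear maximization oracle over {0 <= Omega <= I, tr = k}: top-k eigenprojector of V.
        w, U = np.linalg.eigh(V)
        P = U[:, -k:] @ U[:, -k:].T
        gamma = 2.0 / (t + 2.0)
        Omega = (1 - gamma) * Omega + gamma * P   # Frank-Wolfe update stays feasible
    # Sum of the k largest eigenvalues of V_pi for the last coupling: an upper bound on S_k^2.
    return float(np.sqrt(w[-k:].sum())), Omega

# Hypothetical usage: two clouds related by an anisotropic stretch.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
Y = X @ np.diag(np.linspace(0.5, 1.5, 10))
val, Omega = srw_frank_wolfe(X, Y, k=2)
print("approximate S_2:", val)
```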
For PRW, manifold optimization is employed. The following table summarizes primary algorithms and their properties:
| Algorithm | Problem Formulation | Complexity / Reference |
|---|---|---|
| RBCD | Nonconvex max–min over the Stiefel manifold and OT coupling | Iteration-complexity guarantees to an approximate stationary point (Huang et al., 2020) |
| iRBBS (ReALM) | Manifold-constrained augmented Lagrangian | Convergence and complexity analysis (Jiang et al., 2022) |
| RGAS / RAGAS / RSGAN | Riemannian ascent with entropic or exact OT subproblems | Complexity bounds for RGAS and RSGAN (Lin et al., 2020) |
All methods alternate between subspace updates (Riemannian gradient/ascent/retraction steps on the Stiefel manifold) and cost/coupling updates (via Sinkhorn or network simplex OT solvers). Retractions implemented via QR, polar, Cayley, or exponential maps preserve orthonormality constraints.
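For illustration, here is a minimal sketch of one Riemannian gradient ascent step with a QR retraction for the PRW subspace variable $U$, assuming the displacement matrix $V_\pi$ of the current coupling is given; the function `stiefel_ascent_step` is hypothetical, and the cited algorithms differ in step-size rules and in how they interleave OT coupling updates.

```python
# One projected-gradient ascent step for f(U) = tr(U^T V_pi U) on the Stiefel
# manifold, with a QR retraction; repeated steps drive the trace toward the
# sum of the top-k eigenvalues of V_pi.
import numpy as np

def stiefel_ascent_step(U, V_pi, step=0.05):
    G = 2.0 * V_pi @ U                                  # Euclidean gradient
    sym = 0.5 * (U.T @ G + G.T @ U)
    rgrad = G - U @ sym                                 # projection onto the tangent space at U
    Q, _ = np.linalg.qr(U + step * rgrad)               # QR retraction back onto the manifold
    return Q

d, k = 10, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d))
V_pi = A @ A.T / d                                      # a PSD stand-in for the displacement matrix
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
for _ in range(300):
    U = stiefel_ascent_step(U, V_pi)
print("trace after ascent   :", np.trace(U.T @ V_pi @ U))
print("top-k eigenvalue sum :", np.sort(np.linalg.eigvalsh(V_pi))[-k:].sum())
```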
Empirical performance (CPU time, convergence) on high-dimensional embedding and image data strongly favors RBCD and ReALM/iRBBS in practical regimes, with substantial reported speedups over earlier Riemannian methods (Jiang et al., 2022, Huang et al., 2020).
6. Statistical Inference, Applications, and Practical Recommendations
Subspace robust Wasserstein distances have broad applications in generative modeling, domain adaptation, two-sample testing, and minimum-distance parametric inference—especially when data have low intrinsic dimension in high-dimensional ambient spaces (Vasan, 4 Dec 2025, Lin et al., 2020).
Key findings:
- Minimum PRW estimators are consistent under weak conditions, even under model misspecification. Central limit theorems are established in the max-sliced ($k = 1$) case (Lin et al., 2020).
- The choice of $k$ should match the intrinsic data dimension; $k = 1$ ("sliced") attains near-parametric rates of order $n^{-1/2}$, while larger $k$ improves geometric fidelity at the cost of higher sample and computational complexity.
- SRW is more computationally tractable via convex relaxation; PRW is statistically more powerful but non-convex.
- Empirical performance demonstrates robustness to noise and improved clustering for word embedding and high-dimensional data tasks.
Averaged variants, such as the Integral PRW (IPRW, based on integration rather than a supremum over subspaces), offer smoother, easier-to-estimate alternatives but are less discriminative (Lin et al., 2020).
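A minimal sketch of such an averaged variant for uniform discrete measures, assuming Monte-Carlo sampling of random $k$-frames in place of exact integration over the Grassmannian (helper names are ad hoc, not from the cited work), is:

```python
# Monte-Carlo estimate of an integral (averaged) projection-robust quantity:
# average the squared projected W_2 over random k-dimensional subspaces
# instead of taking a supremum.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def projected_w2(X, Y, U):
    """Exact W_2 between the uniform measures projected onto span(U)."""
    C = cdist(X @ U, Y @ U, "sqeuclidean")
    r, c = linear_sum_assignment(C)
    return np.sqrt(C[r, c].mean())

def iprw_estimate(X, Y, k, n_proj=100, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    vals = []
    for _ in range(n_proj):
        U, _ = np.linalg.qr(rng.standard_normal((d, k)))  # random orthonormal k-frame
        vals.append(projected_w2(X, Y, U) ** 2)
    return float(np.sqrt(np.mean(vals)))                  # (average of W_2^2)^(1/2)
```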
7. Open Problems and Future Directions
Significant open questions and directions include (Vasan, 4 Dec 2025, Jiang et al., 2022, Lin et al., 2020):
- Tightening statistical rates (removal of extraneous factors in the bounds, high-probability versions of the in-expectation results).
- Extension to general order $p$, requiring Schatten-$p$ norm estimates.
- Minimax optimality under weaker moment or heavy-tailed assumptions.
- Theory and algorithms for continuous (non-discrete) measures.
- Acceleration via Riemannian trust-region or momentum techniques.
- Tuning and automation of entropic regularization parameters.
- Extensions to barycenter computation, deep generative models, and distributed or streaming data scenarios.
- Understanding global optimality and landscape of the PRW optimization problem.
These lines of inquiry are central for further development of high-dimensional robust optimal transport and its application to modern statistical and machine learning tasks.
References:
- (Vasan, 4 Dec 2025): Vasan (2025), "Convergence rate of empirical measures in the subspace robust Wasserstein distance."
- (Paty et al., 2019): Paty and Cuturi (2019), "Subspace Robust Wasserstein Distances."
- (Jiang et al., 2022): Jiang et al. (2022), ReALM/iRBBS: Riemannian exponential augmented Lagrangian method for PRW computation.
- (Huang et al., 2020): Huang et al. (2020), Riemannian block coordinate descent (RBCD) algorithm for PRW.
- (Lin et al., 2020): Lin et al. (2020), Riemannian optimization and computational theory for PRW.
- (Lin et al., 2020): Lin and Ho (2020), "On Projection Robust Optimal Transport: Sample Complexity and Model Misspecification."