
Wasserstein Projection Estimator (WPE)

Updated 15 November 2025
  • WPE is a projection-based optimal transport technique that leverages the 2-Wasserstein distance to robustly extract lower-dimensional structures from high-dimensional distributions.
  • It employs entropy regularization and Sinkhorn iterations to reformulate the nested optimization, ensuring smooth and computationally efficient solutions.
  • Riemannian block coordinate descent algorithms significantly reduce complexity, making WPE practical for applications in robust statistics, dimensionality reduction, and large-scale data analysis.

The Wasserstein Projection Estimator (WPE) is a statistical and optimization framework that leverages the Wasserstein distance to perform robust, low-dimensional, or constraint-respecting inference on probability distributions, particularly in high-dimensional settings. WPEs have emerged in diverse formulations, from optimally projecting empirical distributions onto parametric or structural sets, to serving as the computational core of dimensionality-reducing robust statistics, to acting as a key ingredient in statistical tests and optimization tasks. The formalization and efficient computation of WPEs are vital both for overcoming the curse of dimensionality and for extending the reach of optimal transport-based methods in contemporary data science.

1. Core Definition and Formulation

Let $\mu, \nu$ be probability measures on $\mathbb{R}^d$. For $k \leq d$, let $U \in \mathrm{St}(d,k)$ be an element of the Stiefel manifold (the set of $d \times k$ real matrices with $U^\top U = I_k$). Define the orthogonal projection $P_U(x) = U U^\top x$, and denote by $P_U \# \mu$ the push-forward distribution, i.e., the law of $P_U(X)$ for $X \sim \mu$. The $k$-dimensional projection-robust Wasserstein distance (PRW) is

$$P_k(\mu,\nu) = \max_{U \in \mathrm{St}(d,k)} W_2\big(P_U \# \mu,\, P_U \# \nu\big).$$

This defines the WPE: it selects the $k$-dimensional subspace maximizing the 2-Wasserstein distance between the projected versions of $\mu$ and $\nu$. In empirical settings with finite-support measures $\mu_n = \sum_{i=1}^n r_i \delta_{x_i}$ and $\nu_n = \sum_{j=1}^n c_j \delta_{y_j}$, the discrete optimization becomes

$$P_k^2(\mu_n, \nu_n) = \max_{U \in \mathrm{St}(d,k)} \min_{\Pi \in \Pi(\mu_n, \nu_n)} \sum_{i,j} \Pi_{ij} \left\| U^\top x_i - U^\top y_j \right\|^2.$$

This is a max–min problem over matrices $U$ and transportation plans $\Pi$ under the marginal constraints $\Pi \mathbf{1} = r$ and $\Pi^\top \mathbf{1} = c$.
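
As an illustration of the inner objective, the following sketch evaluates the squared projected Wasserstein distance for a single candidate $U$, assuming uniform weights $r_i = c_j = 1/n$ so that the discrete OT problem reduces to a linear assignment problem; the function name `projected_w2_sq` and the test data are illustrative choices, not part of the cited method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def projected_w2_sq(X, Y, U):
    """Squared 2-Wasserstein distance between the projected samples U^T x_i and U^T y_j,
    assuming uniform weights r = c = 1/n (the optimal plan is then a permutation)."""
    PX, PY = X @ U, Y @ U                                    # n x k projected samples
    # Pairwise squared distances ||U^T x_i - U^T y_j||^2
    C = ((PX[:, None, :] - PY[None, :, :]) ** 2).sum(axis=-1)
    rows, cols = linear_sum_assignment(C)                    # exact OT for uniform weights
    return C[rows, cols].mean()

# Evaluating the inner objective for one candidate subspace; the outer
# maximization over U in St(d, k) is what defines the WPE.
rng = np.random.default_rng(0)
d, k, n = 20, 2, 100
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, d)) + 1.0
U, _ = np.linalg.qr(rng.normal(size=(d, k)))                 # a point on St(d, k)
print(projected_w2_sq(X, Y, U))
```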

2. Regularization and Computational Reformulation

A direct solution is computationally intensive due to the nested max–min structure. Entropy regularization, $H(\Pi) = -\sum_{ij} \left( \Pi_{ij} \log \Pi_{ij} - \Pi_{ij} \right)$, is therefore introduced, yielding the regularized problem

$$\max_{U \in \mathrm{St}(d,k)} p(U), \qquad p(U) = \min_{\Pi \in \Pi(\mu_n,\nu_n)} \sum_{i,j} \Pi_{ij} \left\| U^\top x_i - U^\top y_j \right\|^2 - \eta H(\Pi),$$

where $\eta > 0$ controls the entropic bias. The regularized formulation yields a smooth objective and efficient computation, notably enabling Sinkhorn iterations for the inner OT problem.
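
To make the inner problem concrete, here is a minimal Sinkhorn sketch for evaluating $p(U)$ given a fixed projection, where the cost matrix $C$ holds the entries $\|U^\top x_i - U^\top y_j\|^2$; the fixed iteration count and the small constant guarding the logarithm are illustrative assumptions.

```python
import numpy as np

def entropic_inner_cost(C, r, c, eta, n_iter=500):
    """Sinkhorn evaluation of p(U) = min_Pi <Pi, C> - eta * H(Pi) over plans with
    marginals Pi 1 = r and Pi^T 1 = c; C holds the projected squared costs."""
    K = np.exp(-C / eta)                        # Gibbs kernel
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = c / (K.T @ u)                       # column scaling
        u = r / (K @ v)                         # row scaling
    Pi = u[:, None] * K * v[None, :]            # entropically regularized plan
    H = -np.sum(Pi * np.log(Pi + 1e-300) - Pi)  # entropy as defined above
    return np.sum(Pi * C) - eta * H, Pi
```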

A dual reformulation introduces dual variables $u, v \in \mathbb{R}^n$ such that

$$\zeta(u,v,U)_{ij} = \exp\!\left[ -\frac{1}{\eta} \left\| U^\top x_i - U^\top y_j \right\|^2 + u_i + v_j \right]$$

and $\pi(u,v,U) = \zeta(u,v,U) / \|\zeta(u,v,U)\|_1$, with the objective function

$$g(u,v,U) = \log \|\zeta(u,v,U)\|_1 - r^\top u - c^\top v.$$

The problem is reframed as minimizing $g(u,v,U)$ jointly in $(u,v,U)$, enabling block coordinate descent strategies with closed-form Sinkhorn updates in $u$ (rows) and $v$ (columns).
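
These dual quantities translate directly into code; the sketch below evaluates $\zeta$, $\pi$, and $g$ for given $(u, v, U)$, computed densely for clarity rather than numerical stability (the exponentials can overflow for small $\eta$).

```python
import numpy as np

def dual_objective(u, v, U, X, Y, r, c, eta):
    """Evaluate zeta, the normalized plan pi, and the dual objective g(u, v, U)."""
    PX, PY = X @ U, Y @ U
    # Projected squared costs ||U^T x_i - U^T y_j||^2
    C = ((PX[:, None, :] - PY[None, :, :]) ** 2).sum(axis=-1)
    zeta = np.exp(-C / eta + u[:, None] + v[None, :])
    pi = zeta / zeta.sum()                        # pi = zeta / ||zeta||_1
    g = np.log(zeta.sum()) - r @ u - c @ v        # objective minimized over (u, v, U)
    return zeta, pi, g
```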

3. Riemannian Block Coordinate Descent Algorithm

The RBCD algorithm alternates between Sinkhorn updates for the dual variables and a Riemannian gradient step for the projection $U$, exploiting the geometry of $\mathrm{St}(d,k)$. The pseudocode is as follows (a Python sketch appears after the list):

  • Initialize $u^0, v^0, U^0$.
  • Repeat (for each iteration $t$):

    1. $u^{t+1} = u^t + \log\!\big( r \,./\, (\zeta(u^t, v^t, U^t)\,\mathbf{1}) \big)$
    2. $v^{t+1} = v^t + \log\!\big( c \,./\, (\zeta(u^{t+1}, v^t, U^t)^\top \mathbf{1}) \big)$
    3. Compute $\pi^{t+1} = \zeta(u^{t+1}, v^{t+1}, U^t)$
    4. $V_\pi = \sum_{ij} \pi_{ij} (x_i - y_j)(x_i - y_j)^\top$
    5. Euclidean gradient: $\nabla_U g = -\frac{2}{\eta} V_\pi U^t$
    6. Project $\nabla_U g$ onto the tangent space of $\mathrm{St}(d,k)$ at $U^t$ to obtain $\xi^{t+1}$
    7. Retract via QR or polar decomposition: $U^{t+1} = \mathrm{Retr}_{U^t}(-\tau \xi^{t+1})$
  • Stop when stationarity and marginal-constraint violations fall below the prescribed tolerances.
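
A compact Python sketch of this loop is given below; the step size $\tau$, the fixed iteration count, the QR-based retraction, and the normalization of $\pi$ by $\|\zeta\|_1$ are illustrative choices, and the stopping tests of the actual algorithm are omitted.

```python
import numpy as np

def rbcd(X, Y, r, c, k, eta=0.1, tau=1e-3, n_iter=200, seed=0):
    """Sketch of Riemannian block coordinate descent for the projection-robust
    Wasserstein problem; stopping tests and tolerances are omitted."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.normal(size=(d, k)))        # U^0 on St(d, k)
    u, v = np.zeros(n), np.zeros(len(c))

    def zeta(u, v, U):
        # zeta_ij = exp(-||U^T x_i - U^T y_j||^2 / eta + u_i + v_j), recomputed for clarity
        PX, PY = X @ U, Y @ U
        C = ((PX[:, None, :] - PY[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-C / eta + u[:, None] + v[None, :])

    for _ in range(n_iter):
        u = u + np.log(r / zeta(u, v, U).sum(axis=1))   # Sinkhorn row update (step 1)
        v = v + np.log(c / zeta(u, v, U).sum(axis=0))   # Sinkhorn column update (step 2)
        pi = zeta(u, v, U)
        pi = pi / pi.sum()                              # normalized as pi = zeta / ||zeta||_1
        # V_pi = sum_ij pi_ij (x_i - y_j)(x_i - y_j)^T, assembled without explicit outer products
        V = (X * pi.sum(axis=1)[:, None]).T @ X \
            + (Y * pi.sum(axis=0)[:, None]).T @ Y \
            - X.T @ pi @ Y - Y.T @ pi.T @ X
        G = -(2.0 / eta) * V @ U                        # Euclidean gradient of g in U (step 5)
        xi = G - U @ ((U.T @ G + G.T @ U) / 2.0)        # tangent-space projection (step 6)
        U, _ = np.linalg.qr(U - tau * xi)               # QR retraction of -tau * xi (step 7)
    return U, pi
```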

This ensemble yields a scalable algorithm, with per-iteration cost $O(n^2 d k + d k^2 + k^3)$ and total arithmetic complexity $O\!\left((n^2 d k + d k^2 + k^3)\,\log n\;\epsilon^{-3}\right)$ to find an $\epsilon$-stationary point.

4. Convergence Analysis and Complexity

Under mild Lipschitz-smoothness conditions, RBCD ensures sufficient decrease in $g$, and a telescoping argument shows that an $(\epsilon_1, \epsilon_2)$-stationary point is obtained in at most

$$T = O\!\left( \log n \left( \epsilon_2^{-3} + \epsilon_1^{-2} \epsilon_2^{-1} \right) \right)$$

iterations. Setting $\epsilon_1 = \epsilon_2 = \epsilon$ yields $T = O(\log n\;\epsilon^{-3})$. This is a significant improvement over the earlier RGAS method, which has complexity $O(n^2 \epsilon^{-12})$.

5. Practical Implementation Considerations

  • Sample Preparation: Given weighted data $\{x_i\}_{i=1}^n$, $\{y_j\}_{j=1}^n$, form empirical measures with weights $r_i, c_j$ (often $1/n$).
  • Regularization Parameter: Set $\eta \approx \epsilon/(C \log n)$ to keep the entropic bias within $O(\epsilon)$; $\eta \in [0.01, 1]$ is typical in practice, often tuned by cross-validation.
  • Step Size Tuning: Choose $\tau$ in accordance with the Lipschitz and retraction constants from the convergence analysis.
  • Initialization: Use a random or PCA-based $U^0$ and set $u^0 = v^0 = 0$ (see the sketch following this list).
  • Iterative Alternation: Alternate Sinkhorn updates with Riemannian gradient steps, monitoring dual feasibility and stationarity.
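
The following setup sketch collects these choices (uniform weights, a log-scaled entropic parameter, PCA-based initialization); the constant `C` and the default $\epsilon$ are illustrative assumptions rather than prescribed values.

```python
import numpy as np

def setup(X, Y, k, eps=0.1, C=10.0):
    """Prepare weights, the entropic parameter, and initial iterates as listed above."""
    n, m = X.shape[0], Y.shape[0]
    r = np.full(n, 1.0 / n)                     # uniform weights r_i = 1/n
    c = np.full(m, 1.0 / m)                     # uniform weights c_j = 1/m
    eta = eps / (C * np.log(n))                 # eta ~ eps / (C log n) keeps the bias O(eps)
    # PCA-based U^0: top-k right singular vectors of the pooled, centered data
    Z = np.vstack([X, Y])
    Z = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    U0 = Vt[:k].T                               # d x k with orthonormal columns
    u0, v0 = np.zeros(n), np.zeros(m)
    return r, c, eta, U0, u0, v0
```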

In practice, the adaptive variant RABCD, which incorporates per-coordinate learning rates similar to AdaGrad/Adam, improves empirical runtime by 10–30% over RBCD and yields a 2–4× speedup over RGAS-type algorithms.

6. Numerical Performance and Empirical Evaluation

Empirical studies on both synthetic and real datasets validate efficiency and accuracy:

  • Synthetic Fragmented Hypercube ($d$ up to 500, $n$ up to 1000): RBCD recovers the ground-truth $\mathrm{PRW}^2 = 4k^*$, matching the projection quality of RGAS while yielding 2–5× faster runtimes for $n \gg d$.
  • Gaussian Covariances on Low-rank Subspaces: RBCD maintains projection fidelity under moderate noise, matches RGAS in accuracy, and outperforms it in speed.
  • Real Data: On movie scripts and Shakespeare plays (word2vec features, $d = 300 \to k = 2$), RBCD achieves a 1.2× speedup over RGAS. On MNIST CNN features ($d = 128 \to k = 2$), RBCD again accelerates computation while preserving projection-distance quality.
  • Adaptive Variant: RABCD empirically yields a further 10–30% speedup and a 2–4× improvement over RGAS-type baselines in large-scale settings.

7. Significance and Applications

WPEs implemented via (adaptive) Riemannian block coordinate descent represent the first family of algorithms for the projection-robust Wasserstein distance with $O(\epsilon^{-3})$ complexity, substantially improving computational feasibility for large-scale, high-dimensional optimal transport tasks. The formulation is general and encompasses applications in two-sample testing, robust dimensionality reduction, and large-scale data comparison, where identifying maximally separating linear projections under the Wasserstein metric is crucial. The approach is readily extensible, well suited to GPU acceleration (owing to the intensive Sinkhorn calls), and can be parallelized over independent random initializations for global optimization. Empirical validation demonstrates that these methods retain the statistical power of projection-based OT distances while rendering computations tractable at scales unattainable by earlier approaches.

Table: Algorithm Comparison and Complexity

| Algorithm | Per-iteration cost | Iteration bound | Arithmetic complexity |
|---|---|---|---|
| RBCD | $O(n^2 d k + d k^2 + k^3)$ | $O(\log n\;\epsilon^{-3})$ | $O((n^2 d k + d k^2 + k^3)\log n\;\epsilon^{-3})$ |
| RGAS | $O(n^2)$ | $O(\epsilon^{-12})$ | $O(n^2 \epsilon^{-12})$ |

The table highlights the substantial improvement of RBCD over RGAS in polynomial dependence on $\epsilon$, with empirical speed-ups ranging from 2 to 5× and robust performance on both synthetic and real high-dimensional datasets (Huang et al., 2020).
