
Wasserstein Projection Estimator (WPE)

Updated 15 November 2025
  • WPE is a projection-based optimal transport technique that leverages the 2-Wasserstein distance to robustly extract lower-dimensional structures from high-dimensional distributions.
  • It employs entropy regularization and Sinkhorn iterations to reformulate the nested optimization, ensuring smooth and computationally efficient solutions.
  • Riemannian block coordinate descent algorithms significantly reduce complexity, making WPE practical for applications in robust statistics, dimensionality reduction, and large-scale data analysis.

The Wasserstein Projection Estimator (WPE) is a statistical and optimization framework that leverages the Wasserstein distance to perform robust, low-dimensional, or constraint-respecting inference on probability distributions, particularly in high-dimensional settings. WPEs have emerged in diverse formulations, from optimally projecting empirical distributions onto parametric or structural sets, to serving as the computational core of dimensionality-reducing robust statistics, to acting as a key ingredient in statistical tests and optimization tasks. The formalization and efficient computation of WPEs are vital both for overcoming the curse of dimensionality and for extending the reach of optimal transport-based methods in contemporary data science.

1. Core Definition and Formulation

Let $\mu, \nu$ be probability measures on $\mathbb{R}^d$. For $k \leq d$, let $U \in \mathrm{St}(d,k)$ be an element of the Stiefel manifold (the set of $d \times k$ real matrices with $U^\top U = I_k$). Define the orthogonal projection $P_U(x) = U U^\top x$, and denote by $P_U \# \mu$ the push-forward distribution, i.e., the law of $P_U(X)$ for $X \sim \mu$. The $k$-dimensional projection-robust Wasserstein distance (PRW) is

$$P_k(\mu,\nu) = \max_{U \in \mathrm{St}(d,k)} W_2\big(P_U \# \mu,\, P_U \# \nu\big).$$

This defines the WPE: it selects the $k$-dimensional subspace maximizing the 2-Wasserstein distance between the projected versions of $\mu$ and $\nu$. In empirical settings with finite-support measures $\mu_n = \sum_{i=1}^n r_i \delta_{x_i}$ and $\nu_n = \sum_{j=1}^n c_j \delta_{y_j}$, the discrete optimization becomes

$$P_k^2(\mu_n, \nu_n) = \max_{U \in \mathrm{St}(d,k)} \min_{\Pi \in \Pi(\mu_n, \nu_n)} \sum_{i,j} \Pi_{ij} \left\| U^\top x_i - U^\top y_j \right\|^2.$$

This is a max–min problem over matrices $U$ and transportation plans $\Pi$ under the marginal constraints $\Pi \mathbf{1} = r$ and $\Pi^\top \mathbf{1} = c$.
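
As an illustration of the inner objective, the following sketch evaluates the squared projected Wasserstein distance for a single candidate $U$, assuming uniform weights $r_i = c_j = 1/n$ so that the discrete OT problem reduces to a linear assignment problem; the function name `projected_w2_sq` and the test data are illustrative choices, not part of the cited method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def projected_w2_sq(X, Y, U):
    """Squared 2-Wasserstein distance between the projected samples U^T x_i and U^T y_j,
    assuming uniform weights r = c = 1/n (the optimal plan is then a permutation)."""
    PX, PY = X @ U, Y @ U                                    # n x k projected samples
    # Pairwise squared distances ||U^T x_i - U^T y_j||^2
    C = ((PX[:, None, :] - PY[None, :, :]) ** 2).sum(axis=-1)
    rows, cols = linear_sum_assignment(C)                    # exact OT for uniform weights
    return C[rows, cols].mean()

# Evaluating the inner objective for one candidate subspace; the outer
# maximization over U in St(d, k) is what defines the WPE.
rng = np.random.default_rng(0)
d, k, n = 20, 2, 100
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, d)) + 1.0
U, _ = np.linalg.qr(rng.normal(size=(d, k)))                 # a point on St(d, k)
print(projected_w2_sq(X, Y, U))
```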

2. Regularization and Computational Reformulation

A direct solution is computationally intensive due to the nested max–min structure. Entropy regularization, $H(\Pi) = -\sum_{ij} \left( \Pi_{ij} \log \Pi_{ij} - \Pi_{ij} \right)$, is therefore introduced, yielding the regularized problem

$$\max_{U \in \mathrm{St}(d,k)} p(U), \qquad p(U) = \min_{\Pi \in \Pi(\mu_n,\nu_n)} \sum_{i,j} \Pi_{ij} \left\| U^\top x_i - U^\top y_j \right\|^2 - \eta H(\Pi),$$

where $\eta > 0$ controls the entropic bias. The regularized formulation yields a smooth objective and efficient computation, notably enabling Sinkhorn iterations for the inner OT problem.
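
To make the inner problem concrete, here is a minimal Sinkhorn sketch for evaluating $p(U)$ given a fixed projection, where the cost matrix $C$ holds the entries $\|U^\top x_i - U^\top y_j\|^2$; the fixed iteration count and the small constant guarding the logarithm are illustrative assumptions.

```python
import numpy as np

def entropic_inner_cost(C, r, c, eta, n_iter=500):
    """Sinkhorn evaluation of p(U) = min_Pi <Pi, C> - eta * H(Pi) over plans with
    marginals Pi 1 = r and Pi^T 1 = c; C holds the projected squared costs."""
    K = np.exp(-C / eta)                        # Gibbs kernel
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = c / (K.T @ u)                       # column scaling
        u = r / (K @ v)                         # row scaling
    Pi = u[:, None] * K * v[None, :]            # entropically regularized plan
    H = -np.sum(Pi * np.log(Pi + 1e-300) - Pi)  # entropy as defined above
    return np.sum(Pi * C) - eta * H, Pi
```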

A dual reformulation introduces dual variables $u, v \in \mathbb{R}^n$ such that

$$\zeta(u,v,U)_{ij} = \exp\!\left[ -\frac{1}{\eta} \left\| U^\top x_i - U^\top y_j \right\|^2 + u_i + v_j \right]$$

and $\pi(u,v,U) = \zeta(u,v,U) / \|\zeta(u,v,U)\|_1$, with the objective function

$$g(u,v,U) = \log \|\zeta(u,v,U)\|_1 - r^\top u - c^\top v.$$

The problem is reframed as minimizing $g(u,v,U)$ jointly in $(u,v,U)$, enabling block coordinate descent strategies with closed-form Sinkhorn updates in $u$ (rows) and $v$ (columns).
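
These dual quantities translate directly into code; the sketch below evaluates $\zeta$, $\pi$, and $g$ for given $(u, v, U)$, computed densely for clarity rather than numerical stability (the exponentials can overflow for small $\eta$).

```python
import numpy as np

def dual_objective(u, v, U, X, Y, r, c, eta):
    """Evaluate zeta, the normalized plan pi, and the dual objective g(u, v, U)."""
    PX, PY = X @ U, Y @ U
    # Projected squared costs ||U^T x_i - U^T y_j||^2
    C = ((PX[:, None, :] - PY[None, :, :]) ** 2).sum(axis=-1)
    zeta = np.exp(-C / eta + u[:, None] + v[None, :])
    pi = zeta / zeta.sum()                        # pi = zeta / ||zeta||_1
    g = np.log(zeta.sum()) - r @ u - c @ v        # objective minimized over (u, v, U)
    return zeta, pi, g
```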

3. Riemannian Block Coordinate Descent Algorithm

The RBCD algorithm alternates between Sinkhorn updates for the dual variables and a Riemannian gradient step for the projection $U$, exploiting the geometry of $\mathrm{St}(d,k)$. The pseudocode is as follows (a Python sketch appears after the list):

  • Initialize $u^0, v^0, U^0$.
  • Repeat (for each iteration $t$):

    1. $u^{t+1} = u^t + \log\!\big( r \,./\, (\zeta(u^t, v^t, U^t)\,\mathbf{1}) \big)$
    2. $v^{t+1} = v^t + \log\!\big( c \,./\, (\zeta(u^{t+1}, v^t, U^t)^\top \mathbf{1}) \big)$
    3. Compute $\pi^{t+1} = \zeta(u^{t+1}, v^{t+1}, U^t)$
    4. $V_\pi = \sum_{ij} \pi_{ij} (x_i - y_j)(x_i - y_j)^\top$
    5. Euclidean gradient: $\nabla_U g = -\frac{2}{\eta} V_\pi U^t$
    6. Project $\nabla_U g$ onto the tangent space of $\mathrm{St}(d,k)$ at $U^t$ to obtain $\xi^{t+1}$
    7. Retract via QR or polar decomposition: $U^{t+1} = \mathrm{Retr}_{U^t}(-\tau \xi^{t+1})$
  • Stop when stationarity and marginal-constraint violations fall below the prescribed tolerances.
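
A compact Python sketch of this loop is given below; the step size $\tau$, the fixed iteration count, the QR-based retraction, and the normalization of $\pi$ by $\|\zeta\|_1$ are illustrative choices, and the stopping tests of the actual algorithm are omitted.

```python
import numpy as np

def rbcd(X, Y, r, c, k, eta=0.1, tau=1e-3, n_iter=200, seed=0):
    """Sketch of Riemannian block coordinate descent for the projection-robust
    Wasserstein problem; stopping tests and tolerances are omitted."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.normal(size=(d, k)))        # U^0 on St(d, k)
    u, v = np.zeros(n), np.zeros(len(c))

    def zeta(u, v, U):
        # zeta_ij = exp(-||U^T x_i - U^T y_j||^2 / eta + u_i + v_j), recomputed for clarity
        PX, PY = X @ U, Y @ U
        C = ((PX[:, None, :] - PY[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-C / eta + u[:, None] + v[None, :])

    for _ in range(n_iter):
        u = u + np.log(r / zeta(u, v, U).sum(axis=1))   # Sinkhorn row update (step 1)
        v = v + np.log(c / zeta(u, v, U).sum(axis=0))   # Sinkhorn column update (step 2)
        pi = zeta(u, v, U)
        pi = pi / pi.sum()                              # normalized as pi = zeta / ||zeta||_1
        # V_pi = sum_ij pi_ij (x_i - y_j)(x_i - y_j)^T, assembled without explicit outer products
        V = (X * pi.sum(axis=1)[:, None]).T @ X \
            + (Y * pi.sum(axis=0)[:, None]).T @ Y \
            - X.T @ pi @ Y - Y.T @ pi.T @ X
        G = -(2.0 / eta) * V @ U                        # Euclidean gradient of g in U (step 5)
        xi = G - U @ ((U.T @ G + G.T @ U) / 2.0)        # tangent-space projection (step 6)
        U, _ = np.linalg.qr(U - tau * xi)               # QR retraction of -tau * xi (step 7)
    return U, pi
```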

This ensemble yields a scalable algorithm, with per-iteration cost $O(n^2 d k + d k^2 + k^3)$ and total arithmetic complexity $O\!\left((n^2 d k + d k^2 + k^3)\,\log n\;\epsilon^{-3}\right)$ to find an $\epsilon$-stationary point.

4. Convergence Analysis and Complexity

Under mild Lipschitz-smoothness conditions, RBCD ensures sufficient decrease in $g$, and a telescoping argument shows that an $(\epsilon_1, \epsilon_2)$-stationary point is obtained in at most

$$T = O\!\left( \log n \left( \epsilon_2^{-3} + \epsilon_1^{-2} \epsilon_2^{-1} \right) \right)$$

iterations. Setting $\epsilon_1 = \epsilon_2 = \epsilon$ yields $T = O(\log n\;\epsilon^{-3})$. This is a significant improvement over the earlier RGAS method, which has complexity $O(n^2 \epsilon^{-12})$.

5. Practical Implementation Considerations

  • Sample Preparation: Given weighted data $\{x_i\}_{i=1}^n$, $\{y_j\}_{j=1}^n$, form empirical measures with weights $r_i, c_j$ (often $1/n$).
  • Regularization Parameter: Set $\eta \approx \epsilon/(C \log n)$ to keep the entropic bias within $O(\epsilon)$; $\eta \in [0.01, 1]$ is typical in practice, often tuned by cross-validation.
  • Step Size Tuning: Choose $\tau$ in accordance with the Lipschitz and retraction constants from the convergence analysis.
  • Initialization: Use a random or PCA-based $U^0$ and set $u^0 = v^0 = 0$ (see the sketch following this list).
  • Iterative Alternation: Alternate Sinkhorn updates with Riemannian gradient steps, monitoring dual feasibility and stationarity.
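
The following setup sketch collects these choices (uniform weights, a log-scaled entropic parameter, PCA-based initialization); the constant `C` and the default $\epsilon$ are illustrative assumptions rather than prescribed values.

```python
import numpy as np

def setup(X, Y, k, eps=0.1, C=10.0):
    """Prepare weights, the entropic parameter, and initial iterates as listed above."""
    n, m = X.shape[0], Y.shape[0]
    r = np.full(n, 1.0 / n)                     # uniform weights r_i = 1/n
    c = np.full(m, 1.0 / m)                     # uniform weights c_j = 1/m
    eta = eps / (C * np.log(n))                 # eta ~ eps / (C log n) keeps the bias O(eps)
    # PCA-based U^0: top-k right singular vectors of the pooled, centered data
    Z = np.vstack([X, Y])
    Z = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    U0 = Vt[:k].T                               # d x k with orthonormal columns
    u0, v0 = np.zeros(n), np.zeros(m)
    return r, c, eta, U0, u0, v0
```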

In practice, the adaptive variant RABCD, which incorporates per-coordinate learning rates similar to AdaGrad/Adam, improves empirical runtime by 10–30% over RBCD and yields a 2–4× speedup over RGAS-type algorithms.

6. Numerical Performance and Empirical Evaluation

Empirical studies on both synthetic and real datasets validate efficiency and accuracy:

  • Synthetic Fragmented Hypercube ($d$ up to 500, $n$ up to 1000): RBCD recovers the ground-truth $\mathrm{PRW}^2 = 4k^*$, matching the projection quality of RGAS while yielding 2–5× faster runtimes for $n \gg d$.
  • Gaussian Covariances on Low-rank Subspaces: RBCD maintains projection fidelity under moderate noise, matches RGAS in accuracy, and outperforms it in speed.
  • Real Data: On movie scripts and Shakespeare plays (word2vec features, $d = 300 \to k = 2$), RBCD achieves a 1.2× speedup over RGAS. On MNIST CNN features ($d = 128 \to k = 2$), RBCD again accelerates computation while preserving projection-distance quality.
  • Adaptive Variant: RABCD empirically yields a further 10–30% speedup and a 2–4× improvement over RGAS-type baselines in large-scale settings.

7. Significance and Applications

WPEs implemented via (adaptive) Riemannian block coordinate descent represent the first family of algorithms for the projection-robust Wasserstein distance with $O(\epsilon^{-3})$ complexity, substantially improving computational feasibility for large-scale, high-dimensional optimal transport tasks. The formulation is general and encompasses applications in two-sample testing, robust dimensionality reduction, and large-scale data comparison, where identifying maximally separating linear projections under the Wasserstein metric is crucial. The approach is readily extensible, well suited to GPU acceleration (owing to the intensive Sinkhorn calls), and can be parallelized over independent random initializations for global optimization. Empirical validation demonstrates that these methods retain the statistical power of projection-based OT distances while rendering computations tractable at scales unattainable by earlier approaches.

Table: Algorithm Comparison and Complexity

| Algorithm | Per-iteration cost | Iteration bound | Arithmetic complexity |
|---|---|---|---|
| RBCD | $O(n^2 d k + d k^2 + k^3)$ | $O(\log n\;\epsilon^{-3})$ | $O((n^2 d k + d k^2 + k^3)\log n\;\epsilon^{-3})$ |
| RGAS | $O(n^2)$ | $O(\epsilon^{-12})$ | $O(n^2 \epsilon^{-12})$ |

The table highlights the substantial improvement of RBCD over RGAS in polynomial dependence on $\epsilon$, with empirical speed-ups ranging from 2 to 5× and robust performance on both synthetic and real high-dimensional datasets (Huang et al., 2020).
