Wasserstein Regularization in ML

Updated 22 May 2026

Wasserstein regularization integrates optimal transport metrics into learning objectives to impose geometric structure and robustness on learned models.
It employs dual formulations, entropic smoothing, and gradient penalties to achieve computational efficiency and stable optimization.
The approach finds practical applications in generative modeling, robust regression, and reinforcement learning, offering finite-sample guarantees and improved performance.

Wasserstein regularization refers to a family of regularization techniques that incorporate optimal transport metrics—primarily the Wasserstein (Earth-Mover) distance—into variational objectives in statistics, machine learning, and related applied mathematics fields. Its principal role is to impose geometric structure, robustness, or smoothness on learned models, estimators, or embeddings by leveraging the metric properties of Wasserstein distances between distributions, empirical measures, or prediction outputs. The wide applicability of Wasserstein-based regularization spans generative modeling, distributionally robust optimization, sparse multi-task regression, optimal transport estimation, and other statistical inference domains.

1. Mathematical Formulations of Wasserstein Regularization

The core mathematical object is the $p$-Wasserstein distance between probability distributions on a metric space $(\mathcal{X}, d)$, defined for $P, Q \in \mathcal{P}(\mathcal{X})$ as

$W_p^p(P, Q) = \inf_{\pi \in \Pi(P, Q)} \int_{\mathcal{X} \times \mathcal{X}} d(x, y)^p \, d\pi(x, y),$

where $\Pi(P, Q)$ is the set of couplings with $P$ and $Q$ as marginals. For $p \in \{1, 2\}$, the ground metric $d(\cdot,\cdot)$ typically encodes meaningful geometry (e.g., Euclidean or semantic/cost structure).
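In one dimension the infimum over couplings has a simple closed form for empirical measures, which makes the definition concrete. The following is a minimal sketch, assuming two equal-size samples with uniform weights; the function name `wasserstein_1d` is illustrative and not taken from any cited work.

```python
def wasserstein_1d(xs, ys, p=1):
    """p-Wasserstein distance between two equal-size 1-D empirical samples.

    Assumes uniform weights and len(xs) == len(ys). In one dimension the
    optimal coupling matches the sorted samples in order, so no linear
    program is needed.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles equal sample sizes")
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    # Average transport cost d(x, y)^p under the sorted (monotone) coupling.
    cost = sum(abs(x - y) ** p for x, y in zip(xs_sorted, ys_sorted)) / len(xs)
    return cost ** (1.0 / p)


# Every point of the first sample is one unit away from its sorted match.
shifted = wasserstein_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0], p=1)
```

In higher dimensions no such shortcut exists in general, which is one reason the relaxations discussed later in the article matter.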

Wasserstein regularization arises in multiple guises:

Direct regularization of functionals: Adding a penalty involving Wasserstein distances to model loss functions, e.g., $\mathcal{L}(\theta) + \lambda W_p(P_{\theta}, Q)$ for a parametric distribution $P_{\theta}$.
Distributionally robust optimization (DRO): Replacing the empirical risk $\mathbb{E}_{\hat{P}_n}[\ell(\theta; Z)]$ by a worst-case expectation over a Wasserstein ball, i.e., $\sup_{Q:\, W_p(Q, \hat{P}_n) \le \rho} \mathbb{E}_{Q}[\ell(\theta; Z)]$ (Wu et al., 2022, Gao et al., 2017, Farokhi, 2020).
Gradient-based penalties: Explicitly penalizing the gradient of a function relative to the input, e.g., enforcing or relaxing the Lipschitz constraint in WGAN critics by penalizing deviations of $\|\nabla_x f(x)\|$ from $1$ (Petzka et al., 2017).
Adversarial and interpolation-based regularization: Regularizing classifiers by maximizing the Wasserstein distance between outputs for adversarially perturbed or interpolated inputs (Fatras et al., 2019, Lin et al., 2019).

2. Algorithmic Implementations and Duality

Efficient algorithmic realizations rely critically on the development of tractable relaxations and dual formulations:

Entropic regularization (Sinkhorn): Adding entropy (Kullback-Leibler divergence) to optimal transport yields a smoothed, strictly convex problem solvable via Sinkhorn scaling, with objective

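$W_{p,\varepsilon}^p(P, Q) = \inf_{\pi \in \Pi(P, Q)} \int_{\mathcal{X} \times \mathcal{X}} d(x, y)^p \, d\pi(x, y) + \varepsilon \, \mathrm{KL}(\pi \,\|\, P \otimes Q)$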

(Bigot et al., 2022, Quang, 2020, Ballu et al., 2020). This is pivotal for large-scale applications and enables automatic differentiation.
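For discrete measures, Sinkhorn scaling alternates between rescaling the rows and columns of a Gibbs kernel until both marginals match. The following is a minimal sketch using only the Python standard library; the function name `sinkhorn_plan`, the symbol `eps` for the entropic weight, and the toy histograms are assumptions made for illustration, and production code would normally rely on a dedicated optimal transport library.

```python
import math


def sinkhorn_plan(a, b, cost, eps, n_iters=500):
    """Entropic optimal transport plan between histograms a and b.

    a, b : lists of positive weights, each summing to 1
    cost : nested list, cost[i][j] = ground cost between bin i and bin j
    eps  : entropic regularization weight (larger means smoother plans)
    """
    n, m = len(a), len(b)
    # Gibbs kernel K = exp(-C / eps).
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Rescale rows to match a, then columns to match b.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # The plan is diag(u) K diag(v).
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


points = [0.0, 1.0, 2.0]
cost = [[(x - y) ** 2 for y in points] for x in points]
a = [0.5, 0.3, 0.2]
b = [0.2, 0.3, 0.5]
plan = sinkhorn_plan(a, b, cost, eps=0.5)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(3)) for j in range(3)]
```

Every step is a smooth function of the inputs, which is what makes the iteration compatible with automatic differentiation frameworks.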

Fenchel and strong duality: Regularized Wasserstein objectives admit explicit dual formulations (e.g., via Fenchel-Legendre conjugacy) that reduce computation to (semi-)infinite convex programs in potential functions; for entropic penalization, duals collapse to log-sum-exp kernels (Azizian et al., 2022).
Stochastic optimization: Unbiased stochastic gradients can be obtained directly from samples using dual representations, enabling scalable estimators and barycenter computations (e.g., sublinear per-iteration cost in dimension) (Ballu et al., 2020).
Gradient penalties: In WGANs, adding one-sided or two-sided penalties on the critic's gradient norm turns the hard Lipschitz constraint of optimal transport duality into a smooth, trainable loss term (Petzka et al., 2017).
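The difference between the two-sided and one-sided gradient penalties can be seen on a toy critic whose input gradient is available in closed form. The sketch below is illustrative only: the critic $f(x) = \sum_k v_k \tanh(w_k \cdot x)$, the function names, and the evaluation points are assumptions, and a real WGAN would obtain gradients by automatic differentiation at points sampled between real and generated data.

```python
import math


def critic_grad(x, weights, out):
    """Input gradient of the toy critic f(x) = sum_k out[k] * tanh(weights[k] . x)."""
    grad = [0.0] * len(x)
    for w_k, v_k in zip(weights, out):
        pre = sum(wi * xi for wi, xi in zip(w_k, x))
        scale = v_k * (1.0 - math.tanh(pre) ** 2)  # derivative of tanh
        for d in range(len(x)):
            grad[d] += scale * w_k[d]
    return grad


def gradient_penalties(points, weights, out):
    """Return (two_sided, one_sided) penalties averaged over the points.

    two_sided penalizes any deviation of the gradient norm from 1;
    one_sided only penalizes norms above 1.
    """
    two_sided = one_sided = 0.0
    for x in points:
        norm = math.sqrt(sum(g * g for g in critic_grad(x, weights, out)))
        two_sided += (norm - 1.0) ** 2
        one_sided += max(0.0, norm - 1.0) ** 2
    return two_sided / len(points), one_sided / len(points)


weights = [[1.0, 0.0], [0.0, 1.0]]
flat = gradient_penalties([[0.0, 0.0]], weights, [0.5, 0.5])   # gradient norm below 1
steep = gradient_penalties([[0.0, 0.0]], weights, [3.0, 0.0])  # gradient norm equal to 3
```

For the flat critic only the two-sided penalty is active, while for the steep critic both penalties coincide; the one-sided form therefore relaxes the constraint rather than forcing unit gradient norm everywhere.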

3. Statistical and Robustness Properties

Wasserstein regularization introduces a data-geometry-informed penalty that controls sensitivity to distributional perturbations and enforces robustness:

Regularization equivalence: Under appropriate loss functions, Wasserstein DRO objectives are exactly equivalent to empirical risk minimization with an explicit norm penalty (ridge, lasso, or other): e.g.,

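$\sup_{Q:\, W_p(Q, \hat{P}_n) \le \rho} \mathbb{E}_{Q}[\ell(\theta; Z)] = \mathbb{E}_{\hat{P}_n}[\ell(\theta; Z)] + \rho \, \|\theta\|_{*},$

where the dual norm $\|\cdot\|_{*}$ arises from the ground cost (Wu et al., 2022, Farokhi, 2020).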
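A small numerical check can make this equivalence tangible in one well-known special case: absolute-error loss, a linear predictor, and a Euclidean ground cost on the features only, so that the dual norm is again the Euclidean norm. The sketch below is written under those assumptions; the symbols `theta` and `rho`, the helper names, and the toy data are all illustrative. It builds an explicit perturbation that moves every feature vector by exactly `rho` and shows that the perturbed risk attains the penalized objective.

```python
import math


def empirical_risk(theta, xs, ys):
    """Mean absolute error of the linear predictor x -> theta . x."""
    return sum(abs(y - sum(t * xi for t, xi in zip(theta, x)))
               for x, y in zip(xs, ys)) / len(xs)


def worst_case_shift(theta, xs, ys, rho):
    """Move each feature vector by Euclidean distance rho along +/- theta.

    The direction is chosen to enlarge each residual, so the shifted
    empirical measure lies within 1-Wasserstein distance rho of the data.
    """
    norm = math.sqrt(sum(t * t for t in theta))
    shifted = []
    for x, y in zip(xs, ys):
        residual = y - sum(t * xi for t, xi in zip(theta, x))
        sign = 1.0 if residual >= 0 else -1.0
        shifted.append([xi - sign * rho * t / norm for xi, t in zip(x, theta)])
    return shifted


theta = [3.0, 4.0]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, 5.0, 7.0]
rho = 0.1

# Norm-penalized empirical risk: risk + rho * ||theta||_2.
penalized = empirical_risk(theta, xs, ys) + rho * math.hypot(*theta)
# Risk under the explicit adversarial perturbation of the features.
attacked = empirical_risk(theta, worst_case_shift(theta, xs, ys, rho), ys)
```

The two quantities agree here, which shows the penalized objective is attained inside the ball; that no distribution in the ball can do worse follows from the Lipschitz continuity of the loss in the features.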

Generalization guarantees: Non-asymptotic high-dimensional generalization bounds decouple dimension dependence, with rates $O(n^{-1/2})$ or better for judicious choices of radius and regularization (Wu et al., 2022).
Variation regularization: The induced penalty can generalize total variation, Lipschitz, and $L^q$-gradient regularizers, capturing both local and global sensitivity of models to input changes, even on non-Euclidean spaces (Gao et al., 2017). The formal Wasserstein regularizer,

$\mathcal{R}_{\rho}(f) = \sup_{Q:\, W_p(Q, P) \le \rho} \mathbb{E}_{Q}[f] - \mathbb{E}_{P}[f],$

admits a leading-order expansion as $\mathcal{R}_{\rho}(f) = \rho \, \mathcal{V}(f) + o(\rho)$, where $\mathcal{V}(f)$ denotes a metric slope or variation functional.

Distributional robustness: The worst-case expectation over the Wasserstein ambiguity set upper-bounds expected loss under data poisoning or adversarial perturbations, with the penalty term precisely controlling sensitivity quantified by Lipschitz constants (Farokhi, 2020).
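
One way to see why Lipschitz constants control the worst-case expectation is the $p = 1$ case, where Kantorovich-Rubinstein duality gives $W_1(P, Q) = \sup_{\mathrm{Lip}(f) \le 1} \mathbb{E}_{Q}[f] - \mathbb{E}_{P}[f]$. The short derivation below is a sketch under that assumption; the symbols $\rho$ (ball radius) and $\mathrm{Lip}(\ell)$ (Lipschitz constant of the loss) are notation chosen here rather than taken from the cited works.

$\sup_{Q:\, W_1(Q, P) \le \rho} \mathbb{E}_{Q}[\ell] - \mathbb{E}_{P}[\ell] \;\le\; \sup_{Q:\, W_1(Q, P) \le \rho} \mathrm{Lip}(\ell) \, W_1(Q, P) \;\le\; \rho \, \mathrm{Lip}(\ell).$

The first inequality holds because $\ell / \mathrm{Lip}(\ell)$ is 1-Lipschitz and is therefore admissible in the dual formulation; the second uses only the radius constraint. The excess risk under any perturbation within the ball is thus at most the radius times the Lipschitz constant of the loss.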

4. Applications in Machine Learning and Statistics

Wasserstein regularization has been systematically deployed in diverse empirical settings:

Generative adversarial networks (GANs): Regularization alternatives to weight clipping for enforcing the Lipschitz property on the critic in WGANs, leading to improved convergence and sample quality—WGAN-GP and WGAN-LP variants (Petzka et al., 2017).
Offline reinforcement learning: Policy regularization using the squared Wasserstein distance between policy distributions (often via learned optimal transport maps or input-convex neural networks), conferring enhanced stability over $f$-divergence regularizers (Omura et al., 2025).
Sparse high-dimensional regression: Regularizers based on unbalanced entropic optimal transport between task-specific coefficient supports, accounting for geometric information in spatially structured regression (Janati et al., 2018).
Robust autoencoders and deep embedding: Wasserstein distances on latent codes (e.g., for graph autoencoders) enforce distributional matching with richer gradient structure than Kullback-Leibler divergence (Liang et al., 2021).
Robust classification with noisy labels: Class-geometry-aware Wasserstein adversarial regularizers outperform isotropic regularizers in settings with semantic or structured label noise (Fatras et al., 2019).
Dictionary learning in Wasserstein space: Sparse geometric regularizers based on Wasserstein distances between atoms and data measures promote local, interpretable, and unique encoding (Mueller et al., 2022).
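
To illustrate what a squared-Wasserstein penalty on policies looks like in the simplest possible setting, the sketch below uses the closed form of $W_2^2$ between Gaussians with diagonal covariance. This is a deliberately swapped-in simplification: the work cited above relies on learned transport maps, whereas the Gaussian policies, the function names, and the weight `weight` here are assumptions made purely for illustration.

```python
def squared_w2_diag_gaussians(mean_p, std_p, mean_q, std_q):
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    For N(mean_p, diag(std_p^2)) and N(mean_q, diag(std_q^2)) the distance
    reduces to squared differences of means plus squared differences of
    standard deviations.
    """
    mean_term = sum((mp - mq) ** 2 for mp, mq in zip(mean_p, mean_q))
    std_term = sum((sp - sq) ** 2 for sp, sq in zip(std_p, std_q))
    return mean_term + std_term


def regularized_policy_loss(task_loss, policy, behavior, weight):
    """Task loss plus a squared-W2 penalty pulling the policy toward the behavior policy.

    policy and behavior are (mean, std) pairs describing Gaussian action
    distributions at a single state.
    """
    return task_loss + weight * squared_w2_diag_gaussians(*policy, *behavior)


policy = ([0.0, 0.0], [1.0, 1.0])
behavior = ([3.0, 4.0], [1.0, 1.0])
loss = regularized_policy_loss(1.0, policy, behavior, weight=0.1)
```

Unlike a Kullback-Leibler penalty, this quantity stays finite and informative even when one of the standard deviations shrinks toward zero, which hints at the stability benefit described above.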

5. Entropic and Fisher-Type Regularization

Entropic regularization is central to algorithmic tractability and statistical properties:

Computational acceleration: Entropic smoothing enables the use of fast Sinkhorn solvers, reducing algorithmic complexity from roughly $O(n^3 \log n)$ for exact linear-programming solvers to $O(n^2)$ per Sinkhorn iteration for Wasserstein estimators at controlled statistical risk, particularly in moderate to high dimension (Bigot et al., 2022).
Smoothing and bias-variance tradeoff: The bias introduced by entropic penalties vanishes as the regularization weight $\varepsilon \to 0$ and can be tuned to fall below the statistical error (Bigot et al., 2022, Quang, 2020).
Diffusion-type regularization: Penalizing the Fisher information (as in Schrödinger bridges) further controls concentration and smoothness in Wasserstein gradient flows, improving convexity and ensuring positivity of solutions (Li et al., 2019, Lin et al., 2019).
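
The bias induced by entropic smoothing can be observed directly on a small one-dimensional problem, where the exact transport cost is available from sorted matching. The sketch below is an illustration under stated assumptions: equal-size samples with uniform weights, squared-distance cost, and a plain Sinkhorn loop; the function name `sinkhorn_cost` and the toy samples are hypothetical.

```python
import math


def sinkhorn_cost(xs, ys, eps, n_iters=2000):
    """Transport cost <plan, C> of the entropic plan between uniform samples.

    Assumes len(xs) == len(ys) and squared-distance ground cost.
    """
    n = len(xs)
    cost = [[(x - y) ** 2 for y in ys] for x in xs]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * n
    for _ in range(n_iters):
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [(1.0 / n) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(n)]
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(n))


xs = [i / 5 for i in range(6)]
ys = [x ** 2 for x in xs]
# Both samples are sorted, so matching them in order is optimal in 1-D.
exact = sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Gap between the entropic transport cost and the exact cost for each eps.
gaps = [sinkhorn_cost(xs, ys, eps) - exact for eps in (1.0, 0.3, 0.1)]
```

The gap shrinks as `eps` decreases, at the price of slower and less stable Sinkhorn iterations, which is the tradeoff that tuning the regularization weight has to balance.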

6. Practical Considerations and Hyperparameter Selection

Implementation of Wasserstein regularization requires careful attention to penalty tuning, numerics, and geometric encoding:

Hyperparameter regimes: Regularization parameters (e.g., entropic $\varepsilon$, radius $\rho$, penalty weights $\lambda$, and mass-relaxations $\gamma$) must balance approximation quality, computational tractability, and sample efficiency (Petzka et al., 2017, Bigot et al., 2022, Janati et al., 2018).
Entropy-vs-robustness tradeoff: In convex reformulations (e.g., robust market making), entropy regularization smooths stochastic policies while the Wasserstein radius directly controls robustness to model misspecification (Fang et al., 2025).
Ground metric design: The choice of cost matrix $C$ (e.g., embedding distance for tokens, regressor geometry, semantic similarity) determines the geometric selectivity and effectiveness of regularization (Fatras et al., 2019, Mueller et al., 2022).
Stochastic / parallel implementation: Sinkhorn iterations, stochastic dual ascent, and GPU batch processing enable practical scaling to high-dimensional or large-sample regimes (Ballu et al., 2020, Janati et al., 2018).
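
On the numerical side, a common concern is that the Gibbs kernel $\exp(-C/\varepsilon)$ underflows when costs are large relative to the entropic weight. One standard remedy is to run the Sinkhorn updates on dual potentials in the log domain. The following is a minimal sketch of that idea using only the standard library; the function names and the toy problem are assumptions, and it is not presented as the implementation used in any cited work.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def sinkhorn_log(a, b, cost, eps, n_iters=200):
    """Entropic OT plan computed from dual potentials f, g in the log domain."""
    n, m = len(a), len(b)
    f, g = [0.0] * n, [0.0] * m
    for _ in range(n_iters):
        # Each update enforces one marginal constraint exactly.
        f = [eps * (math.log(a[i])
                    - logsumexp([(g[j] - cost[i][j]) / eps for j in range(m)]))
             for i in range(n)]
        g = [eps * (math.log(b[j])
                    - logsumexp([(f[i] - cost[i][j]) / eps for i in range(n)]))
             for j in range(m)]
    return [[math.exp((f[i] + g[j] - cost[i][j]) / eps) for j in range(m)]
            for i in range(n)]


# All costs exceed 800, so exp(-cost / eps) underflows to 0.0 in double
# precision and a kernel-based Sinkhorn loop would divide by zero.
xs, ys = [0.0, 1.0], [30.0, 31.0]
cost = [[(x - y) ** 2 for y in ys] for x in xs]
plan = sinkhorn_log([0.5, 0.5], [0.5, 0.5], cost, eps=1.0)
```

The same updates vectorize naturally, so the log-domain form is also the one typically batched on GPUs when many transport problems are solved in parallel.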

7. Theoretical and Practical Implications

Wasserstein regularization constitutes a flexible and principled toolkit for modern statistical learning:

Unified view of regularization: It generalizes and subsumes classical $\ell_1$, $\ell_2$, total variation, and gradient penalties, adapting naturally to data geometry and model architecture (Gao et al., 2017, Wu et al., 2022).
Model interpretability and identifiability: In dictionary and regression contexts, Wasserstein-based penalties yield sparse, localized, and interpretable representations, and can often resolve non-uniqueness via locality-promoting mechanisms (Mueller et al., 2022, Janati et al., 2018).
Adversarial and distributional robustness: Wasserstein regularization guarantees finite-sample optimality bounds, breaks the curse of dimensionality in certain settings, and confers robustness under data shift, label noise, or adversarial attacks (Wu et al., 2022, Farokhi, 2020, Fatras et al., 2019).
Algorithmic tractability: Entropic and stochastic approximations render large-scale Wasserstein-regularized objectives computationally manageable without sacrificing statistical guarantees (Bigot et al., 2022, Ballu et al., 2020).

In summary, Wasserstein regularization anchors a spectrum of robust and geometrically aware regularization methods, linked by the fundamental properties of optimal transport and the mathematical structure of the Wasserstein metric. Its integration into learning and inference pipelines is enabled by duality, entropic relaxation, and scalable optimization strategies; its statistical efficacy rests on finite-sample guarantees and close connections to classical regularization theory. This foundational role has led to adoption across generative modeling, robust statistics, high-dimensional regression, adversarial learning, and beyond.