Wasserstein Ambiguity Sets in Robust Optimization

Updated 15 August 2025
  • Wasserstein ambiguity sets are defined as the collection of distributions within a specified Wasserstein distance from a reference empirical measure, ensuring model robustness.
  • They underpin distributionally robust optimization by framing a local minimax risk that hedges against worst-case distributions and addresses model uncertainty.
  • These sets enable computationally tractable dual reformulations with strong statistical guarantees via concentration inequalities and geometric insights.

A Wasserstein ambiguity set is a collection of probability distributions defined as a ball—according to the Wasserstein (optimal transport) metric—centered at a reference distribution, typically the empirical distribution obtained from observed data. This concept plays a central role in distributionally robust optimization (DRO) and learning theory, where the goal is to hedge against worst-case distributions close to the empirical or nominal model. By using Wasserstein balls as ambiguity sets, one can systematically address model uncertainty, nonparametric robustness, and distributional drift with strong statistical guarantees.

1. Mathematical Formulation and Definition

Given a Polish space $\mathcal{Z}$ with metric $d_{\mathcal{Z}}$, the $p$-Wasserstein distance between two probability measures $P, Q$ with finite $p$-th moments is

$$W_p(P, Q) = \inf_{M} \left( \mathbb{E}_{M}\left[ d_{\mathcal{Z}}^p(Z, Z') \right] \right)^{1/p}$$

where the infimum is over all couplings $M$ of $(Z, Z')$ with marginals $P$ and $Q$.

The Wasserstein ambiguity set of radius $\rho \geq 0$ (order $p$) centered at $P$ is

$$\mathcal{A}(P) = B^W_{\rho,p}(P) := \{ Q \in \mathcal{P}_p(\mathcal{Z}) : W_p(P, Q) \leq \rho \}$$

as introduced in (Lee et al., 2017). Here, $\mathcal{P}_p(\mathcal{Z})$ denotes the set of probability measures on $\mathcal{Z}$ with finite $p$-th moment.
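As a concrete illustration of the definition: for two empirical measures on $\mathbb{R}$ with the same number of atoms, the optimal coupling matches sorted samples (order statistics), giving a closed form for $W_p$. A minimal sketch (the function name and interface are illustrative, not from any cited paper):

```python
import numpy as np

def wasserstein_p_1d(x, y, p=1):
    """p-Wasserstein distance between two equal-size empirical
    measures on the real line: the optimal coupling matches
    sorted samples, so W_p^p = mean of |x_(i) - y_(i)|^p."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert x.shape == y.shape, "equal sample sizes assumed"
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

# Ball membership: Q lies in the radius-rho ambiguity set around P
# iff W_p(P, Q) <= rho.  A translation by c has W_p = c for every p:
x = [0.0, 1.0, 2.0]
y = [0.5, 1.5, 2.5]                  # x shifted by 0.5
print(wasserstein_p_1d(x, y, p=1))   # 0.5
```

In higher dimensions no such closed form exists and $W_p$ must be computed via linear programming or entropic regularization.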

Variants include balls of different orders $p$, Wasserstein–Bregman divergence balls (Guo et al., 2017), and structured multi-transport sets that exploit component-wise independence (Chaouach et al., 9 Apr 2025); see Section 6.

2. Role in Distributionally Robust Optimization and Learning

In DRO, Wasserstein ambiguity sets enable a minimax formulation, where the decision or predictor is chosen to minimize the worst-case expected loss over all distributions within the ambiguity set:

$$\inf_{f \in \mathcal{F}} \sup_{Q \in B^W_{\rho,p}(P)} \mathbb{E}_Q[f(Z)]$$

This "local minimax risk" (Lee et al., 2017) provides protection against distributional misspecification and sample uncertainty by considering all distributions within a controlled, measure-theoretic neighborhood of the data-generating process.

The construction reflects three essential properties:

  • Data-driven: The ambiguity set is centered at the empirical distribution, exploiting observed data only.
  • Geometric: Unlike $f$-divergence–based sets (e.g., KL or $\chi^2$), the Wasserstein distance captures the geometry (cost of moving probability mass) and allows both discrete and continuous candidate distributions (Guo et al., 2017).
  • Nonparametric: No explicit model assumptions are imposed beyond those enforced by the metric and moment conditions.

3. Statistical Guarantees and Concentration

The radius $\rho$ of the Wasserstein ambiguity set is often chosen based on non-asymptotic concentration results:

$$\mathbb{P}\left( W_p(\widehat{P}_n, P) > r \right) \leq C_1 \exp\left(-C_2\, n\, r^{d/p}\right)$$

where $\widehat{P}_n$ is the empirical distribution, $n$ is the sample size, $d$ is the dimension of the support, and $C_1, C_2$ depend on the metric and dimensionality (Lee et al., 2017, Boskos et al., 2019, Boskos et al., 2021). This justifies data-dependent selection of $\rho$ to ensure that, with specified high probability, the true distribution is contained in the ambiguity set. In high-dimensional or complex scenarios, the ambiguity set's statistical coverage can be further improved by exploiting structure, e.g., component-wise independence (Chaouach et al., 9 Apr 2025).
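Inverting the concentration bound gives an explicit radius schedule: setting the right-hand side equal to a target failure probability $\delta$ yields $r = \left(\ln(C_1/\delta)/(C_2 n)\right)^{p/d}$. A sketch, with $C_1, C_2$ as placeholder constants (in practice they depend on the metric space and must be bounded separately):

```python
import math

def ambiguity_radius(n, delta, d, p=1, C1=1.0, C2=1.0):
    """Smallest radius r with C1 * exp(-C2 * n * r**(d/p)) <= delta,
    i.e. P(W_p(P_hat_n, P) > r) <= delta under the concentration
    bound above.  C1, C2 are placeholder constants, not universal."""
    return (math.log(C1 / delta) / (C2 * n)) ** (p / d)

# The radius shrinks at rate n^{-p/d}: slower in higher dimension,
# one face of the curse of dimensionality for Wasserstein balls.
print(ambiguity_radius(n=100, delta=0.05, d=2, p=1))
```

For fixed $\delta$ and $d$, doubling the sample size shrinks the radius by a factor $2^{-p/d}$, which quantifies how fast the ambiguity set tightens around the true distribution.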

4. Computational Tractability and Reformulation

A central property is that, for broad classes of loss functions $f$ (including Lipschitz, piecewise linear, convex, and risk measures such as CVaR), the inner supremum over $Q \in B^W_{\rho,p}(P)$ is dualizable or amenable to convex reformulation (Lee et al., 2017, Hota et al., 2018, Guo et al., 2017, Jackiewicz et al., 2023):

  • For Lipschitz $f$, strong duality yields:

$$\mathbb{E}_Q[f(Z)] \leq \mathbb{E}_P[f(Z)] + L \rho$$

for all $Q \in B^W_{\rho,p}(P)$, where $L$ is the Lipschitz constant of $f$ (the $W_1$ bound extends to $p \geq 1$ since $W_1 \leq W_p$).
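A quick numerical sanity check of the Lipschitz bound (a sketch, not the dual derivation itself; the distribution and loss are illustrative): translating $P$ by $\rho$ produces a $Q$ with $W_1(P, Q) = \rho$, and the shifted expectation stays below $\mathbb{E}_P[f(Z)] + L\rho$.

```python
import numpy as np

rng = np.random.default_rng(0)
L, rho = 2.0, 0.3

def f(z):                       # an L-Lipschitz loss
    return L * np.abs(z)

p_samples = rng.normal(0.0, 1.0, 10_000)   # reference distribution P
q_samples = p_samples + rho                # Q = P translated by rho,
                                           # so W_1(P, Q) = rho exactly

lhs = f(q_samples).mean()                  # E_Q[f(Z)]
rhs = f(p_samples).mean() + L * rho        # E_P[f(Z)] + L * rho
assert lhs <= rhs                          # the duality bound holds
print(lhs, rhs)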

  • For robust chance-constrained and CVaR-regularized programs, the worst-case risk can often be rewritten as (dual variable) regularized or penalized convex problems (e.g., via infimal convolutions), enabling finite-dimensional (sometimes mixed-integer) optimization with established complexity (Hota et al., 2018, Ho-Nguyen et al., 2020, Jackiewicz et al., 2023).

Advances include:

  • Cutting-plane and stochastic approximation methods for semi-infinite or submodular set programs (Shen et al., 2020)
  • Column and row generation for combinatorial optimization with high-dimensional supports (Jackiewicz et al., 2023)
  • Product measure–based decompositions and clustering for scaling to large multi-dimensional uncertainties (Chaouach et al., 9 Apr 2025)

5. Applications Across Domains

Wasserstein ambiguity sets have been applied in a diverse array of modern settings:

  • Statistical learning robustness: Empirical risk minimization under Wasserstein balls yields improved out-of-sample risk guarantees, selects robust hypotheses, and effectively addresses overfitting via complexity regularization tied to covering numbers of function classes (Lee et al., 2017).
  • Distributionally robust control: For nonlinear and linear systems, ambiguity propagation through system dynamics facilitates robust feedback synthesis, density steering, and constraint satisfaction with quantifiable risk, as in robust Kalman filtering (Shafieezadeh-Abadeh et al., 2018), MPC (Zhong et al., 2023), and LTI density control (Pilipovsky et al., 19 Mar 2024).
  • Chance-constrained and combinatorial optimization: Wasserstein ambiguity sets enable precise control of probabilistic constraint satisfaction (e.g., in ground holding, surgery assignment, set covering, and combinatorial cost minimization), often via tractable conic or MILP reformulations and with strong out-of-sample feasibility guarantees (Shehadeh, 2021, Ho-Nguyen et al., 2020, Shen et al., 2020, Wu et al., 2023, Jackiewicz et al., 2023).
  • Federated and distributed learning: Mixtures of Wasserstein balls facilitate robust federated support vector machine training, allowing local data heterogeneity, privacy, and explicit regularization for feature and label noise (Ibrahim et al., 4 Oct 2024).
  • Nonparametric-prescriptive and Bayesian settings: Wasserstein ambiguity sets can be integrated with clustering (e.g., DPMMs) to yield global-local robust DRO, enhancing reliability and reducing conservatism of prescriptive analytics in energy systems and finance (Wang et al., 2021, Ning et al., 2023).

6. Theoretical Comparisons and Extensions

Compared with $f$-divergence–based ambiguity sets, Wasserstein sets are generally less conservative and admit more flexible geometric modeling:

  • The Wasserstein distance remains finite between continuous and empirical (atomic) measures, whereas the KL divergence diverges because an atomic measure is not absolutely continuous with respect to a continuous one (Guo et al., 2017).
  • The Wasserstein-Bregman divergence unifies geometry sensitivity with convexity and asymmetry, interpolating between pure transport and information-theoretic models (Guo et al., 2017).
  • Structured optimal transport (multi-transport hyperrectangles) enables sharper coverage and reduced conservatism by matching the true dependency structure of uncertainty, at the cost of increased complexity that must be addressed by clustering and dimensionality reduction (Chaouach et al., 9 Apr 2025).
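The finiteness contrast in the first comparison above can be checked numerically: the 1-Wasserstein distance between an atomic measure and (a large sample from) a Gaussian is finite, while the corresponding KL divergence is infinite by failure of absolute continuity. A sketch using SciPy's one-dimensional $W_1$ routine:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
continuous = rng.normal(0.0, 1.0, 50_000)   # stand-in for N(0, 1)
atoms = np.array([-1.0, 0.0, 1.0])          # atomic (empirical) measure

# W_1 between an atomic and a (near-)continuous measure is finite...
print(wasserstein_distance(atoms, continuous))

# ...whereas KL(atoms || N(0,1)) = +inf: the three-point measure has
# no density with respect to the Gaussian, so the divergence blows up.
```

This is precisely why Wasserstein balls can be centered at the raw empirical distribution while still containing continuous candidate models.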

Convergence and statistical guarantees are robust: as the sample size increases, the optimal value and optimizer for robust risk- and chance-constrained programs under Wasserstein ambiguity sets converge to their true (non-robust) stochastic counterparts under technical continuity and regularity assumptions (Cherukuri et al., 2020).

7. Illustrative Example and Interpretation

A canonical example (Lee et al., 2017) considers ERM with hypotheses $f_0$ and $f_1$ such that $f_1$ incurs catastrophic loss in regions not covered by the training data. Minimizing empirical risk alone prefers $f_1$, but with a Wasserstein ambiguity set, the worst-case risk "inflates" $f_1$'s value, leading the distributionally robust learner to favor $f_0$. This illustrates the adversarial mass-transport interpretation: the radius $\rho$ acts as a budget for moving probability mass, so the method penalizes non-robust solutions by accounting for plausible but previously unseen perturbations near the support of the empirical distribution.
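A toy numerical version of this example, using the $W_1$ Lipschitz surrogate from Section 4 (empirical risk plus $L\rho$) in place of the exact worst-case risk; the losses, data, and constants are illustrative:

```python
import numpy as np

z_train = np.array([0.0, 0.01, -0.01, 0.005])   # data clustered near 0

# Hypothetical losses: f1 fits the training region almost perfectly
# but blows up away from it (large Lipschitz constant L1 = 10);
# f0 is flat (L0 = 0) with a modest constant loss.
loss_f0 = lambda z: 0.2 + 0.0 * z
loss_f1 = lambda z: 10.0 * np.abs(z)

rho = 0.1                                        # ambiguity radius
emp0, emp1 = loss_f0(z_train).mean(), loss_f1(z_train).mean()
rob0, rob1 = emp0 + 0.0 * rho, emp1 + 10.0 * rho  # emp + L * rho

print(emp1 < emp0)   # True: ERM prefers the brittle f1
print(rob0 < rob1)   # True: DRO prefers the robust f0
```

The $L\rho$ penalty is exactly the "inflation" described above: the adversary's transport budget converts $f_1$'s sensitivity off the training support into an explicit cost.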


This conceptual and technical foundation makes Wasserstein ambiguity sets a flexible, statistically principled, and computationally tractable approach to distributional robustness in inference, optimization, and control. Their adoption is supported by deep theoretical results and successful real-world applications across the data sciences, operations research, and engineering.
