Data-Driven DRO via Optimal Transport

Updated 6 August 2025
  • Data-Driven DRO is a robust optimization technique that constructs ambiguity sets from sample data to immunize models against perturbations.
  • The approach leverages optimal transport discrepancies and metric learning to adaptively regularize models by reflecting the data's discriminative geometry.
  • Empirical results show improved training and testing performance, evidencing enhanced resilience to noise and outliers in high-dimensional settings.

A data-driven Distributionally Robust Optimization (DRO) approach leverages sample data to construct an ambiguity set—typically a statistical neighborhood of the empirical distribution—such that solutions are immunized against plausible perturbations of the underlying data-generating process. The central technical challenge is designing, calibrating, and optimizing over this neighborhood to balance performance and robustness, particularly in machine learning and statistical estimation tasks where overfitting to noise or outliers can be catastrophic.

1. Formulation: Data-Driven Ambiguity Sets via Optimal Transport

The core DRO formulation considered is

$$\min_{\beta} \max_{P \in \mathcal{U}_{\delta}(P_n)} E_P[\, l(X, Y, \beta) \,]$$

where $P_n$ is the empirical distribution and the ambiguity set $\mathcal{U}_{\delta}(P_n)$ is defined by an optimal transport discrepancy, $\mathcal{U}_{\delta}(P_n) = \{ P : D_c(P, P_n) \leq \delta \}$, with

$$D_c(P, P_n) = \inf_{\substack{\pi \in \mathcal{P}(\operatorname{supp}(P) \times \operatorname{supp}(P_n)) \\ \pi_U = P,\ \pi_V = P_n}} E_{\pi}[\, c(U, V) \,]$$

Here, $c(u, v)$ is the cost associated with transporting mass from $v$ to $u$. Previous work established that for appropriate cost functions $c(\cdot)$, classical regularized estimators (such as the Lasso, support vector machines, and regularized logistic regression) are special cases of the DRO problem, with the regularization parameter $\delta$ interpretable as the radius or "budget" of the ambiguity set.
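
For concreteness, the discrepancy $D_c$ between two discrete distributions can be computed as a small linear program over couplings $\pi$. The sketch below is an illustrative assumption, not code from the paper: it fixes a squared Euclidean cost and solves the coupling LP with SciPy; the function name is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def ot_discrepancy(U, p, V, q):
    """D_c(P, P_n) for discrete P = (U, p) and P_n = (V, q) with squared Euclidean cost."""
    m, n = len(U), len(V)
    C = ((U[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)   # c(u_i, v_j)
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0       # row marginal: sum_j pi_ij = p_i
    for j in range(n):
        A_eq[m + j, j::n] = 1.0                # column marginal: sum_i pi_ij = q_j
    b_eq = np.concatenate([p, q])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Example: P_n is the empirical distribution of five points; P perturbs them slightly.
rng = np.random.default_rng(0)
V = rng.normal(size=(5, 2)); q = np.full(5, 0.2)
U = V + 0.1 * rng.normal(size=(5, 2)); p = np.full(5, 0.2)
print(ot_discrepancy(U, p, V, q))
```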

2. Data-Driven Learning of the Transport Cost: Metric Learning

The main methodological contribution is to learn the transport cost $c$ from the data itself, instead of fixing it a priori. For example, for classification or regression problems, a commonly used parametric form for the cost is a Mahalanobis distance:

$$c_\Lambda((x, y), (x', y')) = d_\Lambda^2(x, x') \cdot I(y = y') + \infty \cdot I(y \neq y')$$

where

$$d_\Lambda(x, x') = \left( (x - x')^\top \Lambda (x - x') \right)^{1/2}, \quad \Lambda \succeq 0$$

The matrix $\Lambda$ is estimated by metric learning: using labeled data, one defines sets $\mathcal{M}$ (pairs to be close, labels agree) and $\mathcal{N}$ (pairs to be far, labels differ), and solves

$$\min_{\Lambda \in \operatorname{PSD}} \sum_{(x_i, x_j) \in \mathcal{M}} d_\Lambda^2(x_i, x_j) \quad \text{subject to} \quad \sum_{(x_i, x_j) \in \mathcal{N}} d_\Lambda^2(x_i, x_j) \geq \bar{\lambda}$$

This ensures that the cost used in the subsequent DRO accurately reflects the discriminative structure of the data: samples with identical labels should be close, and samples with different labels should be far apart, in the induced metric.
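
A minimal sketch of this estimation step is given below. It is an assumption-laden simplification rather than the exact constrained formulation: the constraint level $\bar{\lambda}$ is replaced by a hypothetical trade-off weight gamma plus a Frobenius regularizer rho, and the problem is solved by projected gradient descent onto the PSD cone.

```python
import numpy as np

def project_psd(A):
    """Project a symmetric matrix onto the positive semidefinite cone."""
    w, V = np.linalg.eigh((A + A.T) / 2)
    return (V * np.clip(w, 0.0, None)) @ V.T

def learn_metric(X, y, gamma=1.0, rho=1.0, lr=0.01, iters=500):
    """Projected gradient descent on sum_M d^2 - gamma * sum_N d^2 + (rho/2) ||Lambda||_F^2."""
    n, d = X.shape
    # d_Lambda^2(x, x') is linear in Lambda with gradient (x - x')(x - x')^T,
    # so the pair terms contribute a fixed matrix G.
    G = np.zeros((d, d))
    for i in range(n):
        for j in range(i + 1, n):
            diff = np.outer(X[i] - X[j], X[i] - X[j])
            G += diff if y[i] == y[j] else -gamma * diff
    Lam = np.eye(d)
    for _ in range(iters):
        Lam = project_psd(Lam - lr * (G + rho * Lam))
    return Lam

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=40) > 0).astype(int)
Lambda_hat = learn_metric(X, y)
```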

3. Explicit Regularization and Reformulations

Plugging the learned cost $c_\Lambda$ into the DRO problem, the inner maximization admits an explicit reduction for several classes of loss function, resulting in adaptive regularization. For linear regression with quadratic loss,

$$\min_{\beta} \max_{P : D_{c_\Lambda}(P, P_n) \leq \delta} E_P\left[ (Y - X^\top \beta)^2 \right] = \min_\beta \left( \left[ \frac{1}{n} \sum_{i=1}^n (Y_i - X_i^\top \beta)^2 \right]^{1/2} + \sqrt{\delta}\, \|\beta\|_{\Lambda^{-1}} \right)^2$$

In the logistic regression case,

$$\min_{\beta} \max_{P : D_{c_\Lambda}(P, P_n) \leq \delta} E_P\left[ \log(1 + e^{-Y X^\top \beta}) \right] = \min_\beta \frac{1}{n} \sum_{i=1}^n \log(1 + e^{-Y_i X_i^\top \beta}) + \delta\, \|\beta\|_{\Lambda^{-1}}$$

The regularization penalty is thus determined by the learned metric, yielding an adaptive regularization that reflects the local geometry of the data.
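
As a concrete illustration, here is a minimal sketch of both reformulated objectives as plain penalized empirical-risk functions, assuming a fixed and invertible learned matrix $\Lambda$; the function names are illustrative.

```python
import numpy as np

def lambda_norm(beta, Lam):
    """||beta||_{Lambda^{-1}} = sqrt(beta^T Lambda^{-1} beta); Lambda assumed invertible."""
    return np.sqrt(beta @ np.linalg.solve(Lam, beta))

def dro_linear_regression_objective(beta, X, Y, Lam, delta):
    """(RMSE + sqrt(delta) * ||beta||_{Lambda^{-1}})^2, the quadratic-loss reformulation."""
    rmse = np.sqrt(np.mean((Y - X @ beta) ** 2))
    return (rmse + np.sqrt(delta) * lambda_norm(beta, Lam)) ** 2

def dro_logistic_regression_objective(beta, X, Y, Lam, delta):
    """Empirical logistic loss (Y in {-1, +1}) plus delta * ||beta||_{Lambda^{-1}}."""
    margins = Y * (X @ beta)
    return np.mean(np.logaddexp(0.0, -margins)) + delta * lambda_norm(beta, Lam)
```

Either objective can then be minimized with a generic solver (for example scipy.optimize.minimize); the only DRO-specific inputs are the learned $\Lambda$ and the radius $\delta$.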

4. Computational Strategies: Dual Reformulation and SGD

For general (possibly nonlinear) losses or feature maps $\Phi(\cdot)$, a closed-form characterization of the maximization over $P$ is not available. The authors propose a stochastic optimization scheme:

  1. Initialization: $\beta \leftarrow$ empirical risk minimizer, $\lambda \leftarrow 0$, small smoothing parameter $\epsilon$.
  2. Iterative Updates:

    • For each batch, sample $L$ points $u_k$ from a reference distribution $f$ (e.g., Gaussian).
    • For each data point $(X, Y)$ compute:

    $$\varphi_{\epsilon,f}(X, Y, \beta, \lambda) = \epsilon \log \int \exp\left( \frac{\psi(u, X, Y, \beta, \lambda)}{\epsilon} \right) f(u)\, du$$

    where

    $$\psi(u, X, Y, \beta, \lambda) = \ell(u, Y, \beta) - \lambda\left( c(u, X) - \delta \right)$$

    • Estimate the gradients $\nabla_\beta \varphi_{\epsilon,f}$ and $\nabla_\lambda \varphi_{\epsilon,f}$ and perform a gradient update.

This stochastic smoothing/dual approach exploits the Fenchel duality structure of the DRO objective and allows efficient mini-batch optimization for high-dimensional or nonlinear models.
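
The sketch below illustrates one smoothed evaluation and its Monte Carlo gradients for a single data point, under the assumptions of a logistic loss, a Mahalanobis cost $c_\Lambda$, and a Gaussian reference distribution $f$ centred at the observed feature vector; these choices and the helper names are illustrative rather than the paper's exact implementation.

```python
import numpy as np
from scipy.special import expit, logsumexp

def smoothed_phi_and_grads(x, y, beta, lam, Lam, delta, eps=0.1, L=64, rng=None):
    """Monte Carlo estimate of phi_{eps,f}(x, y, beta, lam) and its gradients."""
    rng = rng or np.random.default_rng()
    U = x + rng.normal(size=(L, len(x)))                   # u_k drawn from the reference f
    margins = y * (U @ beta)
    loss = np.logaddexp(0.0, -margins)                     # logistic loss l(u_k, y, beta)
    cost = np.einsum('ki,ij,kj->k', U - x, Lam, U - x)     # c_Lambda(u_k, x)
    psi = loss - lam * (cost - delta)
    phi = eps * (logsumexp(psi / eps) - np.log(L))         # smoothed inner maximum
    w = np.exp((psi - psi.max()) / eps)
    w /= w.sum()                                           # softmax weights over the samples
    grad_beta = (w[:, None] * ((-y * expit(-margins))[:, None] * U)).sum(axis=0)
    grad_lam = -(w * (cost - delta)).sum()
    return phi, grad_beta, grad_lam

# One SGD step averages (grad_beta, grad_lam) over a mini-batch and updates
# beta <- beta - lr * grad_beta and lam <- max(0, lam - lr * grad_lam).
```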

5. Empirical Performance and Adaptive Regularization

Empirical studies on benchmark datasets (e.g., UCI repository) demonstrate the efficacy of the data-driven DRO approach:

  • Both linear DRO (DRO-L) and nonlinear DRO (DRO-NL) reduce testing and training loss relative to plain logistic regression (LR) and $L_1$-regularized logistic regression (LRL1).
  • Prediction accuracy is consistently improved by DRO methods.
  • Learning the cost function adaptively focuses the uncertainty set—thus, the regularization acts primarily on directions in parameter space corresponding to high variability or low predictive stability.

This approach yields both theoretical and practical advantages: it provides a direct, interpretable link between probabilistic uncertainty and regularization, and empirical gains in generalization, especially in regimes with complex or high-dimensional data geometry.

6. Implementation Considerations and Limitations

  • Data requirements: Accurate metric learning requires sufficient labeled side information to distinguish the $\mathcal{M}$ and $\mathcal{N}$ sets. In settings with scarce labels, the quality of the learned cost function (and thus robustness) diminishes.
  • Loss function class: Explicit analytical reformulation is available for certain losses (quadratic, logistic); more general losses require soft-max smoothing and stochastic optimization.
  • Computational cost: The dual stochastic gradient algorithm is efficient but introduces additional hyperparameters (e.g., the smoothing parameter $\epsilon$, batch size, and number of inner samples $L$).
  • Regularization parameter selection: The neighborhood size $\delta$ should be tuned (e.g., via cross-validation, as in the sketch after this list) to optimize test performance or selected by statistical criteria based on the hypothesis class and sample size.
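
A minimal cross-validation sketch over candidate radii is shown below; it restates the DRO logistic objective from Section 3 for self-containment, assumes a fixed invertible learned $\Lambda$, and uses a generic SciPy solver. The function names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def dro_logistic_obj(beta, X, Y, Lam, delta):
    """Empirical logistic loss plus delta * ||beta||_{Lambda^{-1}} (Lambda invertible)."""
    penalty = np.sqrt(beta @ np.linalg.solve(Lam, beta))
    return np.mean(np.logaddexp(0.0, -Y * (X @ beta))) + delta * penalty

def choose_delta(X, Y, Lam, deltas, k=5, seed=0):
    """Pick the radius delta with the smallest k-fold held-out logistic loss."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for delta in deltas:
        fold_losses = []
        for f in range(k):
            test = folds[f]
            train = np.concatenate([folds[g] for g in range(k) if g != f])
            beta0 = 0.01 * rng.normal(size=X.shape[1])      # avoid the kink at beta = 0
            beta = minimize(dro_logistic_obj, beta0,
                            args=(X[train], Y[train], Lam, delta)).x
            fold_losses.append(np.mean(np.logaddexp(0.0, -Y[test] * (X[test] @ beta))))
        scores.append(np.mean(fold_losses))
    return deltas[int(np.argmin(scores))]
```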

7. Connections and Broader Implications

This data-driven DRO framework—with learned optimal transport cost—unifies the interpretations of regularized estimators, optimal transport-based uncertainty sets, and metric learning. The regularization is both adaptive (reflecting learned geometry) and probabilistically interpretable (as a budget for adversarial perturbation):

  • Estimators correspond to specific choices of cost; adaptive regularization based on the learned $\Lambda$ enhances generalization (Blanchet et al., 2017).
  • The framework allows interpretation of classical and contemporary algorithms (e.g., SVM, Lasso, regularized logistic regression) as instances of DRO.
  • The methodology can be naturally extended to nonlinear representations (feature maps, kernels), complex output spaces, and more general optimal transport costs—subject to computational tractability via stochastic or dual optimization.

This approach provides a principled, data-dependent pathway for tailoring robustness in modern learning systems, unifying several directions in robust statistics, adversarial machine learning, and regularization theory under the lens of optimal transport-based DRO.

References (1)