Data-Driven DRO via Optimal Transport

Updated 6 August 2025
  • Data-Driven DRO is a robust optimization technique that constructs ambiguity sets from sample data to immunize models against perturbations.
  • The approach leverages optimal transport discrepancies and metric learning to adaptively regularize models by reflecting the data's discriminative geometry.
  • Empirical results show improved training and testing performance, evidencing enhanced resilience to noise and outliers in high-dimensional settings.

A data-driven Distributionally Robust Optimization (DRO) approach leverages sample data to construct an ambiguity set—typically a statistical neighborhood of the empirical distribution—such that solutions are immunized against plausible perturbations of the underlying data-generating process. The central technical challenge is designing, calibrating, and optimizing over this neighborhood to balance performance and robustness, particularly in machine learning and statistical estimation tasks where overfitting to noise or outliers can be catastrophic.

1. Formulation: Data-Driven Ambiguity Sets via Optimal Transport

The core DRO formulation considered is

$$\min_{\beta} \max_{P \in \mathcal{U}_{\delta}(P_n)} E_P[\, l(X, Y, \beta) \,]$$

where $P_n$ is the empirical distribution and the ambiguity set $\mathcal{U}_{\delta}(P_n)$ is defined by an optimal transport discrepancy, $\mathcal{U}_{\delta}(P_n) = \{ P : D_c(P, P_n) \leq \delta \}$, with

$$D_c(P, P_n) = \inf_{\substack{\pi \in \mathcal{P}(\operatorname{supp}(P) \times \operatorname{supp}(P_n)) \\ \pi_U = P,\ \pi_V = P_n}} E_{\pi}[\, c(U, V) \,]$$

Here, $c(u, v)$ is the cost associated with transporting mass from $v$ to $u$. Previous work established that for appropriate cost functions $c(\cdot)$, classical regularized estimators (such as the Lasso, support vector machines, and regularized logistic regression) are special cases of the DRO problem, with the regularization parameter $\delta$ interpretable as the radius or "budget" of the ambiguity set.
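
For concreteness, the discrepancy $D_c$ between two discrete distributions can be computed as a small linear program over couplings $\pi$. The sketch below is an illustrative assumption, not code from the paper: it fixes a squared Euclidean cost and solves the coupling LP with SciPy; the function name is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def ot_discrepancy(U, p, V, q):
    """D_c(P, P_n) for discrete P = (U, p) and P_n = (V, q) with squared Euclidean cost."""
    m, n = len(U), len(V)
    C = ((U[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)   # c(u_i, v_j)
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0       # row marginal: sum_j pi_ij = p_i
    for j in range(n):
        A_eq[m + j, j::n] = 1.0                # column marginal: sum_i pi_ij = q_j
    b_eq = np.concatenate([p, q])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Example: P_n is the empirical distribution of five points; P perturbs them slightly.
rng = np.random.default_rng(0)
V = rng.normal(size=(5, 2)); q = np.full(5, 0.2)
U = V + 0.1 * rng.normal(size=(5, 2)); p = np.full(5, 0.2)
print(ot_discrepancy(U, p, V, q))
```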

2. Data-Driven Learning of the Transport Cost: Metric Learning

The main methodological contribution is to learn the transport cost $c$ from the data itself, instead of fixing it a priori. For example, for classification or regression problems, a commonly used parametric form for the cost is a Mahalanobis distance:

$$c_\Lambda((x, y), (x', y')) = d_\Lambda^2(x, x') \cdot I(y = y') + \infty \cdot I(y \neq y')$$

where

$$d_\Lambda(x, x') = \left( (x - x')^\top \Lambda (x - x') \right)^{1/2}, \quad \Lambda \succeq 0$$

The matrix $\Lambda$ is estimated by metric learning: using labeled data, one defines sets $\mathcal{M}$ (pairs to be close, labels agree) and $\mathcal{N}$ (pairs to be far, labels differ), and solves

$$\min_{\Lambda \in \operatorname{PSD}} \sum_{(x_i, x_j) \in \mathcal{M}} d_\Lambda^2(x_i, x_j) \quad \text{subject to} \quad \sum_{(x_i, x_j) \in \mathcal{N}} d_\Lambda^2(x_i, x_j) \geq \bar{\lambda}$$

This ensures that the cost used in the subsequent DRO accurately reflects the discriminative structure of the data: samples with identical labels should be close, and samples with different labels should be far apart, in the induced metric.
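
A minimal sketch of this estimation step is given below. It is an assumption-laden simplification rather than the exact constrained formulation: the constraint level $\bar{\lambda}$ is replaced by a hypothetical trade-off weight gamma plus a Frobenius regularizer rho, and the problem is solved by projected gradient descent onto the PSD cone.

```python
import numpy as np

def project_psd(A):
    """Project a symmetric matrix onto the positive semidefinite cone."""
    w, V = np.linalg.eigh((A + A.T) / 2)
    return (V * np.clip(w, 0.0, None)) @ V.T

def learn_metric(X, y, gamma=1.0, rho=1.0, lr=0.01, iters=500):
    """Projected gradient descent on sum_M d^2 - gamma * sum_N d^2 + (rho/2) ||Lambda||_F^2."""
    n, d = X.shape
    # d_Lambda^2(x, x') is linear in Lambda with gradient (x - x')(x - x')^T,
    # so the pair terms contribute a fixed matrix G.
    G = np.zeros((d, d))
    for i in range(n):
        for j in range(i + 1, n):
            diff = np.outer(X[i] - X[j], X[i] - X[j])
            G += diff if y[i] == y[j] else -gamma * diff
    Lam = np.eye(d)
    for _ in range(iters):
        Lam = project_psd(Lam - lr * (G + rho * Lam))
    return Lam

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=40) > 0).astype(int)
Lambda_hat = learn_metric(X, y)
```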

3. Explicit Regularization and Reformulations

Plugging the learned cost $c_\Lambda$ into the DRO problem, the inner maximization admits an explicit reduction for several classes of loss function, resulting in adaptive regularization. For linear regression with quadratic loss,

$$\min_{\beta} \max_{P : D_{c_\Lambda}(P, P_n) \leq \delta} E_P\left[ (Y - X^\top \beta)^2 \right] = \min_\beta \left( \left[ \frac{1}{n} \sum_{i=1}^n (Y_i - X_i^\top \beta)^2 \right]^{1/2} + \sqrt{\delta}\, \|\beta\|_{\Lambda^{-1}} \right)^2$$

In the logistic regression case,

$$\min_{\beta} \max_{P : D_{c_\Lambda}(P, P_n) \leq \delta} E_P\left[ \log(1 + e^{-Y X^\top \beta}) \right] = \min_\beta \frac{1}{n} \sum_{i=1}^n \log(1 + e^{-Y_i X_i^\top \beta}) + \delta\, \|\beta\|_{\Lambda^{-1}}$$

The regularization penalty is thus determined by the learned metric, yielding an adaptive regularization that reflects the local geometry of the data.
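
As a concrete illustration, here is a minimal sketch of both reformulated objectives as plain penalized empirical-risk functions, assuming a fixed and invertible learned matrix $\Lambda$; the function names are illustrative.

```python
import numpy as np

def lambda_norm(beta, Lam):
    """||beta||_{Lambda^{-1}} = sqrt(beta^T Lambda^{-1} beta); Lambda assumed invertible."""
    return np.sqrt(beta @ np.linalg.solve(Lam, beta))

def dro_linear_regression_objective(beta, X, Y, Lam, delta):
    """(RMSE + sqrt(delta) * ||beta||_{Lambda^{-1}})^2, the quadratic-loss reformulation."""
    rmse = np.sqrt(np.mean((Y - X @ beta) ** 2))
    return (rmse + np.sqrt(delta) * lambda_norm(beta, Lam)) ** 2

def dro_logistic_regression_objective(beta, X, Y, Lam, delta):
    """Empirical logistic loss (Y in {-1, +1}) plus delta * ||beta||_{Lambda^{-1}}."""
    margins = Y * (X @ beta)
    return np.mean(np.logaddexp(0.0, -margins)) + delta * lambda_norm(beta, Lam)
```

Either objective can then be minimized with a generic solver (for example scipy.optimize.minimize); the only DRO-specific inputs are the learned $\Lambda$ and the radius $\delta$.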

4. Computational Strategies: Dual Reformulation and SGD

For general (possibly nonlinear) losses or feature maps $\Phi(\cdot)$, a closed-form characterization of the maximization over $P$ is not available. The authors propose a stochastic optimization scheme:

  1. Initialization: $\beta \leftarrow$ empirical risk minimizer, $\lambda \leftarrow 0$, small smoothing parameter $\epsilon$.
  2. Iterative Updates:

    • For each batch, sample $L$ points $u_k$ from a reference distribution $f$ (e.g., Gaussian).
    • For each data point $(X, Y)$ compute:

    $$\varphi_{\epsilon,f}(X, Y, \beta, \lambda) = \epsilon \log \int \exp\left( \frac{\psi(u, X, Y, \beta, \lambda)}{\epsilon} \right) f(u)\, du$$

    where

    $$\psi(u, X, Y, \beta, \lambda) = \ell(u, Y, \beta) - \lambda\left( c(u, X) - \delta \right)$$

    • Estimate the gradients $\nabla_\beta \varphi_{\epsilon,f}$ and $\nabla_\lambda \varphi_{\epsilon,f}$ and perform a gradient update.

This stochastic smoothing/dual approach exploits the Fenchel duality structure of the DRO objective and allows efficient mini-batch optimization for high-dimensional or nonlinear models.
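
The sketch below illustrates one smoothed evaluation and its Monte Carlo gradients for a single data point, under the assumptions of a logistic loss, a Mahalanobis cost $c_\Lambda$, and a Gaussian reference distribution $f$ centred at the observed feature vector; these choices and the helper names are illustrative rather than the paper's exact implementation.

```python
import numpy as np
from scipy.special import expit, logsumexp

def smoothed_phi_and_grads(x, y, beta, lam, Lam, delta, eps=0.1, L=64, rng=None):
    """Monte Carlo estimate of phi_{eps,f}(x, y, beta, lam) and its gradients."""
    rng = rng or np.random.default_rng()
    U = x + rng.normal(size=(L, len(x)))                   # u_k drawn from the reference f
    margins = y * (U @ beta)
    loss = np.logaddexp(0.0, -margins)                     # logistic loss l(u_k, y, beta)
    cost = np.einsum('ki,ij,kj->k', U - x, Lam, U - x)     # c_Lambda(u_k, x)
    psi = loss - lam * (cost - delta)
    phi = eps * (logsumexp(psi / eps) - np.log(L))         # smoothed inner maximum
    w = np.exp((psi - psi.max()) / eps)
    w /= w.sum()                                           # softmax weights over the samples
    grad_beta = (w[:, None] * ((-y * expit(-margins))[:, None] * U)).sum(axis=0)
    grad_lam = -(w * (cost - delta)).sum()
    return phi, grad_beta, grad_lam

# One SGD step averages (grad_beta, grad_lam) over a mini-batch and updates
# beta <- beta - lr * grad_beta and lam <- max(0, lam - lr * grad_lam).
```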

5. Empirical Performance and Adaptive Regularization

Empirical studies on benchmark datasets (e.g., UCI repository) demonstrate the efficacy of the data-driven DRO approach:

  • Both linear DRO (DRO-L) and nonlinear DRO (DRO-NL) reduce testing and training loss relative to plain logistic regression (LR) and $L_1$-regularized logistic regression (LRL1).
  • Prediction accuracy is consistently improved by DRO methods.
  • Learning the cost function adaptively focuses the uncertainty set—thus, the regularization acts primarily on directions in parameter space corresponding to high variability or low predictive stability.

This approach yields both theoretical and practical advantages: it provides a direct, interpretable link between probabilistic uncertainty and regularization, and empirical gains in generalization, especially in regimes with complex or high-dimensional data geometry.

6. Implementation Considerations and Limitations

  • Data requirements: Accurate metric learning requires sufficient labeled side information to distinguish the $\mathcal{M}$ and $\mathcal{N}$ sets. In settings with scarce labels, the quality of the learned cost function (and thus robustness) diminishes.
  • Loss function class: Explicit analytical reformulation is available for certain losses (quadratic, logistic); more general losses require soft-max smoothing and stochastic optimization.
  • Computational cost: The dual stochastic gradient algorithm is efficient but introduces additional hyperparameters (e.g., the smoothing parameter $\epsilon$, batch size, and number of inner samples $L$).
  • Regularization parameter selection: The neighborhood size $\delta$ should be tuned (e.g., via cross-validation, as in the sketch after this list) to optimize test performance or selected by statistical criteria based on the hypothesis class and sample size.
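
A minimal cross-validation sketch over candidate radii is shown below; it restates the DRO logistic objective from Section 3 for self-containment, assumes a fixed invertible learned $\Lambda$, and uses a generic SciPy solver. The function names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def dro_logistic_obj(beta, X, Y, Lam, delta):
    """Empirical logistic loss plus delta * ||beta||_{Lambda^{-1}} (Lambda invertible)."""
    penalty = np.sqrt(beta @ np.linalg.solve(Lam, beta))
    return np.mean(np.logaddexp(0.0, -Y * (X @ beta))) + delta * penalty

def choose_delta(X, Y, Lam, deltas, k=5, seed=0):
    """Pick the radius delta with the smallest k-fold held-out logistic loss."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for delta in deltas:
        fold_losses = []
        for f in range(k):
            test = folds[f]
            train = np.concatenate([folds[g] for g in range(k) if g != f])
            beta0 = 0.01 * rng.normal(size=X.shape[1])      # avoid the kink at beta = 0
            beta = minimize(dro_logistic_obj, beta0,
                            args=(X[train], Y[train], Lam, delta)).x
            fold_losses.append(np.mean(np.logaddexp(0.0, -Y[test] * (X[test] @ beta))))
        scores.append(np.mean(fold_losses))
    return deltas[int(np.argmin(scores))]
```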

7. Connections and Broader Implications

This data-driven DRO framework—with learned optimal transport cost—unifies the interpretations of regularized estimators, optimal transport-based uncertainty sets, and metric learning. The regularization is both adaptive (reflecting learned geometry) and probabilistically interpretable (as a budget for adversarial perturbation):

  • Estimators correspond to specific choices of cost; adaptive regularization based on the learned $\Lambda$ enhances generalization (Blanchet et al., 2017).
  • The framework allows interpretation of classical and contemporary algorithms (e.g., SVM, Lasso, regularized logistic regression) as instances of DRO.
  • The methodology can be naturally extended to nonlinear representations (feature maps, kernels), complex output spaces, and more general optimal transport costs—subject to computational tractability via stochastic or dual optimization.

This approach provides a principled, data-dependent pathway for tailoring robustness in modern learning systems, unifying several directions in robust statistics, adversarial machine learning, and regularization theory under the lens of optimal transport-based DRO.

References (1)