
Distributionally Robust Importance Sampling

Updated 6 January 2026
  • DRIS is a framework that integrates distributionally robust optimization with importance sampling to yield statistical guarantees and improved efficiency in risk estimation.
  • It employs divergence-constrained adversarial weighting to robustly reweight samples against model uncertainty and distribution shifts.
  • DRIS has practical applications in deep learning, rare-event simulation, and uncertainty quantification, demonstrating improved worst-case performance.

Distributionally Robust Importance Sampling (DRIS) refers to a suite of methodologies that merge Distributionally Robust Optimization (DRO) with importance sampling to provide statistical guarantees and improved efficiency when estimating risk, optimizing loss, or simulating rare events under distributional ambiguity. DRIS addresses the challenge that the true data-generating process is uncertain and potentially subject to distribution shift, by searching for estimators or predictors that perform well across a set of plausible models, rather than under a single baseline law.

1. Theoretical Foundations

DRIS is grounded in the intersection of DRO and importance sampling. In the DRO framework, one replaces a single nominal probability distribution $P_0$ with an ambiguity set $\mathcal{P}$ of possible data-generating distributions. Typical ambiguity sets are $\phi$-divergence balls (e.g., KL-divergence, $\chi^2$-divergence) or Wasserstein balls centered at $P_0$.

The DRIS objective is to estimate

$$\sup_{P \in \mathcal{P}} \mathbb{E}_{P}[h(X)]$$

or solve robust decision problems such as

$$\min_{\theta} \sup_{P \in \mathcal{P}} \mathbb{E}_{P}\left[L(\theta; X)\right] + R(\theta)$$

where $R(\theta)$ is a regularizer. The fusion with importance sampling enters by noting that for $P \ll P_0$, one can write $dP/dP_0 = w(x) \geq 0$, and thus

$$\mathbb{E}_P[h(X)] = \mathbb{E}_{P_0}[h(X)\, w(X)]$$

with $w$ constrained to satisfy $\mathbb{E}_{P_0}[w] = 1$ and divergence or moment conditions derived from $\mathcal{P}$ (Bai et al., 2021).
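The change-of-measure identity above can be checked numerically. A small illustrative sketch (not from the cited papers) with $P_0 = N(0,1)$, a mean-shifted target $P = N(\mu, 1)$, and $h(x) = x^2$, for which the likelihood ratio $w$ is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Baseline P0 = N(0, 1); target P = N(mu, 1).  For this pair the
# likelihood ratio w(x) = dP/dP0(x) = exp(mu*x - mu^2/2) is closed-form.
mu = 1.0

def h(x):
    return x ** 2  # statistic of interest; E_P[h] = mu^2 + 1 = 2

x0 = rng.standard_normal(200_000)        # draws from the baseline P0
w = np.exp(mu * x0 - 0.5 * mu ** 2)      # dP/dP0, satisfying E_{P0}[w] = 1

is_estimate = np.mean(h(x0) * w)         # E_{P0}[h(X) w(X)]
direct = np.mean(h(rng.standard_normal(200_000) + mu))  # E_P[h(X)]
```

Both estimates converge to $\mathbb{E}_P[X^2] = \mu^2 + 1 = 2$; DRIS replaces the fixed $w$ here with an adversarially chosen one.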

2. DRO and Importance Sampling: Methodological Integration

DRIS algorithms generally proceed by (i) characterizing the optimal adversarial weights $w^*$ via a variational principle (e.g., maximizing weighted loss subject to divergence constraints), and (ii) using these weights as importance sampling ratios to reweight samples for estimation or optimization.

A prototypical DRIS estimator (for risk estimation) takes the form:

$$\widehat\mu_{\rm DRIS} = \frac{1}{N} \sum_{i=1}^N h(X_i)\, w^*(X_i)$$

where the $X_i$ are drawn either from $P_0$ or from a proposal $q$, and $w^*$ arises from the DRO inner maximization, often solved on a sampled support as a linear program (Bai et al., 2021; Bai et al., 2020).

DRO Objective and Parametric Likelihood Ratios

The DRO objective is recast in terms of likelihood ratios $r(x, y)$:

$$\min_\theta \max_{r \in R} \mathbb{E}_{(x, y) \sim p}[r(x, y)\, \ell_\theta(x, y)]$$

where $R = \{ r \geq 0 : \mathbb{E}_p[r] = 1,\ D(rp \Vert p) \leq \kappa \}$, with $D$ a divergence, typically KL (Michel et al., 2022).

Parametrizations $r_\phi(x, y) \propto \exp(f_\phi(x, y))$ are used, and the constraints are enforced with penalties (mini-batch normalization for $\mathbb{E}_p[r] = 1$, a Lagrange penalty for the KL ball). Algorithms alternate gradient updates of the model and adversary parameters, efficiently searching for robust predictors (Michel et al., 2022).
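A minimal NumPy sketch of one adversary evaluation under these conventions (toy losses, a stand-in score function `f_phi`, and an assumed penalty coefficient `tau`; in practice $f_\phi$ is a small network over $(x, y)$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch: per-example losses and stand-in adversary scores f_phi.
losses = rng.uniform(0.0, 2.0, size=64)
f_phi = 0.5 * losses

# Mini-batch self-normalization enforces E_p[r] = 1 on the batch.
r = np.exp(f_phi)
r /= r.mean()

# Batch estimate of KL(rp || p) = E_p[r log r], used as a Lagrange penalty.
kl = np.mean(r * np.log(r))

tau = 1.0  # penalty coefficient (assumed for illustration)
adversary_objective = np.mean(r * losses) - tau * kl
```

Gradient ascent on `adversary_objective` in $\phi$ (alternated with descent in $\theta$) pushes weight toward high-loss examples while the KL penalty keeps $rp$ close to $p$.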

3. DRIS in Deep Learning and Machine Learning

In deep learning, DRIS appears as instance reweighting schemes with theoretically sound origins. “Hardness-weighted sampling” is a notable instantiation, where at each SGD step, examples are sampled according to adversarially derived weights reflecting their recent losses, i.e.,

$$q^*_i = \frac{\exp(\beta\, \ell_i(\theta))}{\sum_{j=1}^n \exp(\beta\, \ell_j(\theta))}$$

with temperature $\beta$ tuning robustness vs. variance (Fidon et al., 2020). This is a closed-form solution to the DRO dual for KL-divergence.

Gradient estimates are corrected via importance sampling:

$$g^{(t)} = \frac{1}{b} \sum_{i \in \mathcal{I}} w_i \nabla_\theta \ell_i(\theta)$$

where $w_i$ is the importance weight, optionally clipped for variance control.
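One consistent way to combine the two displayed formulas can be sketched as follows (toy losses; the constants $n$, $b$, $\beta$ and the clipping bounds are assumed, and the exact bookkeeping in the cited paper may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: n examples, batch size b, temperature beta.
n, b, beta = 1000, 32, 2.0
losses = rng.uniform(0.0, 3.0, size=n)   # most recent per-example losses

# Hardness weights: softmax over recent losses (closed-form KL-DRO solution).
logits = beta * losses
q = np.exp(logits - logits.max())
q /= q.sum()

# Uniform mini-batch with importance weights w_i = n * q_i toward the
# adversarial law, clipped for variance control.
idx = rng.integers(0, n, size=b)
w = np.clip(n * q[idx], 0.0, 5.0)
# g = (1/b) * sum_i w[i] * grad_l_i(theta) would be the corrected gradient.
```

The softmax concentrates $q$ on hard examples; clipping trades a little bias for a bounded-variance gradient estimate.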

The method is computationally scalable and maintains convergence guarantees for over-parameterized neural networks, establishing theoretical and practical viability (Fidon et al., 2020).

4. DRIS for Rare-Event Simulation and Quantification

In the rare-event simulation context with model uncertainty, DRIS addresses the computation of worst-case probabilities under an ambiguity set—most tractably, a Wasserstein ball:

$$\mathcal{P} = \{ P : W_2(P, P_0) \leq \delta \}$$

Estimation of

$$p^* = \sup_{P \in \mathcal{P}} P(X \in A)$$

is recast using duality:

$$p^* = P_0(d(X, A) \leq u_*)$$

with $u_*$ defined implicitly as the root of $h(u) = \mathbb{E}_{P_0}[\, d(X, A)^2\, 1\{ d(X, A) \leq u \}\,] = \delta^2$ (Ahn et al., 4 Jan 2026).

DRIS constructs an estimator for $p^*$ by:

  1. Sampling efficiently near the critical set $\{ d(X, A) \leq u_* \}$ using a tailored proposal.
  2. Calculating likelihood ratios and the rare-event probability via importance sampling estimators.
  3. Employing root-finding to identify $u_*$.
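These steps can be illustrated on a one-dimensional toy problem (assumed setup: $X \sim N(0,1)$ under $P_0$, rare set $A = [a, \infty)$, so $d(X, A) = \max(a - X, 0)$); for brevity the sketch uses plain Monte Carlo in place of the paper's tailored proposal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy problem: a sets the rare threshold, delta the Wasserstein radius.
a, delta = 4.0, 0.1
x = rng.standard_normal(1_000_000)       # draws from P0
d = np.maximum(a - x, 0.0)               # d(X, A) for A = [a, inf)

def h(u):
    # Monte Carlo estimate of E_{P0}[d^2 * 1{d <= u}] minus delta^2.
    return np.mean(d ** 2 * (d <= u)) - delta ** 2

# h is nondecreasing in u, so bisection finds the root u_*.
lo, hi = 0.0, a
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if h(mid) < 0.0:
        lo = mid
    else:
        hi = mid
u_star = 0.5 * (lo + hi)

# Worst-case probability via the dual representation p* = P0(d <= u_*).
p_star = np.mean(d <= u_star)
```

The worst-case $p^*$ strictly exceeds the nominal probability $P_0(X \in A)$, reflecting the ambiguity radius $\delta$.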

A critical property established is “vanishing relative error”: as $p^* \to 0$, the variance of the estimator grows strictly slower than $(p^*)^2$, an efficiency property unavailable to standard techniques (Ahn et al., 4 Jan 2026).

5. Algorithmic Procedures and Complexity

DRIS implementations depend on the context but share common procedural elements:

  • Support generation: draw candidate support points $x_j \sim P_0$.
  • DRO/IS calibration: solve for adversarial weights $w^*$ via linear programming or neural parameterization, enforcing normalization and divergence constraints as penalties or hard constraints.
  • Sampling: draw evaluation points from the baseline or importance-optimized laws.
  • Importance weighting: compute the weights $w^*$ using the adversarial likelihood ratio or sampled weights.
  • Estimation or optimization: aggregate weighted losses/statistics.

For linear programs with $k$ support points and $m$ constraints, the complexity is $O(k^3 + k^2 m)$ per DRO calibration, with efficient sampling and estimation steps ($O(N)$ for $N$ IS replicates) (Bai et al., 2021).
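As an illustration of the LP calibration step, a toy problem with KS-style CDF-band constraints on sampled support points (the band width `eps` and statistic `h` are assumptions for the example, not values from the cited papers):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Assumed toy calibration: maximize E_P[h] over discrete laws p on k
# support points, subject to a KS band of width eps around the empirical CDF.
k, eps = 200, 0.05
x = np.sort(rng.standard_normal(k))      # support points drawn from P0
h = x ** 2                               # statistic to maximize

# Cumulative-sum matrix: (C @ p)[j] = F_P(x_j) on the sorted support.
C = np.tril(np.ones((k, k)))
F0 = np.arange(1, k + 1) / k             # empirical CDF of P0

# |F_P(x_j) - F0(x_j)| <= eps expressed as linear inequalities.
A_ub = np.vstack([C, -C])
b_ub = np.concatenate([F0 + eps, -(F0 - eps)])

res = linprog(-h, A_ub=A_ub, b_ub=b_ub,          # minimize -E_P[h]
              A_eq=np.ones((1, k)), b_eq=[1.0],  # sum(p) = 1
              bounds=[(0, None)] * k)            # p >= 0
worst_case_mean = -res.fun               # robust upper bound on E[h]
```

The optimal `res.x` plays the role of the adversarial weights $w^*$ (after rescaling by $k$); the robust value is never below the empirical mean of $h$, since the uniform law is feasible.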

Practical schemes (e.g., batch-level normalization in mini-batch SGD) introduce negligible computational overhead compared to standard training pipelines (Michel et al., 2022, Fidon et al., 2020).

6. Statistical and Robustness Guarantees

The statistical validity of DRIS relies on the calibration of ambiguity sets using data-driven procedures such as hypothesis testing—e.g., KS-balls for nonparametric uncertainty quantification (Bai et al., 2020, Bai et al., 2021). Key guarantees include:

  • Coverage: the ambiguity set contains the true law with probability at least $1 - \alpha$.
  • Validity: the robust expectation forms an upper confidence bound on the true mean/risk.
  • Asymptotic normality: the DRIS estimator is asymptotically normal under appropriate conditions.
  • Efficiency: vanishing relative error in rare-event scaling regimes (Ahn et al., 4 Jan 2026).

For deep learning, DRIS methods empirically yield higher worst-group accuracy (robust accuracy) on subpopulation-shift benchmarks, improve tail performance on rare classes, and maintain stability even under significant label noise (Michel et al., 2022, Fidon et al., 2020).

7. Empirical Performance and Applications

DRIS has been empirically validated across multiple settings:

  • Classification under subpopulation shift: consistent gains in robust accuracy over ERM, nonparametric DRO, and prior parametric approaches on benchmarks such as BiasedSST, Waterbirds, and CelebA (Michel et al., 2022).
  • Uncertainty quantification: informative, variance-reduced bounds for failure probabilities and design optimization in the NASA Langley UQ Challenge (Bai et al., 2020, Bai et al., 2021).
  • Rare-event simulation: orders-of-magnitude improvement in variance and efficiency over naive Monte Carlo and exponential twisting, while offering worst-case guarantees under model ambiguity (Ahn et al., 4 Jan 2026).
  • Deep learning (medical image segmentation, imbalanced/MNIST): improved performance on minority classes and underrepresented pathologies, with reduction in worst-case error quantiles and improved fairness across acquisition contexts (Fidon et al., 2020).

A notable finding is that the choice of mini-batch normalization or penalty for enforcing probability constraints interacts favorably with proposed stopping criteria and batch-size scaling; for batch sizes $\geq 16$, the procedure achieves full robustness gains (Michel et al., 2022).


References:

  • "Distributionally Robust Models with Parametric Likelihood Ratios" (Michel et al., 2022)
  • "A Distributionally Robust Optimization Approach to the NASA Langley Uncertainty Quantification Challenge" (Bai et al., 2020)
  • "Model Calibration via Distributionally Robust Optimization: On the NASA Langley Uncertainty Quantification Challenge" (Bai et al., 2021)
  • "Distributionally Robust Deep Learning using Hardness Weighted Sampling" (Fidon et al., 2020)
  • "Wasserstein Distributionally Robust Rare-Event Simulation" (Ahn et al., 4 Jan 2026)
