
Prototype-Guided DRO for Robust Few-Shot Learning

Updated 8 February 2026
  • The paper introduces PG-DRO, which constructs class-adaptive priors using hierarchical optimal transport to align few-shot supports with base prototypes.
  • Methodology integrates Sinkhorn DRO with Gaussian-mixture priors, enabling efficient and robust learning under distribution shifts.
  • Empirical results reveal improvements of 3–7% in average accuracy and enhanced worst-case performance across challenging few-shot and cross-domain tasks.

Prototype-Guided Distributionally Robust Optimization (PG-DRO) is a framework developed to address the challenge of robust generalization in few-shot learning under distribution shifts. Standard Sinkhorn Distributionally Robust Optimization (DRO) relies on a single, fixed reference distribution, which can limit adaptability, especially when per-class information is scarce and subject to shift. PG-DRO integrates hierarchical optimal transport (OT) to learn class-adaptive priors, termed "prototypes," from base data and incorporates them into a multi-prior Sinkhorn DRO formulation for efficient, theoretically grounded, class-specific robust optimization (Sun et al., 1 Feb 2026).

1. Construction of Class-Adaptive Priors via Hierarchical Optimal Transport

PG-DRO begins by leveraging a rich base dataset with $B$ classes, each represented by empirical means $\mu_b$ and covariances $\Sigma_b$. These serve as base prototypes. Given a few-shot support set $S_k$ from a novel class $c$ consisting of $N$ labeled instances $(\tilde x_n, \ell_n)$, the framework aligns each support example to the base classes through a two-level OT process:

  1. Sample-to-prototype soft-min cost: For each support example $\tilde x_n$ and base class $b$, compute the entropic soft-min cost:

$$C_{bn} = -\varepsilon_s \log \sum_{i=1}^{m_b} \exp\big(-c(\tilde x_n, x_{b,i})/\varepsilon_s\big)$$

where $c(\cdot,\cdot)$ is the ground cost (e.g., Euclidean distance), $x_{b,1},\dots,x_{b,m_b}$ are the samples of base class $b$, and $\varepsilon_s$ is a small sample-level temperature.

  2. Class-level entropic OT: Solve the entropically regularized OT problem to obtain a coupling $T^*$:

$$T^* = \arg\min_{T \geq 0}\, \langle C, T \rangle - \varepsilon_c H(T)$$

subject to $T \mathbf{1}_N = \alpha$ (uniform over the $B$ base classes) and $T^\top \mathbf{1}_B = \beta$ (empirical class-support counts), where $H(T) = -\sum_{b,n} T_{bn} \log T_{bn}$ is the entropy and $\varepsilon_c$ is the class-level entropic regularization strength.

  3. Prototype weights: For each target class $c$, mixture weights $\tilde w_{bc}$ are obtained by marginalizing $T^*$ over the support points with label $c$:

$$\tilde w_{bc} = \frac{\sum_{n:\, \ell_n = c} T^*_{bn}}{\sum_{b'} \sum_{n:\, \ell_n = c} T^*_{b'n}}$$

  4. Gaussian-mixture prior: These mixture weights define the class-adaptive prior:

$$\nu_c(x) = \sum_{b=1}^{B} \tilde w_{bc}\, \mathcal{N}(x;\, \mu_b, \Sigma_b)$$

which places the highest mass on the base prototypes most aligned with the few-shot support for class $c$.
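The prior-construction steps above can be sketched numerically. The following is a minimal illustration under simplifying assumptions not taken from the paper: a squared-Euclidean ground cost, a plain (non-log-domain) Sinkhorn loop, and 2D toy data; all function names are ours.

```python
import numpy as np

def softmin_cost(support, base_samples, eps_s=0.5):
    # Entropic soft-min cost C[n] from each support point to one base class's
    # samples, computed stably in the log domain (squared Euclidean ground cost).
    d = ((base_samples[:, None, :] - support[None, :, :]) ** 2).sum(-1)  # (m_b, N)
    z = -d / eps_s
    m = z.max(axis=0)
    return -eps_s * (m + np.log(np.exp(z - m).sum(axis=0)))             # (N,)

def sinkhorn(C, alpha, beta, eps_c=5.0, n_iter=300):
    # Entropic OT coupling T* with row marginals alpha and column marginals beta.
    K = np.exp(-C / eps_c)
    u = np.ones_like(alpha)
    for _ in range(n_iter):
        v = beta / (K.T @ u)
        u = alpha / (K @ v)
    return u[:, None] * K * v[None, :]

def class_prior_weights(T, labels, c):
    # Mixture weights w~_{bc}: marginalize T* over support points labeled c.
    mass = T[:, labels == c].sum(axis=1)
    return mass / mass.sum()

# Toy usage: B = 3 base classes, two novel classes with 3 support points each.
rng = np.random.default_rng(0)
base_means = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
base = [rng.normal(loc=mu, scale=0.5, size=(50, 2)) for mu in base_means]
support = np.concatenate([
    rng.normal(loc=[5.0, 0.0], scale=0.5, size=(3, 2)),   # novel class 0
    rng.normal(loc=[0.0, 5.0], scale=0.5, size=(3, 2)),   # novel class 1
])
labels = np.array([0, 0, 0, 1, 1, 1])

C = np.stack([softmin_cost(support, xb) for xb in base])   # (B, N) cost matrix
T = sinkhorn(C, alpha=np.full(3, 1 / 3), beta=np.full(6, 1 / 6))
w0 = class_prior_weights(T, labels, c=0)   # prior weights for novel class 0
w1 = class_prior_weights(T, labels, c=1)   # prior weights for novel class 1
# The prior ν_c is then the Gaussian mixture sum_b w[b] * N(x; μ_b, Σ_b).
```

As expected, each novel class's weight vector concentrates on the base prototype nearest its support points (base class 1 at $[5,0]$ for novel class 0, base class 2 at $[0,5]$ for novel class 1).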

2. Integration with Sinkhorn Distributionally Robust Optimization

Having constructed class-adaptive priors $\nu_c$, PG-DRO embeds them into the Sinkhorn DRO dual. The DRO objective considers distributions $P$ within an entropic-Wasserstein ball $W_\varepsilon(P, \hat\mu) \leq \rho$ centered at the empirical distribution $\hat\mu$:

$$\min_\theta\, \sup_{P:\, W_\varepsilon(P, \hat\mu) \leq \rho}\, \mathbb{E}_{(x,y) \sim P}\, \ell_\theta(x, y)$$

Using duality, this reduces to a one-dimensional convex minimization for each class $c$:

$$V_c(x; \theta) = \min_{\lambda_c \geq 0} \left\{ \lambda_c \rho + \lambda_c \varepsilon \log \mathbb{E}_{y \sim Q^{\nu_c}_{\varepsilon,x}} \left[ \exp\!\left( \frac{f_c(y;\theta)}{\lambda_c \varepsilon} \right) \right] \right\}$$

where $Q^{\nu_c}_{\varepsilon, x}$ is the Gibbs kernel:

$$dQ^{\nu_c}_{\varepsilon,x}(y) = \frac{\exp\big(-c(x,y)/\varepsilon\big)\, d\nu_c(y)}{Z_{\nu_c}(x)}$$

and $Z_{\nu_c}(x)$ is the corresponding partition function.

Prediction selects the class with the highest robust score:

$$\hat h(x) = \arg\max_c V_c(x; \theta)$$

and training minimizes the cross-entropy loss over the robust class scores $V_c$:

$$L_{\mathrm{CE}}(x, c^*) = -\log \frac{\exp[V_{c^*}(x; \theta)]}{\sum_c \exp[V_c(x; \theta)]}$$
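To make the dual concrete, here is a minimal Monte Carlo sketch (our own illustration, not the paper's implementation): samples are drawn from the mixture prior $\nu_c$, self-normalized by the Gibbs weights, and the 1D convex dual is minimized by ternary search (the paper uses Newton or bisection). Identity covariances and linear class scores $f_c$ are simplifying assumptions.

```python
import numpy as np

def logsumexp(z):
    # Numerically stable log-sum-exp.
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def robust_score(x, f_c, mix_means, mix_weights, rho=0.1, eps=1.0,
                 n_samples=4000, seed=0):
    # Monte Carlo estimate of V_c(x; θ): sample y ~ ν_c (identity covariances
    # assumed), weight by the Gibbs kernel, then minimize the dual in λ.
    rng = np.random.default_rng(seed)
    comps = rng.choice(len(mix_weights), size=n_samples, p=mix_weights)
    y = mix_means[comps] + rng.normal(size=(n_samples, x.shape[0]))
    logq = -((y - x) ** 2).sum(axis=1) / eps      # Gibbs weights (unnormalized)
    logq -= logsumexp(logq)                       # self-normalize
    f = f_c(y)

    def phi(lam):                                 # dual objective, convex in λ
        return lam * rho + lam * eps * logsumexp(f / (lam * eps) + logq)

    lo, hi = 1e-3, 50.0                           # ternary search on a bracket
    for _ in range(100):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if phi(m1) < phi(m2):
            hi = m2
        else:
            lo = m1
    return phi(0.5 * (lo + hi))

# Toy usage: two classes with single-component priors and linear scores.
x = np.array([2.0, 0.0])
V0 = robust_score(x, lambda y: y @ np.array([1.0, 0.0]),
                  np.array([[2.0, 0.0]]), np.array([1.0]))
V1 = robust_score(x, lambda y: y @ np.array([-1.0, 0.0]),
                  np.array([[-2.0, 0.0]]), np.array([1.0]))
scores = np.array([V0, V1])
p = np.exp(scores - logsumexp(scores))   # robust softmax
loss = -np.log(p[0])                     # cross-entropy for true class c* = 0
```

Since $x$ sits on class 0's prior, $V_0 > V_1$ and the robust softmax favors class 0; the ternary search is justified because the dual objective is convex in $\lambda$ (a perspective of log-sum-exp).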

3. Theoretical Guarantees

PG-DRO establishes several theoretical properties:

  • Convexity and Uniqueness: For any fixed prior $\nu$, the DRO dual objective $\phi(\lambda) = \lambda \rho + \lambda \varepsilon\, \mathbb{E}_x \log \mathbb{E}_{y \sim Q^{\nu}_{\varepsilon,x}}[\exp(\ell/(\lambda\varepsilon))]$ is strictly convex in $\lambda > 0$ and has a unique minimizer. By extension, the sum over classes $\sum_{c=1}^{C} \phi_c(\lambda_c)$ retains convexity and admits a unique minimizer in each $\lambda_c$.
  • Contraction Property under Adaptive OT: Denoting the mixture prior at iteration $t$ by $\nu^{(t)}$ and the population fixed point by $\nu^*$, the deviation $\Delta_t(\theta) = |V_{\nu^{(t)}}(\theta) - V_{\nu^*}(\theta)|$ contracts linearly under a locally contractive OT map and sufficiently accurate support-count estimates:

$$\Delta_{t+1} \leq (1 - \eta_t \kappa)\, \Delta_t + O(N^{-1/2})$$

implying convergence as $t \to \infty$ and $N \to \infty$.

  • Consistency: Under continuity and Lipschitz conditions, if the class-adaptive priors $\nu_c^{(N)}$ converge weakly to the population priors $\nu_c^*$ as $N \to \infty$, then the robust class-score functions satisfy $V_c^{(N)}(x;\theta) \to V_c^*(x;\theta)$, and the corresponding dual parameters also converge. This ensures consistency of the loss and its minimizers.
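For a constant step size $\eta_t \equiv \eta$, unrolling the contraction recursion above (a routine geometric-series calculation, not stated explicitly in the source) makes the convergence rate explicit:

```latex
\Delta_T \;\le\; (1-\eta\kappa)^T \,\Delta_0 \;+\; \frac{C}{\eta\kappa}\, N^{-1/2}
```

where $C$ absorbs the constant hidden in the $O(N^{-1/2})$ term: the first term decays geometrically in the iteration count $T$, and the second vanishes as the support size $N$ grows.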

4. Algorithmic Structure and Training Workflow

Training in PG-DRO consists of two main phases:

Phase I – Prototype-Guided Prior Construction

  1. Compute $(\mu_b, \Sigma_b)$ for each base class $b$.
  2. For each support point and base class, calculate the sample-to-prototype soft-min costs $C_{bn}$.
  3. Solve the entropic OT problem to produce the coupling $T^*$.
  4. Derive the mixture weights $\tilde w_{bc}$ and build the class-wise priors $\nu_c$ as Gaussian mixtures.

Phase II – Robust Training

  • For each minibatch $(x_i, c_i^*)$ and each class $c$:

    • Construct the Gibbs kernel $Q^{\nu_c}_{\varepsilon, x_i}$.
    • Solve the 1D convex dual subproblem with Newton's method or bisection to obtain $(\lambda_c, V_c(x_i))$.
    • Compute the batch cross-entropy loss and update the model parameters:

$$\theta \leftarrow \theta - \eta \nabla_\theta L$$

A summary pseudocode appears below:

Algorithm PG-DRO
Input: D_base, support S, model f_θ, cost c, ambiguity radius ρ, parameters ε_s, ε_c.
Phase I – Priors: Compute mean/covariances for base classes, align supports via OT, extract T*, generate ν_c.
Phase II – Robust Training: For minibatch, per sample/class:
    Build Q^{ν_c}_{ε,x}, solve dual for V_c(x), compute cross-entropy, update θ.

5. Empirical Results and Benchmarks

PG-DRO's empirical performance was evaluated on multiple tasks:

  • Synthetic Gaussian Toy ($K=8$, $d=10$): Target classes were subject to mean shift, rotation, and scaling. Baseline comparisons included ERM and classical OT adaptation, using metrics of average accuracy and worst-10% class accuracy. PG-DRO improved average accuracy by 3–7% and worst-case accuracy by 2–10% under distribution shifts.
  • CIFAR-100 to CIFAR-10 Transfer: Using a ResNet-18 backbone pretrained on CIFAR-100, PG-DRO was benchmarked on few-shot supports ($S \in \{1,5,10\}$) from CIFAR-10, against baselines including SAA, W-DRO, and plain few-shot fine-tuning. Robustness was tested via Laplace/Gaussian noise on features. PG-DRO consistently outperformed baselines, with gains of 1–3% over W-DRO and 2–5% over SAA, particularly under higher noise and fewer shots.
  • Cross-Domain Few-Shot (mini/tiered-ImageNet to CIFAR-100): Under similar setups and additive noise, PG-DRO improved performance by up to 5% at $k=10$ and was especially beneficial for the most challenging classes.
  • Computational Overhead: Phase I requires a single entropic OT solve of size $B \times N$ (with $B \approx 100$–$1000$, $N \approx 5$–$10$, and $100$–$200$ Sinkhorn iterations). Phase II robust logits incur $C$ convex solves per sample (Newton: 5–8 iterations). Total training cost is at most double that of standard cross-entropy.

6. Connections and Significance

PG-DRO unifies hierarchical OT and Sinkhorn DRO in a class-adaptive framework that enables organic integration of few-shot evidence and transferable knowledge for robust generalization. By constructing class-wise Gaussian-mixture priors, the method dynamically tailors the ambiguity set for each class, overcoming rigidity imposed by fixed reference distributions in earlier DRO approaches. The framework provides both computational tractability and rigorous statistical guarantees (contraction, convexity, and consistency), with robust empirical advantages, especially in few-shot and highly shifted scenarios (Sun et al., 1 Feb 2026).

A plausible implication is that methods leveraging adaptive OT priors may offer broader applicability in foundation model adaptation, continual learning, and other paradigms requiring principled handling of distribution shift with limited supervision.
