Prototype-Guided DRO for Robust Few-Shot Learning
- The paper introduces PG-DRO that constructs class-adaptive priors using hierarchical optimal transport to align few-shot supports with base prototypes.
- Methodology integrates Sinkhorn DRO with Gaussian-mixture priors, enabling efficient and robust learning under distribution shifts.
- Empirical results reveal improvements of 3–7% in average accuracy and enhanced worst-case performance across challenging few-shot and cross-domain tasks.
Prototype-Guided Distributionally Robust Optimization (PG-DRO) is a framework developed to address the challenge of robust generalization in few-shot learning under distribution shifts. Standard Sinkhorn Distributionally Robust Optimization (DRO) relies on a single, fixed reference distribution, which can limit adaptability, especially when per-class information is scarce and subject to shift. PG-DRO integrates hierarchical optimal transport (OT) to learn class-adaptive priors, termed "prototypes," from base data and incorporates them into a multi-prior Sinkhorn DRO formulation for efficient, theoretically grounded, class-specific robust optimization (Sun et al., 1 Feb 2026).
1. Construction of Class-Adaptive Priors via Hierarchical Optimal Transport
PG-DRO begins by leveraging a rich base dataset with $K$ classes, each represented by an empirical mean $\mu_k$ and covariance $\Sigma_k$. These $(\mu_k, \Sigma_k)$ serve as base prototypes. Given a few-shot support set $S = \{(x_i, y_i)\}_{i=1}^{n}$ of $n$ labeled instances from the novel classes, the framework aligns each support example to the base classes through a two-level OT process:
- Sample-to-prototype soft-min cost: For each support example $x_i$ and base class $k$, compute the entropic soft-min cost
$$C_{i,k} = -\varepsilon_s \log \mathbb{E}_{z \sim \mathcal{N}(\mu_k, \Sigma_k)}\big[\exp\big(-c(x_i, z)/\varepsilon_s\big)\big],$$
where $c$ is the ground cost (e.g., Euclidean distance) and $\varepsilon_s$ is a small sample-level temperature.
- Class-level entropic OT: Solve the entropic regularized OT problem to obtain a coupling $T^*$:
$$T^{*} = \arg\min_{T} \sum_{i,k} T_{i,k}\, C_{i,k} + \varepsilon_c \sum_{i,k} T_{i,k}\big(\log T_{i,k} - 1\big),$$
subject to $T^{\top}\mathbf{1} = \tfrac{1}{K}\mathbf{1}$ (uniform over $[K]$), $T\mathbf{1} = b$ (empirical class-support counts), and $T \ge 0$; $\varepsilon_c$ is the class-level entropic regularization.
- Prototype weights: For each target class $c$, mixture weights $w_{c,k}$ are obtained by marginalizing $T^*$ over support points with label $c$:
$$w_{c,k} \propto \sum_{i\,:\,y_i = c} T^{*}_{i,k}, \qquad \sum_{k=1}^{K} w_{c,k} = 1.$$
- Gaussian-mixture prior: These mixture weights define the class-adaptive prior
$$\nu_c = \sum_{k=1}^{K} w_{c,k}\,\mathcal{N}(\mu_k, \Sigma_k),$$
which places highest mass on the base prototypes most aligned with the few-shot support for class $c$.
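The two-level alignment above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the Monte Carlo soft-min, squared-Euclidean cost, uniform marginals, toy data, and all helper names are assumptions of the sketch.

```python
import numpy as np

def logsumexp(v):
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def softmin_costs(support, base_samples, eps_s):
    """C[i, k] = -eps_s * log mean_z exp(-||x_i - z||^2 / eps_s), z drawn from class k."""
    C = np.empty((len(support), len(base_samples)))
    for i, x in enumerate(support):
        for k, Z in enumerate(base_samples):
            c = ((Z - x) ** 2).sum(axis=1)   # squared-Euclidean ground cost (assumption)
            C[i, k] = -eps_s * (logsumexp(-c / eps_s) - np.log(len(Z)))
    return C

def sinkhorn(C, a, b, eps_c, iters=300):
    """Entropic OT coupling T with support-side marginal a and prototype-side marginal b."""
    G = np.exp(-C / eps_c)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(iters):
        u = a / (G @ v)
        v = b / (G.T @ u)
    return u[:, None] * G * v[None, :]

def class_prior_weights(T, labels, c):
    """w_{c,k}: marginalize the coupling over support points with label c, then normalize."""
    w = T[labels == c].sum(axis=0)
    return w / w.sum()

# Toy setup: 4 base classes along a line, two novel classes near the extremes.
rng = np.random.default_rng(0)
base_samples = [rng.normal(loc=3.0 * k, scale=1.0, size=(50, 2)) for k in range(4)]
support = np.vstack([rng.normal(0.0, 1.0, (3, 2)),     # novel class 0, near base class 0
                     rng.normal(9.0, 1.0, (3, 2))])    # novel class 1, near base class 3
labels = np.array([0, 0, 0, 1, 1, 1])

C = softmin_costs(support, base_samples, eps_s=0.5)
T = sinkhorn(C, np.full(6, 1 / 6), np.full(4, 1 / 4), eps_c=1.0)
w0 = class_prior_weights(T, labels, c=0)               # mixture weights of nu_0
w1 = class_prior_weights(T, labels, c=1)               # mixture weights of nu_1
```

In this toy run, `w0` and `w1` each concentrate on the base prototypes nearest their own supports, which is the behavior the class-adaptive prior is designed to produce.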
2. Integration with Sinkhorn Distributionally Robust Optimization
Having constructed class-adaptive priors $\nu_c$, PG-DRO embeds them into the Sinkhorn DRO dual. The DRO objective considers distributions $Q$ within an entropic-Wasserstein ball of radius $\rho$ centered at the empirical data $\hat{P}$:
$$\sup_{Q:\, W_{\varepsilon}(Q, \hat{P}) \le \rho} \mathbb{E}_{Q}\big[\ell(f_\theta(z), y)\big].$$
Using duality, this reduces to a convex minimization for each class $c$:
$$V_c(x) = \min_{\lambda \ge 0} \Big\{ \lambda \rho + \lambda \varepsilon \log \mathbb{E}_{z \sim Q^{\nu_c}_{\varepsilon,x}}\big[\exp\big(\ell_c(f_\theta(z)) / (\lambda \varepsilon)\big)\big] \Big\},$$
where $Q^{\nu_c}_{\varepsilon,x}$ is the Gibbs kernel
$$Q^{\nu_c}_{\varepsilon,x}(\mathrm{d}z) = \frac{\exp(-c(x,z)/\varepsilon)\,\nu_c(\mathrm{d}z)}{Z_{\varepsilon}(x)},$$
and $Z_{\varepsilon}(x) = \int \exp(-c(x,z)/\varepsilon)\,\nu_c(\mathrm{d}z)$ is the partition function.
Prediction uses the robust softmax
$$p_c(x) = \frac{\exp(V_c(x))}{\sum_{c'} \exp(V_{c'}(x))},$$
and the cross-entropy loss over the robust class-scores $V_c(x)$:
$$\mathcal{L}(\theta) = -\frac{1}{|B|} \sum_{(x, y) \in B} \log p_{y}(x).$$
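As a sketch of this section, the per-class dual solve and robust softmax can be approximated with self-normalized importance sampling on draws from $\nu_c$ and a simple grid search over the dual variable $\lambda$. The squared-Euclidean cost, the toy priors and per-class losses, and the $\lambda$-grid are all illustrative assumptions, not the paper's solver.

```python
import numpy as np

def robust_score(x, prior_samples, loss_vals, eps, rho, lam_grid):
    """V_c(x) ~ min_lam  lam*rho + lam*eps*log E_{Q^{nu_c}_{eps,x}}[exp(loss/(lam*eps))]."""
    cost = ((prior_samples - x) ** 2).sum(axis=1)      # ground cost c(x, z_j) (assumption)
    logw = -cost / eps
    logw -= logw.max()
    w = np.exp(logw)
    w /= w.sum()                                       # self-normalized Gibbs-kernel weights
    best = np.inf
    for lam in lam_grid:                               # grid search in place of Newton
        t = loss_vals / (lam * eps)
        m = t.max()
        log_mgf = m + np.log((w * np.exp(t - m)).sum())  # log E_Q[exp(loss/(lam*eps))]
        best = min(best, lam * rho + lam * eps * log_mgf)
    return best

def robust_softmax(V):
    """p_c(x) = exp(V_c) / sum_c' exp(V_c')."""
    e = np.exp(V - np.max(V))
    return e / e.sum()

rng = np.random.default_rng(1)
x = np.zeros(2)
grid = np.geomspace(1e-2, 1e2, 60)
V = np.empty(3)
for c in range(3):
    z = rng.normal(loc=float(c), scale=1.0, size=(500, 2))  # draws from a toy nu_c
    ell = np.abs(z[:, 0] - c)                               # toy per-class loss l_c(z)
    V[c] = robust_score(x, z, ell, eps=1.0, rho=0.1, lam_grid=grid)
p = robust_softmax(V)
```

Because the dual objective upper-bounds the Gibbs-kernel expectation (by Jensen's inequality) and is pointwise increasing in $\rho$, the returned score grows with the ambiguity radius, which the sketch makes easy to verify numerically.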
3. Theoretical Guarantees
PG-DRO establishes several theoretical properties:
- Convexity and Uniqueness: For any fixed prior $\nu_c$, the DRO dual objective is strictly convex in the dual variable $\lambda$ and has a unique minimizer. By extension, the sum over classes retains convexity, with a unique minimizer for each $c$.
- Contraction Property under Adaptive OT: Denoting the mixture prior at iteration $t$ as $\nu_c^{(t)}$ and the population fixed point as $\nu_c^{*}$, the deviation contracts linearly under a locally contractive OT map and sufficiently accurate support-count estimates:
$$\big\|\nu_c^{(t+1)} - \nu_c^{*}\big\| \le \gamma\,\big\|\nu_c^{(t)} - \nu_c^{*}\big\| + \delta_n, \qquad \gamma < 1,$$
implying convergence as $t \to \infty$ and $\delta_n \to 0$.
- Consistency: Under continuity and Lipschitz conditions, if the class-adaptive priors $\nu_c$ converge weakly to the population priors as $n \to \infty$, then the robust class-score functions $V_c$ and the corresponding dual parameters $\lambda_c$ also converge. This ensures consistency of the loss and its minimizers.
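The contraction property has the standard consequence that the prior error settles into a neighborhood of radius $\delta/(1-\gamma)$ around the fixed point. A tiny numeric check, with illustrative values of $\gamma$ and $\delta$:

```python
# Iterating e_{t+1} = gamma * e_t + delta with gamma < 1 drives the deviation
# toward the fixed point delta / (1 - gamma); gamma and delta are illustrative.
gamma, delta = 0.6, 0.01
e = 1.0                      # initial deviation ||nu^(0) - nu*||
history = [e]
for _ in range(100):
    e = gamma * e + delta
    history.append(e)
limit = delta / (1 - gamma)  # radius of the limiting neighborhood
```

After 100 iterations the deviation is numerically indistinguishable from the limit, matching the claim that accuracy is governed jointly by the contraction factor and the support-count error.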
4. Algorithmic Structure and Training Workflow
Training in PG-DRO consists of two main phases:
Phase I – Prototype-Guided Priors Construction
- Compute $(\mu_k, \Sigma_k)$ for each base class $k$.
- For each support point and class, calculate sample-to-prototype soft-min costs $C_{i,k}$.
- Solve the entropic OT problem to produce the coupling $T^*$.
- Derive mixture weights $w_{c,k}$ and build class-wise priors $\nu_c$ as Gaussian mixtures.
Phase II – Robust Training
- For each minibatch $B$:
- For each $x \in B$ and class $c$:
- Construct the Gibbs kernel $Q^{\nu_c}_{\varepsilon,x}$.
- Solve the 1D convex dual subproblem using Newton or bisection methods for $V_c(x)$.
- Compute the batch cross-entropy loss and update the model parameters: $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}(\theta)$.
A summary pseudocode appears below:
Algorithm PG-DRO
Input: D_base, support S, model f_θ, cost c, ambiguity radius ρ, parameters ε_s, ε_c.
Phase I – Priors: Compute mean/covariances for base classes, align supports via OT, extract T*, generate ν_c.
Phase II – Robust Training: For minibatch, per sample/class:
Build Q^{ν_c}_{ε,x}, solve dual for V_c(x), compute cross-entropy, update θ.
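The 1D convex dual subproblem in Phase II can be sketched as follows, using bisection on a finite-difference derivative in place of Newton. The fixed Gibbs weights `w` and sampled losses `ell` are illustrative stand-ins for the quantities computed per sample and class; the bracket `[lo, hi]` is an assumed search interval.

```python
import numpy as np

def dual_objective(lam, w, ell, eps, rho):
    """phi(lam) = lam*rho + lam*eps*log E_Q[exp(ell/(lam*eps))], convex in lam."""
    t = ell / (lam * eps)
    m = t.max()                                   # max-shift for numerical stability
    return lam * rho + lam * eps * (m + np.log((w * np.exp(t - m)).sum()))

def solve_dual(w, ell, eps, rho, lo=1e-3, hi=1e3, iters=80, h=1e-6):
    """Bisection on phi'(lam); a Newton variant would also use the second derivative."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        grad = (dual_objective(mid + h, w, ell, eps, rho)
                - dual_objective(mid - h, w, ell, eps, rho)) / (2 * h)
        if grad > 0:
            hi = mid          # minimizer lies to the left of mid
        else:
            lo = mid
    lam = 0.5 * (lo + hi)
    return lam, dual_objective(lam, w, ell, eps, rho)

w = np.full(4, 0.25)                      # Gibbs-kernel weights (illustrative)
ell = np.array([0.1, 0.5, 1.0, 2.0])      # losses at sampled kernel points (illustrative)
lam, V = solve_dual(w, ell, eps=1.0, rho=0.5)
```

The returned score always dominates the plain kernel-weighted loss (Jensen's inequality), consistent with $V_c(x)$ being a worst-case, rather than average-case, class score.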
5. Empirical Results and Benchmarks
PG-DRO's empirical performance was evaluated on multiple tasks:
- Synthetic Gaussian Toy (K=8, d=10): Target classes were subject to mean shift, rotation, and scaling. Baseline comparisons included ERM and classical OT adaptation, using metrics of average accuracy and worst-10% class accuracy. PG-DRO improved average accuracy by 3–7% and worst-case by 2–10% under distribution shifts.
- CIFAR-100 to CIFAR-10 Transfer: Using a ResNet-18 backbone pretrained on CIFAR-100, PG-DRO was benchmarked on few-shot supports from CIFAR-10 against baselines including SAA, W-DRO, and plain few-shot fine-tuning. Robustness was tested via Laplace/Gaussian noise on features. PG-DRO consistently outperformed baselines, with gains of 1–3% over W-DRO and 2–5% over SAA, particularly under higher noise and fewer shots.
- Cross-Domain Few-Shot (mini/tiered-ImageNet to CIFAR-100): Under similar setups and additive noise, PG-DRO improved performance by up to 5% and was especially beneficial for the most challenging classes.
- Computational Overhead: Phase I requires a single entropic OT solve of size $n \times K$ (with $n$ up to $1000$, $K$ up to $10$, and $100$–$200$ Sinkhorn iterations). Phase II's robust logits incur one 1D convex solve per sample and class (Newton: 5–8 iterations). Total training cost is at most double that of standard cross-entropy.
6. Connections and Significance
PG-DRO unifies hierarchical OT and Sinkhorn DRO in a class-adaptive framework that enables organic integration of few-shot evidence and transferable knowledge for robust generalization. By constructing class-wise Gaussian-mixture priors, the method dynamically tailors the ambiguity set for each class, overcoming rigidity imposed by fixed reference distributions in earlier DRO approaches. The framework provides both computational tractability and rigorous statistical guarantees (contraction, convexity, and consistency), with robust empirical advantages, especially in few-shot and highly shifted scenarios (Sun et al., 1 Feb 2026).
A plausible implication is that methods leveraging adaptive OT priors may offer broader applicability in foundation model adaptation, continual learning, and other paradigms requiring principled handling of distribution shift with limited supervision.