
Prototype-Guided DRO for Robust Few-Shot Learning

Updated 8 February 2026
  • The paper introduces PG-DRO, which constructs class-adaptive priors using hierarchical optimal transport to align few-shot supports with base prototypes.
  • Methodology integrates Sinkhorn DRO with Gaussian-mixture priors, enabling efficient and robust learning under distribution shifts.
  • Empirical results reveal improvements of 3–7% in average accuracy and enhanced worst-case performance across challenging few-shot and cross-domain tasks.

Prototype-Guided Distributionally Robust Optimization (PG-DRO) is a framework developed to address the challenge of robust generalization in few-shot learning under distribution shifts. Standard Sinkhorn Distributionally Robust Optimization (DRO) relies on a single, fixed reference distribution, which can limit adaptability, especially when per-class information is scarce and subject to shift. PG-DRO integrates hierarchical optimal transport (OT) to learn class-adaptive priors, termed "prototypes," from base data and incorporates them into a multi-prior Sinkhorn DRO formulation for efficient, theoretically grounded, class-specific robust optimization (Sun et al., 1 Feb 2026).

1. Construction of Class-Adaptive Priors via Hierarchical Optimal Transport

PG-DRO begins by leveraging a rich base dataset with $B$ classes, each represented by empirical means $\mu_b$ and covariances $\Sigma_b$. These serve as base prototypes. Given a few-shot support set $S_k$ from a novel class $c$ consisting of $N$ labeled instances $(\tilde x_n, \ell_n)$, the framework aligns each support example to the base classes through a two-level OT process:

  1. Sample-to-prototype soft-min cost: For each support example $\tilde x_n$ and base class $b$, compute the entropic soft-min cost:

$$C_{bn} = -\varepsilon_s \log \sum_{i=1}^{m_b} \exp\big(-c(\tilde x_n, x_{b,i})/\varepsilon_s\big)$$

where $c(\cdot,\cdot)$ is the ground cost (e.g., Euclidean distance), $x_{b,1},\dots,x_{b,m_b}$ are the samples of base class $b$, and $\varepsilon_s$ is a small sample-level temperature.

  2. Class-level entropic OT: Solve the entropically regularized OT problem to obtain a coupling $T^*$:

$$T^* = \arg\min_{T \geq 0}\, \langle C, T \rangle - \varepsilon_c H(T)$$

subject to $T \mathbf{1}_N = \alpha$ (uniform over the $B$ base classes) and $T^\top \mathbf{1}_B = \beta$ (empirical class-support counts), where $H(T) = -\sum_{b,n} T_{bn} \log T_{bn}$ is the entropy and $\varepsilon_c$ is the class-level entropic regularization strength.

  3. Prototype weights: For each target class $c$, mixture weights $\tilde w_{bc}$ are obtained by marginalizing $T^*$ over the support points with label $c$:

$$\tilde w_{bc} = \frac{\sum_{n:\, \ell_n = c} T^*_{bn}}{\sum_{b'} \sum_{n:\, \ell_n = c} T^*_{b'n}}$$

  4. Gaussian-mixture prior: These mixture weights define the class-adaptive prior:

$$\nu_c(x) = \sum_{b=1}^{B} \tilde w_{bc}\, \mathcal{N}(x;\, \mu_b, \Sigma_b)$$

which places the highest mass on the base prototypes most aligned with the few-shot support for class $c$.
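The prior-construction steps above can be sketched numerically. The following is a minimal illustration under simplifying assumptions not taken from the paper: a squared-Euclidean ground cost, a plain (non-log-domain) Sinkhorn loop, and 2D toy data; all function names are ours.

```python
import numpy as np

def softmin_cost(support, base_samples, eps_s=0.5):
    # Entropic soft-min cost C[n] from each support point to one base class's
    # samples, computed stably in the log domain (squared Euclidean ground cost).
    d = ((base_samples[:, None, :] - support[None, :, :]) ** 2).sum(-1)  # (m_b, N)
    z = -d / eps_s
    m = z.max(axis=0)
    return -eps_s * (m + np.log(np.exp(z - m).sum(axis=0)))             # (N,)

def sinkhorn(C, alpha, beta, eps_c=5.0, n_iter=300):
    # Entropic OT coupling T* with row marginals alpha and column marginals beta.
    K = np.exp(-C / eps_c)
    u = np.ones_like(alpha)
    for _ in range(n_iter):
        v = beta / (K.T @ u)
        u = alpha / (K @ v)
    return u[:, None] * K * v[None, :]

def class_prior_weights(T, labels, c):
    # Mixture weights w~_{bc}: marginalize T* over support points labeled c.
    mass = T[:, labels == c].sum(axis=1)
    return mass / mass.sum()

# Toy usage: B = 3 base classes, two novel classes with 3 support points each.
rng = np.random.default_rng(0)
base_means = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
base = [rng.normal(loc=mu, scale=0.5, size=(50, 2)) for mu in base_means]
support = np.concatenate([
    rng.normal(loc=[5.0, 0.0], scale=0.5, size=(3, 2)),   # novel class 0
    rng.normal(loc=[0.0, 5.0], scale=0.5, size=(3, 2)),   # novel class 1
])
labels = np.array([0, 0, 0, 1, 1, 1])

C = np.stack([softmin_cost(support, xb) for xb in base])   # (B, N) cost matrix
T = sinkhorn(C, alpha=np.full(3, 1 / 3), beta=np.full(6, 1 / 6))
w0 = class_prior_weights(T, labels, c=0)   # prior weights for novel class 0
w1 = class_prior_weights(T, labels, c=1)   # prior weights for novel class 1
# The prior ν_c is then the Gaussian mixture sum_b w[b] * N(x; μ_b, Σ_b).
```

As expected, each novel class's weight vector concentrates on the base prototype nearest its support points (base class 1 at $[5,0]$ for novel class 0, base class 2 at $[0,5]$ for novel class 1).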

2. Integration with Sinkhorn Distributionally Robust Optimization

Having constructed class-adaptive priors $\nu_c$, PG-DRO embeds them into the Sinkhorn DRO dual. The DRO objective considers distributions $P$ within an entropic-Wasserstein ball $W_\varepsilon(P, \hat\mu) \leq \rho$ centered at the empirical distribution $\hat\mu$:

$$\min_\theta\, \sup_{P:\, W_\varepsilon(P, \hat\mu) \leq \rho}\, \mathbb{E}_{(x,y) \sim P}\, \ell_\theta(x, y)$$

Using duality, this reduces to a one-dimensional convex minimization for each class $c$:

$$V_c(x; \theta) = \min_{\lambda_c \geq 0} \left\{ \lambda_c \rho + \lambda_c \varepsilon \log \mathbb{E}_{y \sim Q^{\nu_c}_{\varepsilon,x}} \left[ \exp\!\left( \frac{f_c(y;\theta)}{\lambda_c \varepsilon} \right) \right] \right\}$$

where $Q^{\nu_c}_{\varepsilon, x}$ is the Gibbs kernel:

$$dQ^{\nu_c}_{\varepsilon,x}(y) = \frac{\exp\big(-c(x,y)/\varepsilon\big)\, d\nu_c(y)}{Z_{\nu_c}(x)}$$

and $Z_{\nu_c}(x)$ is the corresponding partition function.

Prediction selects the class with the highest robust score:

$$\hat h(x) = \arg\max_c V_c(x; \theta)$$

and training minimizes the cross-entropy loss over the robust class scores $V_c$:

$$L_{\mathrm{CE}}(x, c^*) = -\log \frac{\exp[V_{c^*}(x; \theta)]}{\sum_c \exp[V_c(x; \theta)]}$$
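To make the dual concrete, here is a minimal Monte Carlo sketch (our own illustration, not the paper's implementation): samples are drawn from the mixture prior $\nu_c$, self-normalized by the Gibbs weights, and the 1D convex dual is minimized by ternary search (the paper uses Newton or bisection). Identity covariances and linear class scores $f_c$ are simplifying assumptions.

```python
import numpy as np

def logsumexp(z):
    # Numerically stable log-sum-exp.
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def robust_score(x, f_c, mix_means, mix_weights, rho=0.1, eps=1.0,
                 n_samples=4000, seed=0):
    # Monte Carlo estimate of V_c(x; θ): sample y ~ ν_c (identity covariances
    # assumed), weight by the Gibbs kernel, then minimize the dual in λ.
    rng = np.random.default_rng(seed)
    comps = rng.choice(len(mix_weights), size=n_samples, p=mix_weights)
    y = mix_means[comps] + rng.normal(size=(n_samples, x.shape[0]))
    logq = -((y - x) ** 2).sum(axis=1) / eps      # Gibbs weights (unnormalized)
    logq -= logsumexp(logq)                       # self-normalize
    f = f_c(y)

    def phi(lam):                                 # dual objective, convex in λ
        return lam * rho + lam * eps * logsumexp(f / (lam * eps) + logq)

    lo, hi = 1e-3, 50.0                           # ternary search on a bracket
    for _ in range(100):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if phi(m1) < phi(m2):
            hi = m2
        else:
            lo = m1
    return phi(0.5 * (lo + hi))

# Toy usage: two classes with single-component priors and linear scores.
x = np.array([2.0, 0.0])
V0 = robust_score(x, lambda y: y @ np.array([1.0, 0.0]),
                  np.array([[2.0, 0.0]]), np.array([1.0]))
V1 = robust_score(x, lambda y: y @ np.array([-1.0, 0.0]),
                  np.array([[-2.0, 0.0]]), np.array([1.0]))
scores = np.array([V0, V1])
p = np.exp(scores - logsumexp(scores))   # robust softmax
loss = -np.log(p[0])                     # cross-entropy for true class c* = 0
```

Since $x$ sits on class 0's prior, $V_0 > V_1$ and the robust softmax favors class 0; the ternary search is justified because the dual objective is convex in $\lambda$ (a perspective of log-sum-exp).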

3. Theoretical Guarantees

PG-DRO establishes several theoretical properties:

  • Convexity and Uniqueness: For any fixed prior $\nu$, the DRO dual objective $\phi(\lambda) = \lambda \rho + \lambda \varepsilon\, \mathbb{E}_x \log \mathbb{E}_{y \sim Q^{\nu}_{\varepsilon,x}}[\exp(\ell/(\lambda\varepsilon))]$ is strictly convex in $\lambda > 0$ and has a unique minimizer. By extension, the sum over classes $\sum_{c=1}^{C} \phi_c(\lambda_c)$ retains convexity and admits a unique minimizer in each $\lambda_c$.
  • Contraction Property under Adaptive OT: Denoting the mixture prior at iteration $t$ by $\nu^{(t)}$ and the population fixed point by $\nu^*$, the deviation $\Delta_t(\theta) = |V_{\nu^{(t)}}(\theta) - V_{\nu^*}(\theta)|$ contracts linearly under a locally contractive OT map and sufficiently accurate support-count estimates:

$$\Delta_{t+1} \leq (1 - \eta_t \kappa)\, \Delta_t + O(N^{-1/2})$$

implying convergence as $t \to \infty$ and $N \to \infty$.

  • Consistency: Under continuity and Lipschitz conditions, if the class-adaptive priors $\nu_c^{(N)}$ converge weakly to the population priors $\nu_c^*$ as $N \to \infty$, then the robust class-score functions satisfy $V_c^{(N)}(x;\theta) \to V_c^*(x;\theta)$, and the corresponding dual parameters also converge. This ensures consistency of the loss and its minimizers.
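For a constant step size $\eta_t \equiv \eta$, unrolling the contraction recursion above (a routine geometric-series calculation, not stated explicitly in the source) makes the convergence rate explicit:

```latex
\Delta_T \;\le\; (1-\eta\kappa)^T \,\Delta_0 \;+\; \frac{C}{\eta\kappa}\, N^{-1/2}
```

where $C$ absorbs the constant hidden in the $O(N^{-1/2})$ term: the first term decays geometrically in the iteration count $T$, and the second vanishes as the support size $N$ grows.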

4. Algorithmic Structure and Training Workflow

Training in PG-DRO consists of two main phases:

Phase I – Prototype-Guided Prior Construction

  1. Compute $(\mu_b, \Sigma_b)$ for each base class $b$.
  2. For each support point and base class, calculate the sample-to-prototype soft-min costs $C_{bn}$.
  3. Solve the entropic OT problem to produce the coupling $T^*$.
  4. Derive the mixture weights $\tilde w_{bc}$ and build the class-wise priors $\nu_c$ as Gaussian mixtures.

Phase II – Robust Training

  • For each minibatch $(x_i, c_i^*)$ and each class $c$:

    • Construct the Gibbs kernel $Q^{\nu_c}_{\varepsilon, x_i}$.
    • Solve the 1D convex dual subproblem with Newton's method or bisection to obtain $(\lambda_c, V_c(x_i))$.
    • Compute the batch cross-entropy loss and update the model parameters:

$$\theta \leftarrow \theta - \eta \nabla_\theta L$$

A summary pseudocode appears below:

Algorithm PG-DRO
Input: D_base, support S, model f_θ, cost c, ambiguity radius ρ, parameters ε_s, ε_c.
Phase I – Priors: Compute mean/covariances for base classes, align supports via OT, extract T*, generate ν_c.
Phase II – Robust Training: For minibatch, per sample/class:
    Build Q^{ν_c}_{ε,x}, solve dual for V_c(x), compute cross-entropy, update θ.

5. Empirical Results and Benchmarks

PG-DRO's empirical performance was evaluated on multiple tasks:

  • Synthetic Gaussian Toy ($K=8$, $d=10$): Target classes were subject to mean shift, rotation, and scaling. Baseline comparisons included ERM and classical OT adaptation, using metrics of average accuracy and worst-10% class accuracy. PG-DRO improved average accuracy by 3–7% and worst-case accuracy by 2–10% under distribution shifts.
  • CIFAR-100 to CIFAR-10 Transfer: Using a ResNet-18 backbone pretrained on CIFAR-100, PG-DRO was benchmarked on few-shot supports ($S \in \{1,5,10\}$) from CIFAR-10, against baselines including SAA, W-DRO, and plain few-shot fine-tuning. Robustness was tested via Laplace/Gaussian noise on features. PG-DRO consistently outperformed baselines, with gains of 1–3% over W-DRO and 2–5% over SAA, particularly under higher noise and fewer shots.
  • Cross-Domain Few-Shot (mini/tiered-ImageNet to CIFAR-100): Under similar setups and additive noise, PG-DRO improved performance by up to 5% at $k=10$ and was especially beneficial for the most challenging classes.
  • Computational Overhead: Phase I requires a single entropic OT solve of size $B \times N$ (with $B \approx 100$–$1000$, $N \approx 5$–$10$, and $100$–$200$ Sinkhorn iterations). Phase II robust logits incur $C$ convex solves per sample (Newton: 5–8 iterations). Total training cost is at most double that of standard cross-entropy.

6. Connections and Significance

PG-DRO unifies hierarchical OT and Sinkhorn DRO in a class-adaptive framework that enables organic integration of few-shot evidence and transferable knowledge for robust generalization. By constructing class-wise Gaussian-mixture priors, the method dynamically tailors the ambiguity set for each class, overcoming rigidity imposed by fixed reference distributions in earlier DRO approaches. The framework provides both computational tractability and rigorous statistical guarantees (contraction, convexity, and consistency), with robust empirical advantages, especially in few-shot and highly shifted scenarios (Sun et al., 1 Feb 2026).

A plausible implication is that methods leveraging adaptive OT priors may offer broader applicability in foundation model adaptation, continual learning, and other paradigms requiring principled handling of distribution shift with limited supervision.
