
Adaptive Distribution Calibration (ADC)

Updated 31 March 2026
  • Adaptive Distribution Calibration (ADC) is a method that calibrates class-conditional feature distributions in few-shot learning by transferring statistics from well-sampled base classes.
  • It employs a hierarchical optimal transport framework to compute adaptive weights for mixing base class means and covariances, reducing overfitting in novel classes.
  • Empirical results demonstrate that ADC improves classification accuracy and generalizes effectively across different benchmarks and domains.

Adaptive Distribution Calibration (ADC), also known as Hierarchical Optimal Transport (H-OT), is a principled method for calibrating feature distributions in few-shot learning. ADC is designed to address the issue of overfitting that arises when classifiers are trained on distributions estimated from very limited samples (the few-shot regime). Leveraging a hierarchical optimal transport framework, ADC adaptively transfers statistical information (mean and covariance) from a set of "base" classes with ample data to "novel" classes observed with only a few labeled examples, yielding calibrated distributions for improved downstream classification performance (Guo et al., 2022).

1. Formal Problem Setup

ADC operates in the few-shot classification setting, where the data is partitioned into a sizable base dataset

$$D_{\text{base}} = \{(x_b^j, y_b^j) \mid b=1,\ldots,B;\ j=1,\ldots,J_b\}$$

consisting of $B$ base classes, each with $J_b$ feature vectors $x_b^j \in \mathbb{R}^V$, and a novel set

$$D_{\text{novel}} = \{(x_{n,k}, y_n) \mid n=1,\ldots,N;\ k=1,\ldots,K\}$$

where each of $N$ novel classes has only $K \ll J_b$ support examples. It is assumed that in the feature space, instances from each class are approximately distributed according to a class-conditional Gaussian.

Due to the small sample size per novel class, empirical estimates of the mean and covariance are unreliable. ADC counters this by calibrating each novel class Gaussian via adaptive, soft transfer of statistics from the well-sampled base classes.
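A small numerical illustration of why few-shot estimates are unreliable: with $K$ samples in $V > K$ dimensions, the empirical covariance has rank at most $K - 1$. The sketch below uses synthetic Gaussian features; the dimensions, sample counts, and the helper name `class_gaussian_stats` are assumptions made for illustration, not values from the source.

```python
import numpy as np

def class_gaussian_stats(features):
    """Empirical mean and covariance of one class; features has shape (J, V)."""
    mu = features.mean(axis=0)
    # rowvar=False: columns are feature dimensions; np.cov normalises by J - 1.
    sigma = np.cov(features, rowvar=False)
    return mu, sigma

rng = np.random.default_rng(0)
V = 16
base_feats = rng.normal(size=(500, V))   # well-sampled base class (J_b = 500, assumed)
novel_feats = rng.normal(size=(5, V))    # few-shot novel class (K = 5, assumed)

_, sigma_base = class_gaussian_stats(base_feats)
_, sigma_novel = class_gaussian_stats(novel_feats)

rank_base = np.linalg.matrix_rank(sigma_base)    # full rank V
rank_novel = np.linalg.matrix_rank(sigma_novel)  # at most K - 1
print(rank_base, rank_novel)
```

A rank-deficient covariance cannot be used directly as a Gaussian density in all $V$ dimensions, which is one way to see why borrowing statistics from base classes helps.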

2. Calibration Objective and Distribution Mixing

The core calibration objective is to form for each novel class a new, “calibrated” Gaussian, parameterized by $\mu'_{n,k}$ (mean) and $\Sigma'_{n,k}$ (covariance). Rather than direct estimation, ADC mixes the statistics of all $B$ base classes using adaptive weights. These weights are derived per (novel class, support example) pair, yielding personalized calibration across the support set.

To reduce statistical skew, features undergo a mild Tukey power transform before calibration. Synthetic samples are then drawn from the resulting Gaussian and added to the original support set for classifier training; logistic regression is the typical choice.
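The power transform can be sketched as follows, assuming the usual Tukey ladder-of-powers form ($x^\lambda$ for $\lambda \neq 0$, $\log x$ for $\lambda = 0$) applied elementwise to non-negative features. The value $\lambda = 0.5$ and the small offset inside the logarithm are illustrative assumptions, not settings stated in the source.

```python
import numpy as np

def tukey_transform(x, lam=0.5):
    """Elementwise Tukey ladder-of-powers transform.

    Assumes non-negative inputs (e.g. post-ReLU features).
    lam=0.5 is an illustrative default, not a value from the source.
    """
    x = np.asarray(x, dtype=float)
    if lam == 0:
        # The 1e-12 offset guards against log(0); it is an assumption of this sketch.
        return np.log(x + 1e-12)
    return np.power(x, lam)

support = np.array([[0.0, 1.0, 4.0],
                    [9.0, 16.0, 0.25]])
transformed = tukey_transform(support, lam=0.5)
print(transformed)
```

With $\lambda < 1$ the transform compresses large activations more than small ones, which reduces right skew in the feature distribution.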

3. Hierarchical Optimal Transport Framework

The determination of appropriate transfer weights between base and novel classes is cast as a hierarchical optimal transport (OT) problem: a two-level entropic OT computation.

3.1 High-level OT

At the first (high) level, the procedure defines a uniform distribution over base classes:

$$P = \sum_{b=1}^{B} \frac{1}{B}\, \delta_{R_b}$$

and a uniform distribution over all $N \cdot K$ transformed support features from the novel classes:

$$Q = \sum_{n=1}^{N} \sum_{k=1}^{K} \frac{1}{NK}\, \delta_{\tilde{x}_{n,k}}.$$

A transport plan $T \in \mathbb{R}^{B \times (NK)}$ is calculated to minimize the entropic OT cost:

\begin{align*}
\text{OT}_\epsilon(P, Q) = \min_{T \in \Pi(P, Q)} \quad & \sum_{b,n,k} C_{b,(n,k)}\, T_{b,(n,k)} - \epsilon H(T) \\
\text{s.t.} \quad & \sum_{n,k} T_{b,(n,k)} = \frac{1}{B}, \quad \sum_{b} T_{b,(n,k)} = \frac{1}{NK}.
\end{align*}

Here, $H(T) = -\sum_{b,n,k} T_{b,(n,k)} \ln T_{b,(n,k)}$ is the Shannon entropy of the plan, $C_{b,(n,k)}$ is a learned cost (see below), and $T_{b,(n,k)}$ (obtained via Sinkhorn) provides a soft, per-sample, per-class transfer weight.
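A minimal sketch of the Sinkhorn iteration for this kind of entropic OT problem, assuming a given cost matrix and uniform marginals. The function name `sinkhorn`, the random cost matrix, and the values of `eps` and `n_iter` are assumptions chosen for illustration; this is the textbook matrix-scaling scheme, not necessarily the authors' implementation.

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.5, n_iter=300):
    """Entropic OT plan for cost C (m, n) with marginals a (m,) and b (n,)."""
    gibbs = np.exp(-C / eps)          # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (gibbs @ v)           # rescale to match row marginals
        v = b / (gibbs.T @ u)         # rescale to match column marginals
    return u[:, None] * gibbs * v[None, :]

rng = np.random.default_rng(0)
B, N, K_shot = 6, 5, 1                         # illustrative sizes
C = rng.uniform(size=(B, N * K_shot))          # stand-in for the learned cost
P = np.full(B, 1.0 / B)                        # uniform over base classes
Q = np.full(N * K_shot, 1.0 / (N * K_shot))    # uniform over support features
T = sinkhorn(C, P, Q)
print(T.sum(axis=1), T.sum(axis=0))
```

Larger `eps` gives a smoother plan and faster convergence; smaller `eps` approaches unregularized OT but usually needs more iterations or a log-domain implementation for numerical stability.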

3.2 Low-level OT for Cost Learning

Rather than hand-specify $C_{b,(n,k)}$, ADC learns it through a second OT within each base class. For base class $b$:

  • Define the empirical measure

$$R_b = \sum_{j=1}^{J_b} p_j^b\, \delta_{x_b^j}$$

where $p_j^b$ is the normalized assignment probability from a pretrained classifier $\phi$.

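  • Compute a low-level OT between $R_b$ and the uniform distribution over novel supports $Q$:

\begin{align*}
\text{OT}_\epsilon(R_b, Q) = \min_{M^b \in \Pi(R_b, Q)} \quad & \sum_{j,n,k} D^b_{j,(n,k)}\, M^b_{j,(n,k)} - \epsilon H(M^b) \\
\text{s.t.} \quad & \sum_{n,k} M^b_{j,(n,k)} = p_j^b, \quad \sum_{j} M^b_{j,(n,k)} = \frac{1}{NK},
\end{align*}

    where $H(M^b) = -\sum_{j,n,k} M^b_{j,(n,k)} \ln M^b_{j,(n,k)}$ is the Shannon entropy of the plan and $D^b_{j,(n,k)}$ is the ground cost between base sample $x_b^j$ and support feature $\tilde{x}_{n,k}$.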
  • Use ground cost $D^b_{j,(n,k)} = 1 - \cos(x_b^j, \tilde{x}_{n,k})$.
  • Define the class-to-support cost as

$$C_{b,(n,k)} = \sum_{j=1}^{J_b} D^b_{j,(n,k)}\, M^b_{j,(n,k)}.$$

This two-level OT ensures that $C_{b,(n,k)}$ reflects the fine-grained empirical relationship between each base class and each novel support.
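The cost-learning step can be sketched as below: for each base class, a cosine ground cost is computed against the support features, a low-level plan is obtained with Sinkhorn, and the plan-weighted cost is summed over base samples. Synthetic features are used, and a uniform $p_j^b$ stands in for the classifier-derived probabilities; both are assumptions of this sketch, as are the helper names and the value of `eps`.

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.5, n_iter=500):
    """Entropic OT plan for cost C with marginals a and b."""
    gibbs = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (gibbs @ v)
        v = b / (gibbs.T @ u)
    return u[:, None] * gibbs * v[None, :]

def cosine_cost(X, Y):
    """D[j, m] = 1 - cos(X[j], Y[m]) for row-wise feature matrices."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - Xn @ Yn.T

def class_to_support_cost(base_feats, p, support):
    """One row of C: sum over j of D[j, m] * M[j, m] for a single base class."""
    q = np.full(support.shape[0], 1.0 / support.shape[0])
    D = cosine_cost(base_feats, support)
    M = sinkhorn(D, p, q)
    return (D * M).sum(axis=0), M

rng = np.random.default_rng(0)
V, NK = 8, 5                                   # illustrative sizes
support = rng.normal(size=(NK, V))
base_classes = [rng.normal(size=(J, V)) for J in (20, 30, 25)]

rows = []
for feats in base_classes:
    # Uniform weights are a placeholder for the classifier-derived p_j^b.
    p = np.full(len(feats), 1.0 / len(feats))
    c_row, M = class_to_support_cost(feats, p, support)
    rows.append(c_row)
C = np.stack(rows)                             # shape (B, NK)
print(C.shape)
```

The resulting matrix `C` is what the high-level OT would consume as its cost.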

4. Adaptive Weights and Parameter Transfer

After solving the high-level OT, the adaptive weight matrix is derived as

$$w_{b,(n,k)} = (NK)\, T_{b,(n,k)}, \qquad \sum_b w_{b,(n,k)} = 1.$$
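The normalization follows from the column-marginal constraint of the high-level OT, which fixes the mass arriving at each support feature to $1/(NK)$:

```latex
\sum_{b=1}^{B} w_{b,(n,k)}
  = NK \sum_{b=1}^{B} T_{b,(n,k)}
  = NK \cdot \frac{1}{NK}
  = 1 .
```

Each column of the rescaled plan is therefore a probability vector over base classes.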

These weights govern the mixing of base-class statistics for each support example:

\begin{align*}
\mu'_{n,k} &= \sum_{b=1}^{B} w_{b,(n,k)}\, \mu_b \\
\Sigma'_{n,k} &= \sum_{b=1}^{B} w_{b,(n,k)}\, \Sigma_b
\end{align*}

(In practice, the support point itself is added as an additional “class” with weight $1/(B+1)$; a small scalar $\alpha$ is added to $\Sigma'_{n,k}$ as regularization.)
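A minimal sketch of the mixing step with hand-picked numbers, assuming a transport plan `T` whose columns each sum to $1/(NK)$. The function name `calibrate` is hypothetical. The extra "support point as a class" term is omitted, and `alpha` is added to every covariance entry, which is one literal reading of the regularization described above and is an assumption of this sketch.

```python
import numpy as np

def calibrate(T, base_means, base_covs, alpha=0.0):
    """Mix base statistics with weights w = (N*K) * T.

    T: (B, NK) plan; base_means: (B, V); base_covs: (B, V, V).
    alpha is added to every covariance entry (assumed form of the regularizer).
    """
    NK = T.shape[1]
    w = NK * T                                        # each column sums to 1
    mu = w.T @ base_means                             # (NK, V)
    sigma = np.einsum("bm,bij->mij", w, base_covs)    # (NK, V, V)
    return mu, sigma + alpha, w

# Two base classes, one support point (NK = 1), two feature dimensions.
T = np.array([[0.25], [0.75]])
base_means = np.array([[0.0, 0.0], [4.0, 8.0]])
base_covs = np.stack([np.eye(2), 3.0 * np.eye(2)])

mu, sigma, w = calibrate(T, base_means, base_covs)
print(mu)      # weighted mean of the two base means
print(sigma)   # weighted sum of the two base covariances
```

Here the calibrated mean is $0.25 \cdot (0,0) + 0.75 \cdot (4,8) = (3,6)$ and the calibrated covariance is $0.25\, I + 0.75 \cdot 3I = 2.5\, I$.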

5. Algorithmic Workflow and Complexity

The test-time ADC pipeline for a novel episode consists of:

  1. Feature extraction: Precompute $(\mu_b, \Sigma_b)$ on $D_\text{base}$, pretrain $\phi$ for $p_j^b$.
  2. Support transformation: Apply Tukey-power transform to each $x_{n,k}$.
  3. Base class calibration: For each $b$, form $R_b$ and solve the low-level OT for $M^b$ and $C_{b,(n,k)}$.
  4. High-level OT: Solve for $T_{b,(n,k)}$ with cost $C$.
  5. Statistics mixing: Compute adaptive statistics $\mu'_{n,k}, \Sigma'_{n,k}$.
  6. Augmentation: Sample synthetic features from $\mathcal{N}(\mu'_{n,k}, \Sigma'_{n,k})$ to augment the support.
  7. Classifier training: Fit a simple classifier to the union of support and generated features.
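Steps 6 and 7 can be sketched as follows, assuming calibrated means and covariances are already available. The stand-in statistics, the number of generated samples, and the plain gradient-descent softmax regression (used here in place of a library logistic regression) are all assumptions of this sketch; `augment` and `fit_softmax_regression` are hypothetical helper names.

```python
import numpy as np

def augment(support, labels, mus, sigmas, n_gen, rng):
    """Draw n_gen synthetic features per support point from N(mu', Sigma')."""
    feats, labs = [support], [labels]
    for mu, sigma, y in zip(mus, sigmas, labels):
        feats.append(rng.multivariate_normal(mu, sigma, size=n_gen))
        labs.append(np.full(n_gen, y))
    return np.concatenate(feats), np.concatenate(labs)

def fit_softmax_regression(X, y, n_classes, lr=0.1, n_steps=300):
    """Multinomial logistic regression trained by plain gradient descent."""
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                          # one-hot targets
    for _ in range(n_steps):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - Y) / len(X)
        W -= lr * X.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

rng = np.random.default_rng(0)
V, N = 4, 3
support = np.eye(N, V) * 5.0                 # one support point per class (K = 1)
labels = np.arange(N)
mus = support.copy()                         # stand-in for calibrated means
sigmas = np.stack([0.1 * np.eye(V)] * N)     # stand-in for calibrated covariances

X, y = augment(support, labels, mus, sigmas, n_gen=50, rng=rng)
W, b = fit_softmax_regression(X, y, n_classes=N)
pred = (support @ W + b).argmax(axis=1)
print(X.shape, pred)
```

The classifier sees the real support features together with the synthetic ones, so the amount of augmentation (`n_gen`) trades off variance reduction against reliance on the calibrated Gaussian.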

The computational burden is moderate: Sinkhorn OT between discrete measures of size $n$ is $O(n^2 \log n / \epsilon^2)$. High-level OT has complexity $O(\max\{B, NK\}^2 \log \max\{B, NK\})$, and each low-level OT per base class is $O(\max\{J_b, NK\}^2 \log \max\{J_b, NK\})$. Empirically, this procedure adds a few seconds per episode, with no fine-tuning of the feature backbone required.

6. Empirical Evaluation

ADC/H-OT demonstrates consistent improvements over prior calibration methods across standard few-shot classification benchmarks:

| Dataset | Backbone | 1-shot Free-Lunch | 1-shot H-OT | 5-shot Free-Lunch | 5-shot H-OT |
|---|---|---|---|---|---|
| miniImageNet | WRN28 | 68.57% | 69.04% | 82.88% | 84.36% |
| tieredImageNet | WRN28 | 75.10% | 75.91% | 88.42% | 89.33% |

CUB, CIFAR-FS: comparable gains.

Evaluation follows the 5-way 1-shot and 5-shot protocols over 10,000 test episodes, reporting mean accuracy with 95% confidence intervals. H-OT consistently produces higher accuracy than the Free-Lunch baselines.
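The reporting convention can be sketched as a normal-approximation interval over per-episode accuracies. The simulated accuracies below are hypothetical (75 query images per episode and a 0.7 hit rate are assumptions) and are not results from the paper.

```python
import numpy as np

def mean_and_ci95(episode_acc):
    """Mean accuracy and half-width of the normal-approximation 95% interval."""
    acc = np.asarray(episode_acc, dtype=float)
    mean = acc.mean()
    half_width = 1.96 * acc.std(ddof=1) / np.sqrt(len(acc))
    return mean, half_width

rng = np.random.default_rng(0)
# Hypothetical per-episode accuracies: 10,000 episodes, 75 queries each (assumed).
episode_acc = rng.binomial(n=75, p=0.7, size=10_000) / 75
mean, hw = mean_and_ci95(episode_acc)
print(f"{100 * mean:.2f}% +/- {100 * hw:.2f}%")
```

With 10,000 episodes the interval half-width shrinks with the square root of the episode count, which is why differences of under one percentage point can still be resolved.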

In cross-domain transfer (e.g., train on miniImageNet or CIFAR-FS, test on CUB), H-OT outperforms baselines by approximately 2 percentage points in 5-way 1-shot, indicating superior generalization.

Ablation studies support:

  • The benefit of low-level OT for adaptive cost learning (substantial boost over fixed-cost metrics).
  • Performance robustness and improvement as more base classes are incorporated, unlike prior methods which may degrade.
  • Greater sample efficiency: H-OT reaches comparable accuracy with fewer generated samples per support example.

7. Context and Significance

Adaptive Distribution Calibration/H-OT provides a fully differentiable, data-driven solution for transfer-weight learning in few-shot classification. The hierarchical OT structure enables finely resolved, context-dependent adaptation at both the class and sample levels, exceeding the capabilities of heuristic or fixed-weight calibration. As it does not require backbone retraining or fine-tuning, ADC is suitable as a plug-and-play module for few-shot inference.

The empirical results and ablation analyses indicate that ADC generalizes well within and across domains, achieves consistent accuracy gains (roughly 0.5 to 1.5 percentage points over Free-Lunch on miniImageNet and tieredImageNet), and does so with modest computation. The use of entropic OT and the Sinkhorn algorithm enables both scalability and differentiability, making ADC compatible with contemporary machine learning pipelines and optimization approaches (Guo et al., 2022).
