Adaptive Distribution Calibration (ADC)

Updated 31 March 2026

Adaptive Distribution Calibration (ADC) is a method that calibrates class-conditional feature distributions in few-shot learning by transferring statistics from well-sampled base classes.
It employs a hierarchical optimal transport framework to compute adaptive weights for mixing base class means and covariances, reducing overfitting in novel classes.
Empirical results demonstrate that ADC improves classification accuracy and generalizes effectively across different benchmarks and domains.

Adaptive Distribution Calibration (ADC), also known as Hierarchical Optimal Transport (H-OT), is a principled method for calibrating feature distributions in few-shot learning. ADC is designed to address the issue of overfitting that arises when classifiers are trained on distributions estimated from very limited samples (the few-shot regime). Leveraging a hierarchical optimal transport framework, ADC adaptively transfers statistical information (mean and covariance) from a set of "base" classes with ample data to "novel" classes observed with only a few labeled examples, yielding calibrated distributions for improved downstream classification performance (Guo et al., 2022).

1. Formal Problem Setup

ADC operates in the few-shot classification setting, where the data is partitioned into a sizable base dataset

$D_{\text{base}} = \{(x_b^j, y_b^j) \mid b=1,\ldots,B; j=1,\ldots,J_b\}$

consisting of $B$ base classes, each with $J_b$ feature vectors $x_b^j \in \mathbb{R}^V$, and a novel set

$D_{\text{novel}} = \{(x_{n,k}, y_n) \mid n=1,\ldots,N; k=1,\ldots,K\}$

where each of $N$ novel classes has only $K \ll J_b$ support examples. It is assumed that in the feature space, instances from each class are approximately distributed according to a class-conditional Gaussian.

Due to the small sample size per novel class, empirical estimates of the mean and covariance are unreliable. ADC counters this by calibrating each novel class Gaussian via adaptive, soft transfer of statistics from the well-sampled base classes.
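The unreliability of few-shot estimates can be seen numerically: a sample covariance computed from $K$ points has rank at most $K-1$, so it cannot describe a $V$-dimensional Gaussian when $K \ll V$. The following is a minimal sketch on synthetic data; the sizes ($V=16$, $J_b=500$, $K=5$) and the helper name `class_statistics` are illustrative assumptions, not taken from the source.

```python
import numpy as np

def class_statistics(features):
    """Return the empirical mean (V,) and covariance (V, V) of one class."""
    mean = features.mean(axis=0)
    # rowvar=False: rows are samples, columns are feature dimensions.
    cov = np.atleast_2d(np.cov(features, rowvar=False))
    return mean, cov

rng = np.random.default_rng(0)
V = 16                                    # feature dimension (illustrative)
base_feats = rng.normal(size=(500, V))    # J_b = 500 samples of one base class
novel_feats = rng.normal(size=(5, V))     # K = 5 support samples of one novel class

_, base_cov = class_statistics(base_feats)
_, novel_cov = class_statistics(novel_feats)

# The base covariance is full rank; the few-shot covariance is rank-deficient.
base_rank = np.linalg.matrix_rank(base_cov)
novel_rank = np.linalg.matrix_rank(novel_cov)
```

A rank-deficient covariance is singular, which is one concrete reason the novel-class Gaussian needs statistics borrowed from the base classes.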

2. Calibration Objective and Distribution Mixing

The core calibration objective is to form, for each support example $x_{n,k}$ of a novel class, a new “calibrated” Gaussian parameterized by $\mu'_{n,k}$ (mean) and $\Sigma'_{n,k}$ (covariance). Rather than estimating these parameters directly from the few support samples, ADC mixes the statistics of all $B$ base classes using adaptive weights. These weights are derived per (novel class, support example) pair, so each support example receives its own calibration.

To reduce skew in the feature distributions, features undergo a Tukey power transform before calibration. Synthetic samples are then drawn from the resulting Gaussian and added to the original support set for classifier training; logistic regression is the typical choice.
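As an illustration of the transform step, the sketch below implements Tukey's ladder of powers ($x^\lambda$ for $\lambda \neq 0$, $\log x$ for $\lambda = 0$) and checks that it reduces skew on right-skewed data. The value $\lambda = 0.5$, the exponential test data, and the helper names are assumptions made for the example; the source does not state which power is used.

```python
import numpy as np

def tukey_transform(x, lam=0.5):
    """Tukey's ladder of powers: x**lam if lam != 0, else log(x).

    Assumes non-negative features (strictly positive when lam == 0).
    lam=0.5 is an illustrative default, not a value from the source.
    """
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return np.power(x, lam)

def skewness(x):
    """Standardized third central moment of a 1-D sample."""
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

rng = np.random.default_rng(0)
raw = rng.exponential(size=10_000)        # right-skewed, non-negative stand-in features
transformed = tukey_transform(raw, lam=0.5)

skew_before = skewness(raw)
skew_after = skewness(transformed)
```

After the transform the sample is closer to symmetric, which makes the Gaussian assumption on class-conditional features more plausible.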

3. Hierarchical Optimal Transport Framework

The determination of appropriate transfer weights between base and novel classes is cast as a hierarchical optimal transport (OT) problem: a two-level entropic OT computation.

3.1 High-level OT

At the first (high) level, the procedure defines a uniform distribution over base classes:

$P = \sum_{b=1}^B \frac{1}{B}\, \delta_{R_b}$

and a uniform distribution over all $N \cdot K$ transformed support features from the novel classes:

$Q = \sum_{n=1}^N \sum_{k=1}^K \frac{1}{NK}\, \delta_{\tilde{x}_{n,k}}.$

A transport plan $T \in \mathbb{R}^{B \times (N K)}$ is calculated to minimize the entropic OT cost:

\begin{align*}
\text{OT}_\epsilon(P, Q) = \min_{T \in \Pi(P, Q)} \quad & \sum_{b,n,k} C_{b,(n,k)}\, T_{b,(n,k)} - \epsilon \sum_{b,n,k} T_{b,(n,k)} \ln T_{b,(n,k)} \\
\text{s.t.} \quad & \sum_{n,k} T_{b,(n,k)} = \frac{1}{B}, \quad \sum_{b} T_{b,(n,k)} = \frac{1}{N K}.
\end{align*}

Here, $C_{b,(n,k)}$ is a learned cost (see Section 3.2), and $T_{b,(n,k)}$ (obtained via the Sinkhorn algorithm) provides a soft, per-sample, per-class transfer weight.
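The Sinkhorn step can be sketched in a few lines of NumPy. This is a generic entropy-regularized solver written for illustration, not the authors' implementation; the random cost matrix, the sizes $B=6$ and $NK=10$, and the values of `eps` and `n_iters` are assumptions.

```python
import numpy as np

def sinkhorn(cost, row_marginal, col_marginal, eps=0.1, n_iters=1000):
    """Entropy-regularized transport plan via Sinkhorn scaling iterations."""
    kernel = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(row_marginal)
    for _ in range(n_iters):
        v = col_marginal / (kernel.T @ u)     # rescale to match column marginals
        u = row_marginal / (kernel @ v)       # rescale to match row marginals
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)
B, NK = 6, 10                                 # illustrative sizes
cost = rng.uniform(size=(B, NK))              # stand-in for the learned cost C
p = np.full(B, 1.0 / B)                       # uniform over base classes
q = np.full(NK, 1.0 / NK)                     # uniform over support features

T = sinkhorn(cost, p, q)
row_sums = T.sum(axis=1)                      # should equal 1/B
col_sums = T.sum(axis=0)                      # should be close to 1/(NK)
```

Each column of `T` describes how one support feature distributes its mass $1/(NK)$ over the base classes, which is what later makes it usable as a per-sample weight vector.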

3.2 Low-level OT for Cost Learning

Rather than hand-specifying $C_{b,(n,k)}$, ADC learns it through a second OT within each base class. For base class $b$:

Define the empirical measure

$R_b = \sum_{j=1}^{J_b} p_j^b\, \delta_{x_b^j}$

where $p_j^b$ is the normalized assignment probability from a pretrained classifier $\phi$.

Compute a low-level OT between $R_b$ and the uniform distribution over novel supports $Q$:

\begin{align*}
\text{OT}_\epsilon(R_b, Q) = \min_{M^b \in \Pi(R_b, Q)} \quad & \sum_{j,n,k} D^b_{j,(n,k)}\, M^b_{j,(n,k)} - \epsilon \sum_{j,n,k} M^b_{j,(n,k)} \ln M^b_{j,(n,k)} \\
\text{s.t.} \quad & \sum_{n,k} M^b_{j,(n,k)} = p_j^b, \quad \sum_{j} M^b_{j,(n,k)} = \frac{1}{N K}.
\end{align*}

Use the ground cost $D^b_{j,(n,k)} = 1 - \cos(x_b^j, \tilde{x}_{n,k})$.
Define the class-to-support cost as

$C_{b,(n,k)} = \sum_{j=1}^{J_b} D^b_{j,(n,k)} M^b_{j,(n,k)}.$

This two-level OT ensures that $C_{b,(n,k)}$ reflects the fine-grained empirical relationship between each base class and each novel support.
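A minimal sketch of the cost-learning step for two toy base classes follows. The Sinkhorn routine is a generic solver, the uniform `p` stands in for the classifier probabilities $p_j^b$ (which would come from $\phi$ in the method), and all data, sizes, and helper names are assumptions made for the example.

```python
import numpy as np

def sinkhorn(cost, row_marginal, col_marginal, eps=0.1, n_iters=1000):
    """Entropy-regularized transport plan via Sinkhorn scaling iterations."""
    kernel = np.exp(-cost / eps)
    u = np.ones_like(row_marginal)
    for _ in range(n_iters):
        v = col_marginal / (kernel.T @ u)
        u = row_marginal / (kernel @ v)
    return u[:, None] * kernel * v[None, :]

def cosine_cost(base_feats, support_feats):
    """Ground cost D[j, s] = 1 - cos(x_b^j, x~_s)."""
    a = base_feats / np.linalg.norm(base_feats, axis=1, keepdims=True)
    b = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def class_to_support_cost(base_feats, p, support_feats, eps=0.1):
    """C_b[s] = sum_j D[j, s] * M[j, s], with M the low-level plan."""
    D = cosine_cost(base_feats, support_feats)            # (J_b, NK)
    n_support = support_feats.shape[0]
    q = np.full(n_support, 1.0 / n_support)               # uniform over supports
    M = sinkhorn(D, p, q, eps)                            # (J_b, NK)
    return (D * M).sum(axis=0)                            # (NK,)

rng = np.random.default_rng(0)
J, V = 20, 8
base_a = np.abs(rng.normal(scale=0.1, size=(J, V))); base_a[:, 0] += 1.0  # class near axis 0
base_b = np.abs(rng.normal(scale=0.1, size=(J, V))); base_b[:, 1] += 1.0  # class near axis 1
support = np.abs(rng.normal(scale=0.1, size=(2, V)))
support[0, 0] += 1.0                                      # support 0 resembles class a
support[1, 1] += 1.0                                      # support 1 resembles class b
p = np.full(J, 1.0 / J)   # placeholder for p_j^b; the method takes these from phi

C_a = class_to_support_cost(base_a, p, support)
C_b = class_to_support_cost(base_b, p, support)
```

In this toy setting each support point receives a lower cost from the base class it resembles, so the high-level OT would route more of its mass to that class.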

4. Adaptive Weights and Parameter Transfer

After solving the high-level OT, the adaptive weight matrix is derived as

$w_{b,(n,k)} = (N K) T_{b,(n,k)}, \qquad \sum_b w_{b,(n,k)} = 1.$

These weights govern the mixing of base-class statistics for each support example:

\begin{align*}
\mu'_{n,k} &= \sum_{b=1}^B w_{b,(n,k)}\, \mu_b \\
\Sigma'_{n,k} &= \sum_{b=1}^B w_{b,(n,k)}\, \Sigma_b
\end{align*}

(In practice, the support point itself is added as an additional “class” with weight $1/(B+1)$; a small scalar $\alpha$ is added to $\Sigma'_{n,k}$ as regularization.)
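The mixing step reduces to two weighted sums. The sketch below assumes a transport plan `T` is already available and uses a uniform plan so the result is easy to verify by hand. Two simplifications are assumptions of this example: the scalar $\alpha$ is added to every entry of the covariance (one literal reading of "a scalar is added"), and the extra support-point "class" mentioned above is left out.

```python
import numpy as np

def calibrate(T, base_means, base_covs, alpha=0.1):
    """Mix base statistics with weights w = (N K) * T.

    T:          (B, NK) transport plan with column sums 1/(NK)
    base_means: (B, V)
    base_covs:  (B, V, V)
    alpha:      scalar regularizer, added elementwise here (an assumption)
    """
    n_support = T.shape[1]
    w = n_support * T                                   # (B, NK), columns sum to 1
    mu = w.T @ base_means                               # (NK, V)
    sigma = np.einsum("bs,bvu->svu", w, base_covs)      # (NK, V, V)
    return w, mu, sigma + alpha

rng = np.random.default_rng(0)
B, NK, V = 3, 4, 2                                      # illustrative sizes
base_means = rng.normal(size=(B, V))
base_covs = np.stack([(b + 1) * np.eye(V) for b in range(B)])
T = np.full((B, NK), 1.0 / (B * NK))                    # uniform plan for checking

w, mu, sigma = calibrate(T, base_means, base_covs, alpha=0.1)
```

With a uniform plan every support example gets the plain average of the base statistics; a non-uniform `T` shifts each calibrated Gaussian toward the base classes that are cheapest to transport to.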

5. Algorithmic Workflow and Complexity

The test-time ADC pipeline for a novel episode consists of:

Feature extraction: Precompute $(\mu_b, \Sigma_b)$ on $D_\text{base}$ and pretrain the classifier $\phi$ that supplies $p_j^b$.
Support transformation: Apply the Tukey power transform to each $x_{n,k}$, giving $\tilde{x}_{n,k}$.
Base class calibration: For each $b$, form $R_b$ and solve the low-level OT for $M^b$ and $C_{b,(n,k)}$.
High-level OT: Solve for $T_{b,(n,k)}$ with cost $C$.
Statistics mixing: Compute adaptive statistics $\mu'_{n,k}, \Sigma'_{n,k}$.
Augmentation: Sample synthetic features from $\mathcal{N}(\mu'_{n,k},\Sigma'_{n,k})$ to augment the support.
Classifier training: Fit a simple classifier to the union of support and generated features.
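The last two steps of the workflow, augmentation and classifier training, can be sketched as follows. The calibrated means and covariances here are placeholders rather than outputs of the OT steps, the softmax regression trained by plain gradient descent is a stand-in for an off-the-shelf logistic regression, and the sizes, learning rate, and helper names are all assumptions.

```python
import numpy as np

def augment(mu, sigma, labels, n_per_support, rng):
    """Draw n_per_support synthetic features from each N(mu[s], sigma[s])."""
    feats = [rng.multivariate_normal(m, s, size=n_per_support)
             for m, s in zip(mu, sigma)]
    return np.concatenate(feats), np.repeat(labels, n_per_support)

def fit_softmax(X, y, n_classes, lr=0.5, n_steps=300):
    """Softmax regression by gradient descent (stand-in for logistic regression)."""
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                         # one-hot targets
    for _ in range(n_steps):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / len(X)                         # gradient w.r.t. logits
        W -= lr * X.T @ G
        b -= lr * G.sum(axis=0)
    return W, b

rng = np.random.default_rng(0)
V = 4
support = np.array([[2.0, 0.0, 0.0, 0.0],
                    [0.0, 2.0, 0.0, 0.0]])           # 2-way, 1-shot toy episode
support_labels = np.array([0, 1])
mu = support.copy()                                  # placeholder calibrated means
sigma = np.stack([0.1 * np.eye(V)] * 2)              # placeholder calibrated covariances

X_syn, y_syn = augment(mu, sigma, support_labels, 50, rng)
X = np.concatenate([support, X_syn])                 # union of support and synthetic
y = np.concatenate([support_labels, y_syn])

W, b = fit_softmax(X, y, n_classes=2)
train_acc = ((X @ W + b).argmax(axis=1) == y).mean()
```

Query features would then be classified with the same `W` and `b` after applying the same power transform used on the support set.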

The computational burden is moderate: Sinkhorn OT between discrete measures of size $n$ is $O(n^2 \log n / \epsilon^2)$. Omitting the $\epsilon$ dependence, the high-level OT has complexity $O(\max\{B, NK\}^2 \log \max\{B,NK\})$, and each low-level OT per base class is $O(\max\{J_b, NK\}^2 \log \max\{J_b,NK\})$. Empirically, this procedure adds a few seconds per episode, with no fine-tuning of the feature backbone required.

6. Empirical Evaluation

ADC/H-OT demonstrates consistent improvements over prior calibration methods across standard few-shot classification benchmarks:

Dataset	Backbone	1-shot Free-Lunch	1-shot H-OT	5-shot Free-Lunch	5-shot H-OT
miniImageNet	WRN28	68.57%	69.04%	82.88%	84.36%
tieredImageNet	WRN28	75.10%	75.91%	88.42%	89.33%
CUB, CIFAR-FS	—	Comparable gains	—	Comparable gains	—

Protocols involve 5-way, 1- and 5-shot settings, 10,000 test episodes, and report mean ± 95% confidence intervals. H-OT consistently produces higher accuracy than Free-Lunch baselines.
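For reference, a mean with a 95% confidence interval over test episodes is commonly computed with the normal approximation, $\bar{a} \pm 1.96\, s / \sqrt{n}$. The sketch below assumes that convention (the source does not spell out the formula) and uses made-up episode accuracies.

```python
import math
import statistics

def mean_and_ci95(accuracies):
    """Mean and normal-approximation 95% half-width over per-episode accuracies."""
    n = len(accuracies)
    mean = statistics.fmean(accuracies)
    # Sample standard deviation divided by sqrt(n) is the standard error.
    half_width = 1.96 * statistics.stdev(accuracies) / math.sqrt(n)
    return mean, half_width

# Hypothetical per-episode accuracies, not results from the source.
episode_acc = [0.6, 0.8] * 50
mean_acc, ci95 = mean_and_ci95(episode_acc)
```

With 10,000 episodes the half-width shrinks by a factor of 10 relative to 100 episodes of the same spread, which is why large episode counts are used for reporting.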

In cross-domain transfer (e.g., train on miniImageNet or CIFAR-FS, test on CUB), H-OT outperforms baselines by approximately 2 percentage points in 5-way 1-shot, indicating superior generalization.

Ablation studies support:

The benefit of low-level OT for adaptive cost learning (substantial boost over fixed-cost metrics).
Performance robustness and improvement as more base classes are incorporated, unlike prior methods which may degrade.
Greater sample efficiency: H-OT achieves parity with fewer generated samples per support.

7. Context and Significance

Adaptive Distribution Calibration/H-OT provides a fully differentiable, data-driven solution for transfer-weight learning in few-shot classification. The hierarchical OT structure enables finely resolved, context-dependent adaptation at both the class and sample levels, exceeding the capabilities of heuristic or fixed-weight calibration. As it does not require backbone retraining or fine-tuning, ADC is suitable as a plug-and-play module for few-shot inference.

The empirical results and ablation analyses indicate that ADC generalizes well within and across domains, improves accuracy over Free-Lunch by roughly 0.5 to 1.5 percentage points on miniImageNet and tieredImageNet, and does so with modest computation. The use of entropic OT and the Sinkhorn algorithm enables both scalability and differentiability, making ADC compatible with contemporary machine learning pipelines and optimization approaches (Guo et al., 2022).

References

Guo et al. (2022). Adaptive Distribution Calibration for Few-Shot Learning with Hierarchical Optimal Transport.