Adaptive Distribution Calibration (ADC)
- Adaptive Distribution Calibration (ADC) is a method that calibrates class-conditional feature distributions in few-shot learning by transferring statistics from well-sampled base classes.
- It employs a hierarchical optimal transport framework to compute adaptive weights for mixing base class means and covariances, reducing overfitting in novel classes.
- Empirical results demonstrate that ADC improves classification accuracy and generalizes effectively across different benchmarks and domains.
Adaptive Distribution Calibration (ADC), also known as Hierarchical Optimal Transport (H-OT), is a principled method for calibrating feature distributions in few-shot learning. ADC is designed to address the issue of overfitting that arises when classifiers are trained on distributions estimated from very limited samples (the few-shot regime). Leveraging a hierarchical optimal transport framework, ADC adaptively transfers statistical information (mean and covariance) from a set of "base" classes with ample data to "novel" classes observed with only a few labeled examples, yielding calibrated distributions for improved downstream classification performance (Guo et al., 2022).
1. Formal Problem Setup
ADC operates in the few-shot classification setting, where the data is partitioned into a sizable base dataset consisting of $B$ base classes, each with many feature vectors, and a novel set in which each of $N$ novel classes has only $K$ support examples. It is assumed that, in the feature space, instances from each class are approximately distributed according to a class-conditional Gaussian.
Due to the small sample size per novel class, empirical estimates of the mean and covariance are unreliable. ADC counters this by calibrating each novel class Gaussian via adaptive, soft transfer of statistics from the well-sampled base classes.
2. Calibration Objective and Distribution Mixing
The core calibration objective is to form, for each novel class, a new "calibrated" Gaussian parameterized by $\mu'_{n,k}$ (mean) and $\Sigma'_{n,k}$ (covariance). Rather than estimating these directly from the few support examples, ADC mixes the statistics of all base classes using adaptive weights. These weights are derived per (novel class, support example) pair, so each support example receives its own calibration.
To reduce statistical skew, features undergo a mild Tukey power transform before calibration. Synthetic samples are then drawn from the resulting Gaussian to augment the original support set for classifier training; logistic regression is typically employed.
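The transform-then-sample step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes non-negative features (for example, post-ReLU activations), and the exponent `lam=0.5`, the toy covariance, and the helper names `tukey_transform` and `sample_calibrated` are all hypothetical choices made for the example.

```python
import numpy as np

def tukey_transform(x, lam=0.5):
    """Tukey's ladder of powers: x**lam for lam != 0, log(x) for lam == 0.

    Assumes non-negative features (strictly positive when lam == 0).
    """
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return np.power(x, lam)

def sample_calibrated(mean, cov, n_samples, seed=0):
    """Draw synthetic features from a Gaussian with the given mean and covariance."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_samples)

# Toy support feature (one example, three dimensions).
support = np.array([[0.0, 4.0, 9.0]])
transformed = tukey_transform(support, lam=0.5)   # square root of each entry

# Placeholder statistics: a real run would use the calibrated mean/covariance.
mean = transformed[0]
cov = 0.1 * np.eye(3)

synthetic = sample_calibrated(mean, cov, n_samples=5)
augmented = np.vstack([transformed, synthetic])   # support + generated features
```

The augmented array is what a simple classifier such as logistic regression would then be fit on.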
3. Hierarchical Optimal Transport Framework
The determination of appropriate transfer weights between base and novel classes is cast as a hierarchical optimal transport (OT) problem: a two-level entropic OT computation.
3.1 High-level OT
At the first (high) level, the procedure defines a uniform distribution $P$ over the $B$ base classes, assigning mass $1/B$ to each, and a uniform distribution $Q$ over all $NK$ transformed support features from the novel classes, assigning mass $1/(NK)$ to each. A transport plan $T$ is calculated to minimize the entropic OT cost:
\begin{align*}
\text{OT}_\epsilon(P, Q) = \min_{T \in \Pi(P, Q)} \quad & \sum_{b,n,k} C_{b,(n,k)}\, T_{b,(n,k)} - \epsilon H(T) \\
\text{s.t.} \quad & \sum_{n,k} T_{b,(n,k)} = \frac{1}{B}, \quad \sum_{b} T_{b,(n,k)} = \frac{1}{N K},
\end{align*}
where $H(T) = -\sum_{b,n,k} T_{b,(n,k)} \ln T_{b,(n,k)}$ is the entropy of the plan. Here, $C_{b,(n,k)}$ is a learned cost (see below), and the optimal plan $T^*$ (obtained via Sinkhorn) provides a soft, per-sample, per-class transfer weight.
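A minimal NumPy sketch of Sinkhorn iterations for an entropic problem of this shape is given below. It is illustrative only: the cost matrix is random rather than learned, and the function name `sinkhorn`, the value `eps=0.5`, the iteration count, and the toy sizes are assumptions, not values from the source.

```python
import numpy as np

def sinkhorn(cost, p, q, eps=0.5, n_iters=500):
    """Entropic OT plan between histograms p (rows) and q (columns).

    Alternately rescales the Gibbs kernel exp(-cost / eps) so that the
    plan's row sums match p and its column sums match q.
    """
    gibbs = np.exp(-cost / eps)
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iters):
        u = p / (gibbs @ v)      # enforce row marginals
        v = q / (gibbs.T @ u)    # enforce column marginals
    return u[:, None] * gibbs * v[None, :]

# Toy sizes: B base classes, N novel classes, K_shot supports per novel class.
B, N, K_shot = 4, 2, 3
rng = np.random.default_rng(0)
cost = rng.random((B, N * K_shot))            # stand-in for the learned cost C

p = np.full(B, 1.0 / B)                       # uniform over base classes
q = np.full(N * K_shot, 1.0 / (N * K_shot))   # uniform over support features

T = sinkhorn(cost, p, q)                      # shape (B, N * K_shot)
```

Each column of `T` describes how one support example's mass is spread over the base classes. Smaller `eps` gives sharper plans but slower convergence and, in this naive form, a risk of numerical underflow in the kernel.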
3.2 Low-level OT for Cost Learning
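Rather than hand-specifying $C$, ADC learns it through a second OT problem within each base class. For base class $b$:
- Define the empirical measure $P_b = \sum_j p_j^b\, \delta_{x_j^b}$ over the samples $x_j^b$ of class $b$, where $p_j^b$ is the normalized assignment probability from a pretrained classifier.
- Compute a low-level OT between $P_b$ and the uniform distribution $Q$ over novel supports, writing $D^b_{j,(n,k)}$ for the ground cost between base sample $j$ and support $(n,k)$:
\begin{align*}
\text{OT}_\epsilon(P_b, Q) = \min_{M^b \in \Pi(P_b, Q)} \quad & \sum_{j,n,k} D^b_{j,(n,k)}\, M^b_{j,(n,k)} - \epsilon H(M^b) \\
\text{s.t.} \quad & \sum_{n,k} M^b_{j,(n,k)} = p_j^b, \quad \sum_{j} M^b_{j,(n,k)} = \frac{1}{N K},
\end{align*}
where $H(M^b) = -\sum_{j,n,k} M^b_{j,(n,k)} \ln M^b_{j,(n,k)}$.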
- Use ground cost .
- Define the class-to-support cost as
This two-level OT ensures that $C_{b,(n,k)}$ reflects the fine-grained empirical relationship between each base class and each novel support.
4. Adaptive Weights and Parameter Transfer
After solving the high-level OT, the adaptive weight matrix, with entries $w_{b,(n,k)}$, is derived from the optimal transport plan $T^*$.
These weights govern the mixing of base-class statistics for each support example:
\begin{align*}
\mu'_{n,k} &= \sum_{b=1}^{B} w_{b,(n,k)}\, \mu_b \\
\Sigma'_{n,k} &= \sum_{b=1}^{B} w_{b,(n,k)}\, \Sigma_b
\end{align*}
(In practice, the support point itself is added as an additional "class" with weight $1/(B+1)$, and a small scalar is added to the covariance $\Sigma'_{n,k}$ as regularization.)
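The mixing step can be sketched in NumPy as below. Two points are assumptions made for illustration rather than details taken from the source: the weights are obtained by renormalizing each column of the transport plan to sum to one, and the regularization scalar `alpha` is applied to the diagonal of the covariance. The helper name `mix_statistics` is hypothetical, and the extra "support point as a class" term is omitted for brevity.

```python
import numpy as np

def mix_statistics(T, base_means, base_covs, alpha=0.0):
    """Mix base-class means/covariances with per-support weights.

    T:          (B, S) transport plan, S = number of support examples
    base_means: (B, d) base-class means
    base_covs:  (B, d, d) base-class covariances
    alpha:      scalar regularizer (assumed here to act on the diagonal)
    """
    # Assumption: weights are the plan's columns renormalized to sum to 1.
    w = T / T.sum(axis=0, keepdims=True)                 # (B, S)
    means = np.einsum("bs,bd->sd", w, base_means)        # (S, d)
    covs = np.einsum("bs,bde->sde", w, base_covs)        # (S, d, d)
    covs = covs + alpha * np.eye(base_means.shape[1])    # broadcast over S
    return w, means, covs

# Toy example: two base classes, one support example, two feature dimensions.
T = np.array([[0.3], [0.1]])                   # weights become 0.75 and 0.25
base_means = np.array([[0.0, 0.0], [4.0, 8.0]])
base_covs = np.stack([np.eye(2), 3.0 * np.eye(2)])

w, means, covs = mix_statistics(T, base_means, base_covs, alpha=0.5)
# means[0] is [1.0, 2.0]; covs[0] is 2.0 * identity
```

Because the weights are non-negative and sum to one for each support example, every mixed covariance is a convex combination of positive semi-definite matrices and therefore remains a valid covariance.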
5. Algorithmic Workflow and Complexity
The test-time ADC pipeline for a novel episode consists of:
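- Feature extraction: Precompute the base-class statistics $\mu_b$ and $\Sigma_b$ on the base set, and pretrain the classifier that supplies the assignment probabilities $p_j^b$.
- Support transformation: Apply the Tukey power transform to each support feature.
- Base class calibration: For each base class $b$, form its empirical measure and solve the low-level OT for the plan $M^b$ and the cost $C_{b,(n,k)}$.
- High-level OT: Solve for the plan $T^*$ with cost $C$.
- Statistics mixing: Compute the adaptive statistics $\mu'_{n,k}$ and $\Sigma'_{n,k}$.
- Augmentation: Sample synthetic features from $\mathcal{N}(\mu'_{n,k}, \Sigma'_{n,k})$ to augment the support.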
- Classifier training: Fit a simple classifier to the union of support and generated features.
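The computational burden is moderate: each Sinkhorn iteration between discrete measures of sizes $m$ and $n$ costs $O(mn)$. The high-level OT therefore costs $O(B \cdot NK)$ per iteration, and the low-level OT for base class $b$ costs $O(J_b \cdot NK)$ per iteration, where $J_b$ is the number of samples in that class. Empirically, this procedure adds a few seconds per episode, with no fine-tuning of the feature backbone required.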
6. Empirical Evaluation
ADC/H-OT demonstrates consistent improvements over prior calibration methods across standard few-shot classification benchmarks:
| Dataset | Backbone | 1-shot Free-Lunch | 1-shot H-OT | 5-shot Free-Lunch | 5-shot H-OT |
|---|---|---|---|---|---|
| miniImageNet | WRN28 | 68.57% | 69.04% | 82.88% | 84.36% |
| tieredImageNet | WRN28 | 75.10% | 75.91% | 88.42% | 89.33% |
| CUB, CIFAR-FS | — | Comparable gains | — | Comparable gains | — |
Protocols involve 5-way, 1- and 5-shot settings, 10,000 test episodes, and report mean ± 95% confidence intervals. H-OT consistently produces higher accuracy than Free-Lunch baselines.
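For reference, a mean with a 95% confidence interval over episodes is commonly computed with the normal approximation, as in the sketch below. The source does not state the exact procedure, so the formula choice, the helper name `mean_with_ci95`, and the simulated accuracies (75 queries per episode at a 70% success rate) are assumptions used only to make the example runnable.

```python
import numpy as np

def mean_with_ci95(episode_accuracies):
    """Return (mean, half-width) of a normal-approximation 95% interval."""
    acc = np.asarray(episode_accuracies, dtype=float)
    mean = acc.mean()
    # Standard error of the mean, scaled by the 97.5% normal quantile.
    half_width = 1.96 * acc.std(ddof=1) / np.sqrt(acc.size)
    return mean, half_width

# Simulated per-episode accuracies (hypothetical, not results from the paper).
rng = np.random.default_rng(0)
accs = rng.binomial(n=75, p=0.7, size=10_000) / 75

mean, hw = mean_with_ci95(accs)
print(f"{100 * mean:.2f}% +/- {100 * hw:.2f}%")
```

With 10,000 episodes the interval is narrow, which is why differences of under one percentage point between methods can still be resolved.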
In cross-domain transfer (e.g., train on miniImageNet or CIFAR-FS, test on CUB), H-OT outperforms baselines by approximately 2 percentage points in 5-way 1-shot, indicating superior generalization.
Ablation studies support:
- The benefit of low-level OT for adaptive cost learning (substantial boost over fixed-cost metrics).
- Performance robustness and improvement as more base classes are incorporated, unlike prior methods which may degrade.
- Greater sample efficiency: H-OT achieves parity with fewer generated samples per support.
7. Context and Significance
Adaptive Distribution Calibration/H-OT provides a fully differentiable, data-driven solution for transfer-weight learning in few-shot classification. The hierarchical OT structure enables finely resolved, context-dependent adaptation at both the class and sample levels, exceeding the capabilities of heuristic or fixed-weight calibration. As it does not require backbone retraining or fine-tuning, ADC is suitable as a plug-and-play module for few-shot inference.
The empirical results and ablation analyses indicate that ADC generalizes well within and across domains, improves accuracy over Free-Lunch by roughly 0.5 to 1.5 percentage points on miniImageNet and tieredImageNet, and does so with modest computation. The use of entropic OT and the Sinkhorn algorithm enables both scalability and differentiability, making ADC compatible with contemporary machine learning pipelines and optimization approaches (Guo et al., 2022).