T-Similarity: Ensemble-Diversity Confidence for Self-Training
- The method introduces T-similarity to compute reliable confidence scores, enhancing pseudo-label quality through ensemble diversity.
- It optimizes an ensemble of linear classifiers by balancing cross-entropy minimization on labeled data with diversity promotion on unlabeled data.
- Empirical results demonstrate that T-similarity significantly improves calibration and accuracy, especially under distribution shifts.
Below is a self‐contained exposition of the proposed T-similarity method for self-training and its relation to Self-Training with Classifier Disagreement (SCD), organized into seven parts:
- Self‐training loop
- Initialization
- We have a small labeled set (X_ℓ, y_ℓ) and a large unlabeled set X_u.
- A neural network feature extractor is randomly initialized (or pre‐trained).
- On top of the fixed feature extractor we initialize
- one “prediction” head for final classification (trained by standard cross‐entropy), and
- an ensemble of M linear “confidence” heads.
- Ensemble construction
- Each linear head h_m (m = 1, …, M) maps the frozen features to class logits; a softmax gives its output distribution p_m(x).
- We fit the ensemble by minimizing a joint objective
  (1/M) Σ_m Σ_{(x,y)∈X_ℓ} CE(p_m(x), y) + γ · (1/|X_u|) Σ_{x∈X_u} s_T(x),
  where the second term (driving the average T-similarity down) maximizes prediction diversity on X_u.
- Pseudo‐labeling via T-similarity
- For each x ∈ X_u, compute
  s_T(x) = (1/(M(M−1))) Σ_{m≠m′} ⟨p_m(x), p_{m′}(x)⟩,
  the average pairwise cosine‐like similarity between the classifiers’ output distributions.
- Use s_T(x) as the confidence score and compare it to a threshold θ (fixed or adaptive).
- If s_T(x) ≥ θ, assign the prediction head’s class ŷ(x) as pseudo‐label.
- Incorporation and retraining
- Move each selected (x, ŷ) into the labeled set and remove x from X_u.
- Retrain the prediction head on the augmented labeled set; re-optimize the ensemble {h_m}.
- Repeat for a fixed number of rounds or until X_u is exhausted.
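The loop above can be sketched as a generic wrapper. This is a minimal sketch assuming user-supplied `fit`, `predict`, and `confidence` callables; all names here are illustrative, not the paper's API:

```python
import numpy as np

def self_train(fit, predict, confidence, X_l, y_l, X_u, theta=0.8, rounds=5):
    """Generic wrapper self-training loop. `fit(X, y)` returns a model,
    `predict(model, x)` a label, `confidence(model, x)` a score in [0, 1];
    all three are user-supplied (hypothetical names, not the paper's API)."""
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), list(X_u)
    for _ in range(rounds):
        model = fit(X_l, y_l)
        if not X_u:
            break
        scores = np.array([confidence(model, x) for x in X_u])
        keep = scores >= theta
        if not keep.any():
            break  # nothing confident enough this round
        new_X = [x for x, k in zip(X_u, keep) if k]
        new_y = [predict(model, x) for x in new_X]  # pseudo-labels
        X_l = np.vstack([X_l] + new_X)
        y_l = np.concatenate([y_l, new_y])
        X_u = [x for x, k in zip(X_u, keep) if not k]
    return X_l, y_l
```

Any confidence function plugs in here; the paper's point is that the T-similarity score is a better choice than the usual max-softmax probability.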
- Mathematical definition of T-similarity
- Let {h_m}_{m=1}^M be an ensemble of classifiers whose outputs p_m(x) lie in the probability simplex Δ_{K−1}. Define
  s_T(x) = (1/(M(M−1))) Σ_{m≠m′} ⟨p_m(x), p_{m′}(x)⟩.
Equivalently, since each p_m(x) is a probability vector,
  s_T(x) = (1/(M(M−1))) · ( ‖Σ_m p_m(x)‖² − Σ_m ‖p_m(x)‖² ).
(Proposition: 0 ≤ s_T(x) ≤ 1.)
Intuition: large s_T(x) means low disagreement → high confidence; small s_T(x) means the classifiers disperse → low confidence.
Pseudo‐labeling policies compare s_T(x) against a threshold θ (fixed, curriculum‐adaptive, or transductive‐bound–based).
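A minimal NumPy rendering of this definition (the function name is ours), showing both the pairwise-sum form and the [0, 1] range:

```python
import numpy as np

def t_similarity(probs):
    """T-similarity of one input: average pairwise dot product between the
    M heads' class distributions. `probs` has shape (M, K). Result in [0, 1]."""
    M = probs.shape[0]
    dots = probs @ probs.T                 # (M, M); entry (m, m') = <p_m, p_m'>
    off_diag = dots.sum() - np.trace(dots) # sum over pairs m != m'
    return off_diag / (M * (M - 1))

# Heads that agree -> high similarity; heads that disperse -> low similarity.
agree = np.array([[0.9, 0.1], [0.88, 0.12], [0.92, 0.08]])
disagree = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

On these toy inputs the agreeing heads score about 0.82 and the dispersed heads about 0.33, consistent with the 0 ≤ s_T(x) ≤ 1 proposition.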
High‐level pseudocode follows the generic self-training loop above; the three ψ‐policies differ only in how confident points are selected:
- PL_θ: select all x with s_T(x)≥θ.
- CSTA_Δ: at iteration t, choose θ_t as a suitable quantile of {s_T(x)} so that |X_pl| ≈ Δ·|X_u|.
- MSTA: pick class‐specific thresholds by minimizing a transductive‐error bound.
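The first two selection policies are easy to sketch in NumPy (function names are ours; MSTA's bound-minimizing thresholds are omitted because they depend on the paper's transductive bound):

```python
import numpy as np

def select_fixed(scores, theta=0.8):
    """PL_theta: keep every unlabeled index whose confidence is >= theta."""
    return np.flatnonzero(scores >= theta)

def select_curriculum(scores, delta=0.4):
    """CSTA_Delta sketch: set the threshold at the (1 - delta) quantile so
    that roughly a delta-fraction of the unlabeled pool is pseudo-labeled."""
    theta = np.quantile(scores, 1.0 - delta)
    return np.flatnonzero(scores >= theta)
```

Both take the vector of s_T scores over the unlabeled pool and return the indices to pseudo-label.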
- Theoretical Analysis (binary case, linear heads)
- Objective (Problem (P)) for the stacked head weights W = (w_1, …, w_M):
  L(W) = Σ_m Σ_{i≤n_ℓ} (y_i − w_m⊤x_i)² + γ Σ_{m≠m′} Σ_{j≤n_u} (w_m⊤x_j)(w_{m′}⊤x_j),
  a least‐squares (LS-SVM-style) fit on labeled data plus a pairwise‐agreement penalty on unlabeled data.
- Assumption A: a regularity condition required of every head m (see the source paper for the exact statement).
- Proposition 4.3 (loss properties): under Assumption A, L is strictly convex and coercive → unique global minimizer W*.
The stationarity condition ∇L(W) = 0 reduces to a linear system in W (Proposition 4.4), so W* has a closed form.
Theorem 4.5 (lower bound on diversity): let W* be the unique solution and assume the labeled examples are classified with positive margins. Then the diversity attained on the unlabeled data is lower‐bounded by a quantity that grows with those margins.
In particular the diversity at W* is strictly positive, and high diversity ↔ large margins on labeled data.
Corollary 4.6 (role of representation): if all labeled features have comparable norms, then the diversity lower bound scales with λ_min(X_ℓ⊤X_ℓ).
Thus, spreading labeled features evenly (large λ_min) boosts diversity.
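As a sanity check on this kind of objective, the following toy sketch runs plain gradient descent on a least-squares fit plus a pairwise-agreement penalty. All sizes, seeds, and rates are illustrative choices of ours, not the paper's setup:

```python
import numpy as np

# Toy instance of a Problem-(P)-style objective: least-squares fit of M
# linear heads on labeled data plus a gamma-weighted pairwise-agreement
# penalty on unlabeled data (illustrative sizes and hyperparameters).
rng = np.random.default_rng(0)
M, d = 3, 2
X_l = rng.normal(size=(8, d))
y = np.sign(X_l[:, 0])             # synthetic labels in {-1, +1}
X_u = rng.normal(size=(20, d))
gamma, lr = 0.5, 0.01

def objective(W):
    fit = ((X_l @ W.T) - y[:, None]) ** 2       # squared loss, per head
    P = X_u @ W.T                                # head outputs on unlabeled
    G = P.T @ P
    agree = (G.sum() - np.trace(G)) / len(X_u)   # mean pairwise agreement
    return fit.sum() + gamma * agree

W = 0.1 * rng.normal(size=(M, d))
start = objective(W)
for _ in range(200):                             # plain gradient descent
    R = X_l @ W.T - y[:, None]                   # labeled residuals
    P = X_u @ W.T
    S = P.sum(axis=1, keepdims=True)
    grad = 2 * (R.T @ X_l) + (2 * gamma / len(X_u)) * ((S - P).T @ X_u)
    W -= lr * grad
end = objective(W)
```

With γ small relative to the labeled fit term, the objective stays well-behaved and gradient descent decreases it, matching the convexity/coercivity discussion above.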
Experimental setup
- Datasets (13 SSL benchmarks):
- Biological: Cod-RNA, DNA, Protein, Splice
- Vision: COIL-20, Digits, MNIST
- Tabular: DryBean, Mushrooms, Phishing, Rice, Svmguide1
- Time series: HAR
- Labeling regimes:
- IID: random class‐balanced sampling.
- SSB: per‐class selection biased along the first principal component (bias‐strength hyperparameter tuned per dataset).
- Architecture & training:
- A 3-layer MLP feature extractor.
- Prediction head + ensemble of M = 5 linear confidence heads.
- Optimizer: Adam, 5 epochs × 100 iterations per epoch.
- Diversity strength γ > 0; cross-entropy for the supervised loss, LS-SVM-style loss for the ensemble.
- Baselines:
- ERM (supervised only)
- PL_{θ=0.8} (fixed threshold 0.8)
- CSTA_{Δ=0.4} (curriculum)
- MSTA (transductive‐bound). Each with softmax‐confidence vs T-similarity.
- Metrics: test accuracy (%), calibration (Expected Calibration Error).
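The SSB labeling regime above can be imitated with a small sketch: per-class sampling weights that decay along the first principal component (function and parameter names are ours):

```python
import numpy as np

def ssb_sample(X, y, n_per_class, strength=2.0, rng=None):
    """Sketch of the SSB labeling regime: within each class, draw labeled
    points with probability that decays along the first principal component,
    so the labeled set is biased rather than IID. `strength` mimics the
    bias hyperparameter (illustrative names, not the paper's code)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[0]                        # score on the first PC
    chosen = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        w = np.exp(-strength * (proj[idx] - proj[idx].min()))
        w /= w.sum()                         # per-class sampling weights
        chosen.extend(rng.choice(idx, size=n_per_class, replace=False, p=w))
    return np.array(chosen)
```

Setting `strength=0` recovers class-balanced IID sampling, which is the control regime in the experiments.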
- Quantitative results & ablations
- Failure of softmax under SSB: all four self-training schemes drop by up to 30 points relative to IID; e.g., on Mushrooms the softmax policies even underperform ERM.
- T-similarity gains under SSB:
- PL_{θ=0.8}: improves on 8/13 datasets
- CSTA_{Δ=0.4} and MSTA: improve on 11/13 datasets
- Gains of up to ~18 points in the worst‐case SSB settings.
- Under IID, T-similarity performs on par with softmax (no degradation).
- Confidence‐distribution plots: T-similarity concentrates high confidence on correct predictions, low on errors; softmax remains overconfident for both.
- Calibration (ECE) vs γ: imposing diversity (γ>0) steadily improves calibration in both IID & SSB.
- Ablations:
- Pseudo‐label threshold θ∈{0.7,0.8,0.9,0.95}: T-sim robust across θ, softmax more brittle under SSB.
- Labeled set size n_ℓ∈[20…2000]: T-sim outperforms softmax for small n_ℓ under SSB, matches softmax as n_ℓ grows.
- Diversity weight γ∈{0,0.5,1,1.5,2}: any γ>0 helps under SSB; little harm under IID.
- Ensemble size M∈{2,5,10}: M=5 is a good compromise; gains persist for larger M but with diminishing returns.
- Practical recommendations & insights
- Plug‐and‐play: replace softmax confidence with T-similarity in any wrapper self-training method.
- Modest overhead: only M=5 linear heads on top of fixed features.
- Choose γ>0 (even γ=0.5) to promote diversity.
- Ensemble size M≃5–10 suffices.
- Under distribution shift (SSB / covariate shift), T-similarity hugely stabilizes pseudo‐label quality.
- Ensure the learned feature space has spread (large λ_min of X_ℓ⊤X_ℓ) to maximize ensemble coverage (insight from Corollary 4.6), akin to encouraging uniform embeddings.
- Connections to SCD: by explicitly encouraging classifier disagreement (negative‐correlation loss) on unlabeled data, T-similarity generalizes the SCD idea from two classifiers to a multi‐member ensemble, yielding a well-calibrated confidence measure for pseudo-labeling.
References to all equations and theorems are as numbered in the source paper.