T-Similarity: Ensemble-Diversity Confidence for Self-Training
- The method introduces T-similarity to compute reliable confidence scores, enhancing pseudo-label quality through ensemble diversity.
- It optimizes an ensemble of linear classifiers by balancing cross-entropy minimization on labeled data with diversity promotion on unlabeled data.
- Empirical results demonstrate that T-similarity significantly improves calibration and accuracy, especially under distribution shifts.
Below is a self‐contained exposition of the proposed T-similarity method for self-training and its relation to Self-Training with Classifier Disagreement (SCD), organized into seven parts:
- Self‐training loop
- Initialization
- We have a small labeled set (X_ℓ, y_ℓ) and a large unlabeled set X_u.
- A neural network feature extractor is randomly initialized (or pre‐trained).
- On top of the fixed feature extractor we initialize
- one “prediction” head for final classification (trained by standard cross‐entropy), and
- an ensemble of M linear “confidence” heads.
- Ensemble construction
- Each linear head h_m (m = 1, …, M) maps the frozen features to class logits; a softmax gives its output distribution p_m(x).
- We fit the ensemble by minimizing a joint objective
  (1/M) Σ_m Σ_{(x,y)∈X_ℓ} CE(p_m(x), y) + γ · (1/|X_u|) Σ_{x∈X_u} s_T(x),
  where the second term (driving the average T-similarity down) maximizes prediction diversity on X_u.
- Pseudo‐labeling via T-similarity
- For each x ∈ X_u, compute
  s_T(x) = (1/(M(M−1))) Σ_{m≠m′} ⟨p_m(x), p_{m′}(x)⟩,
  the average pairwise cosine‐like similarity between the classifiers’ output distributions.
- Use s_T(x) as the confidence score and compare it to a threshold θ (fixed or adaptive).
- If s_T(x) ≥ θ, assign the prediction head’s class ŷ(x) as pseudo‐label.
- Incorporation and retraining
- Move each selected (x, ŷ) into the labeled set and remove x from X_u.
- Retrain the prediction head on the augmented labeled set; re-optimize the ensemble {h_m}.
- Repeat for a fixed number of rounds or until X_u is exhausted.
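The loop above can be sketched as a generic wrapper. This is a minimal sketch assuming user-supplied `fit`, `predict`, and `confidence` callables; all names here are illustrative, not the paper's API:

```python
import numpy as np

def self_train(fit, predict, confidence, X_l, y_l, X_u, theta=0.8, rounds=5):
    """Generic wrapper self-training loop. `fit(X, y)` returns a model,
    `predict(model, x)` a label, `confidence(model, x)` a score in [0, 1];
    all three are user-supplied (hypothetical names, not the paper's API)."""
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), list(X_u)
    for _ in range(rounds):
        model = fit(X_l, y_l)
        if not X_u:
            break
        scores = np.array([confidence(model, x) for x in X_u])
        keep = scores >= theta
        if not keep.any():
            break  # nothing confident enough this round
        new_X = [x for x, k in zip(X_u, keep) if k]
        new_y = [predict(model, x) for x in new_X]  # pseudo-labels
        X_l = np.vstack([X_l] + new_X)
        y_l = np.concatenate([y_l, new_y])
        X_u = [x for x, k in zip(X_u, keep) if not k]
    return X_l, y_l
```

Any confidence function plugs in here; the paper's point is that the T-similarity score is a better choice than the usual max-softmax probability.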
- Mathematical definition of T-similarity
- Let {h_m}_{m=1}^M be an ensemble of classifiers whose outputs p_m(x) lie in the probability simplex Δ_{K−1}. Define
  s_T(x) = (1/(M(M−1))) Σ_{m≠m′} ⟨p_m(x), p_{m′}(x)⟩.
Equivalently, since each p_m(x) is a probability vector,
  s_T(x) = (1/(M(M−1))) · ( ‖Σ_m p_m(x)‖² − Σ_m ‖p_m(x)‖² ).
(Proposition: 0 ≤ s_T(x) ≤ 1.)
Intuition: large s_T(x) means low disagreement → high confidence; small s_T(x) means the classifiers disperse → low confidence.
Pseudo‐labeling policies compare s_T(x) against a threshold θ (fixed, curriculum‐adaptive, or transductive‐bound–based).
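A minimal NumPy rendering of this definition (the function name is ours), showing both the pairwise-sum form and the [0, 1] range:

```python
import numpy as np

def t_similarity(probs):
    """T-similarity of one input: average pairwise dot product between the
    M heads' class distributions. `probs` has shape (M, K). Result in [0, 1]."""
    M = probs.shape[0]
    dots = probs @ probs.T                 # (M, M); entry (m, m') = <p_m, p_m'>
    off_diag = dots.sum() - np.trace(dots) # sum over pairs m != m'
    return off_diag / (M * (M - 1))

# Heads that agree -> high similarity; heads that disperse -> low similarity.
agree = np.array([[0.9, 0.1], [0.88, 0.12], [0.92, 0.08]])
disagree = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

On these toy inputs the agreeing heads score about 0.82 and the dispersed heads about 0.33, consistent with the 0 ≤ s_T(x) ≤ 1 proposition.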
High‐level pseudocode follows the generic self-training loop above; the three ψ‐policies differ only in how confident points are selected:
- PL_θ: select all x with s_T(x)≥θ.
- CSTA_Δ: at iteration t, choose θ_t as a suitable quantile of {s_T(x)} so that |X_pl| ≈ Δ·|X_u|.
- MSTA: pick class‐specific thresholds by minimizing a transductive‐error bound.
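The first two selection policies are easy to sketch in NumPy (function names are ours; MSTA's bound-minimizing thresholds are omitted because they depend on the paper's transductive bound):

```python
import numpy as np

def select_fixed(scores, theta=0.8):
    """PL_theta: keep every unlabeled index whose confidence is >= theta."""
    return np.flatnonzero(scores >= theta)

def select_curriculum(scores, delta=0.4):
    """CSTA_Delta sketch: set the threshold at the (1 - delta) quantile so
    that roughly a delta-fraction of the unlabeled pool is pseudo-labeled."""
    theta = np.quantile(scores, 1.0 - delta)
    return np.flatnonzero(scores >= theta)
```

Both take the vector of s_T scores over the unlabeled pool and return the indices to pseudo-label.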
- Theoretical Analysis (binary case, linear heads)
- Objective (Problem (P)) for the stacked head weights W = (w_1, …, w_M):
  L(W) = Σ_m Σ_{i≤n_ℓ} (y_i − w_m⊤x_i)² + γ Σ_{m≠m′} Σ_{j≤n_u} (w_m⊤x_j)(w_{m′}⊤x_j),
  a least‐squares (LS-SVM-style) fit on labeled data plus a pairwise‐agreement penalty on unlabeled data.
- Assumption A: a regularity condition required of every head m (see the source paper for the exact statement).
- Proposition 4.3 (loss properties): under Assumption A, L is strictly convex and coercive → unique global minimizer W*.
The stationarity condition ∇L(W) = 0 reduces to a linear system in W (Proposition 4.4), so W* has a closed form.
Theorem 4.5 (lower bound on diversity): let W* be the unique solution and assume the labeled examples are classified with positive margins. Then the diversity attained on the unlabeled data is lower‐bounded by a quantity that grows with those margins.
In particular the diversity at W* is strictly positive, and high diversity ↔ large margins on labeled data.
Corollary 4.6 (role of representation): if all labeled features have comparable norms, then the diversity lower bound scales with λ_min(X_ℓ⊤X_ℓ).
Thus, spreading labeled features evenly (large λ_min) boosts diversity.
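As a sanity check on this kind of objective, the following toy sketch runs plain gradient descent on a least-squares fit plus a pairwise-agreement penalty. All sizes, seeds, and rates are illustrative choices of ours, not the paper's setup:

```python
import numpy as np

# Toy instance of a Problem-(P)-style objective: least-squares fit of M
# linear heads on labeled data plus a gamma-weighted pairwise-agreement
# penalty on unlabeled data (illustrative sizes and hyperparameters).
rng = np.random.default_rng(0)
M, d = 3, 2
X_l = rng.normal(size=(8, d))
y = np.sign(X_l[:, 0])             # synthetic labels in {-1, +1}
X_u = rng.normal(size=(20, d))
gamma, lr = 0.5, 0.01

def objective(W):
    fit = ((X_l @ W.T) - y[:, None]) ** 2       # squared loss, per head
    P = X_u @ W.T                                # head outputs on unlabeled
    G = P.T @ P
    agree = (G.sum() - np.trace(G)) / len(X_u)   # mean pairwise agreement
    return fit.sum() + gamma * agree

W = 0.1 * rng.normal(size=(M, d))
start = objective(W)
for _ in range(200):                             # plain gradient descent
    R = X_l @ W.T - y[:, None]                   # labeled residuals
    P = X_u @ W.T
    S = P.sum(axis=1, keepdims=True)
    grad = 2 * (R.T @ X_l) + (2 * gamma / len(X_u)) * ((S - P).T @ X_u)
    W -= lr * grad
end = objective(W)
```

With γ small relative to the labeled fit term, the objective stays well-behaved and gradient descent decreases it, matching the convexity/coercivity discussion above.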
Experimental setup
- Datasets (13 SSL benchmarks):
- Biological: Cod-RNA, DNA, Protein, Splice
- Vision: COIL-20, Digits, MNIST
- Tabular: DryBean, Mushrooms, Phishing, Rice, Svmguide1
- Time series: HAR
- Labeling regimes:
- IID: random class‐balanced sampling.
- SSB: per‐class selection biased along the first principal component (bias‐strength hyperparameter tuned per dataset).
- Architecture & training:
- A 3-layer MLP feature extractor.
- Prediction head + ensemble of M = 5 linear confidence heads.
- Optimizer: Adam, 5 epochs × 100 iterations per epoch.
- Diversity strength γ > 0; cross-entropy for the supervised loss, LS-SVM-style loss for the ensemble.
- Baselines:
- ERM (supervised only)
- PL_{θ=0.8} (fixed threshold 0.8)
- CSTA_{Δ=0.4} (curriculum)
- MSTA (transductive‐bound). Each with softmax‐confidence vs T-similarity.
- Metrics: test accuracy (%), calibration (Expected Calibration Error).
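The SSB labeling regime above can be imitated with a small sketch: per-class sampling weights that decay along the first principal component (function and parameter names are ours):

```python
import numpy as np

def ssb_sample(X, y, n_per_class, strength=2.0, rng=None):
    """Sketch of the SSB labeling regime: within each class, draw labeled
    points with probability that decays along the first principal component,
    so the labeled set is biased rather than IID. `strength` mimics the
    bias hyperparameter (illustrative names, not the paper's code)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[0]                        # score on the first PC
    chosen = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        w = np.exp(-strength * (proj[idx] - proj[idx].min()))
        w /= w.sum()                         # per-class sampling weights
        chosen.extend(rng.choice(idx, size=n_per_class, replace=False, p=w))
    return np.array(chosen)
```

Setting `strength=0` recovers class-balanced IID sampling, which is the control regime in the experiments.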
- Quantitative results & ablations
- Failure of softmax under SSB: all four self-training schemes drop by up to 30 points relative to IID; e.g., on Mushrooms the softmax policies even underperform ERM.
- T-similarity gains under SSB:
- PL_{θ=0.8}: improves on 8/13 datasets
- CSTA_{Δ=0.4} and MSTA: improve on 11/13 datasets
- Gains of up to ~18 points in the worst‐case SSB settings.
- Under IID, T-similarity performs on par with softmax (no degradation).
- Confidence‐distribution plots: T-similarity concentrates high confidence on correct predictions, low on errors; softmax remains overconfident for both.
- Calibration (ECE) vs γ: imposing diversity (γ>0) steadily improves calibration in both IID & SSB.
- Ablations:
- Pseudo‐label threshold θ∈{0.7,0.8,0.9,0.95}: T-sim robust across θ, softmax more brittle under SSB.
- Labeled set size n_ℓ∈[20…2000]: T-sim outperforms softmax for small n_ℓ under SSB, matches softmax as n_ℓ grows.
- Diversity weight γ∈{0,0.5,1,1.5,2}: any γ>0 helps under SSB; little harm under IID.
- Ensemble size M∈{2,5,10}: M=5 is a good compromise; gains persist for larger M but with diminishing returns.
- Practical recommendations & insights
- Plug‐and‐play: replace softmax confidence with T-similarity in any wrapper self-training method.
- Modest overhead: only M=5 linear heads on top of fixed features.
- Choose γ>0 (even γ=0.5) to promote diversity.
- Ensemble size M≃5–10 suffices.
- Under distribution shift (SSB / covariate shift), T-similarity hugely stabilizes pseudo‐label quality.
- Ensure the learned feature space has spread (large λ_min of X_ℓ⊤X_ℓ) to maximize ensemble coverage (insight from Corollary 4.6), akin to encouraging uniform embeddings.
- Connections to SCD: by explicitly encouraging classifier disagreement (negative‐correlation loss) on unlabeled data, T-similarity generalizes the SCD idea from two classifiers to a multi‐member ensemble, yielding a well-calibrated confidence measure for pseudo-labeling.
References to all equations and theorems are as numbered in the source paper.