
Retrieval-Augmented ICL Scaling Law

Updated 9 April 2026
  • The method introduces T-similarity to compute reliable confidence scores, enhancing pseudo-label quality through ensemble diversity.
  • It optimizes an ensemble of linear classifiers by balancing cross-entropy minimization on labeled data with diversity promotion on unlabeled data.
  • Empirical results demonstrate that T-similarity significantly improves calibration and accuracy, especially under distribution shifts.

Below is a self‐contained exposition of the proposed T-similarity method for self-training and its relation to Self-Training with Classifier Disagreement (SCD). We follow the seven points requested:

  1. Self‐training loop
    • Initialization
      • We have a small labeled set (\mathbf{X}_\ell, y_\ell) and a large unlabeled set \mathbf{X}_u.
      • A neural network feature extractor f_\theta(\cdot) is randomly initialized (or pre‐trained).
      • On top of the fixed feature extractor we initialize
        • one “prediction” head h_{\rm pred} for final classification (trained by standard cross‐entropy), and
        • an ensemble \mathcal T=\{h_m\}_{m=1}^M of M linear “confidence” heads.
    • Ensemble construction
      • Each linear head h_m is a weight vector \omega_m\in\mathbb R^d predicting logits h_m(x)=\omega_m^\top f_\theta(x).
      • We fit \{\omega_m\} by minimizing a joint objective

        \min_{\omega_1,\dots,\omega_M}\;\frac{1}{M}\sum_{m=1}^{M}\ell_{\rm CE}\big(h_m;\mathbf{X}_\ell,y_\ell\big)\;+\;\gamma\,\frac{1}{|\mathbf{X}_u|}\sum_{x\in\mathbf{X}_u} s_{\mathcal T}(x),

        where minimizing the second term (the average T-similarity on \mathbf{X}_u, weighted by \gamma>0) maximizes prediction diversity on \mathbf{X}_u.
    • Pseudo‐labeling via T-similarity
      • For each x\in\mathbf{X}_u, compute s_{\mathcal T}(x), the average pairwise cosine‐like similarity between the classifiers’ output distributions (defined formally in point 2).
      • Use s_{\mathcal T}(x) as a confidence score and compare it to a threshold \theta (fixed or adaptive).
      • If s_{\mathcal T}(x)\ge\theta, assign the prediction of h_{\rm pred} on x as its pseudo‐label.
    • Incorporation and retraining
      • Move each selected x, together with its pseudo‐label, into the labeled set and remove it from \mathbf{X}_u.
      • Retrain f_\theta(\cdot) and h_{\rm pred} on the augmented labeled set; re-optimize the ensemble \mathcal T.
      • Repeat for a fixed number of rounds or until \mathbf{X}_u is exhausted.
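The joint objective described above, cross‐entropy on labeled data plus a γ-weighted similarity penalty on unlabeled data whose minimization promotes ensemble diversity, can be sketched as a loss function. This is an illustrative NumPy sketch, not the paper's implementation; the (M, d, K) weight layout and softmax heads are my own assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def joint_objective(W, feats_l, y_l, feats_u, gamma=1.0):
    """Diversity-promoting ensemble objective (illustrative sketch).

    W:       (M, d, K) weights of the M linear confidence heads (assumed layout)
    feats_l: (n_l, d) labeled features, y_l: (n_l,) integer labels
    feats_u: (n_u, d) unlabeled features
    Returns cross-entropy on labeled data + gamma * average pairwise
    similarity on unlabeled data; minimizing the similarity term
    maximizes prediction diversity.
    """
    M = W.shape[0]
    probs_l = softmax(np.einsum('nd,mdk->mnk', feats_l, W))  # (M, n_l, K)
    ce = -np.mean(np.log(probs_l[:, np.arange(len(y_l)), y_l] + 1e-12))
    probs_u = softmax(np.einsum('nd,mdk->mnk', feats_u, W))  # (M, n_u, K)
    # Average pairwise inner product of the heads' output distributions.
    total = np.einsum('mnk,pnk->n', probs_u, probs_u)        # all ordered pairs
    diag = np.einsum('mnk,mnk->n', probs_u, probs_u)         # pairs with m == m'
    sim = (total - diag) / (M * (M - 1))                     # s_T per point
    return ce + gamma * sim.mean()
```

Since the similarity term is non-negative, increasing γ can only increase the objective value at fixed weights; the trade-off is resolved by the optimizer pushing the heads to disagree on \mathbf{X}_u while still fitting the labeled data.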

  2. Mathematical definition of T-similarity
    • Let \mathcal T=\{h_m\}_{m=1}^M be an ensemble of classifiers whose outputs lie in the probability simplex \Delta_K. Define

      s_{\mathcal T}(x) \;=\; \frac{1}{M(M-1)}\sum_{m=1}^{M}\sum_{m'\neq m}\big\langle h_m(x),\,h_{m'}(x)\big\rangle.

    • Equivalently, if each h_m(x) is a probability vector,

      s_{\mathcal T}(x) \;=\; \frac{1}{M(M-1)}\sum_{m\neq m'}\sum_{k=1}^{K} h_m(x)_k\, h_{m'}(x)_k.

      (Proposition: 0\le s_{\mathcal T}(x)\le 1.)

    • Intuition: large s_{\mathcal T}(x) means low disagreement → high confidence; low s_{\mathcal T}(x) means the classifiers disperse → low confidence.

    • Pseudo‐labeling policies compare s_{\mathcal T}(x) against a threshold \theta (fixed, curriculum‐adaptive, or transductive‐bound‐based).
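The T-similarity defined in point 2, i.e. the average pairwise inner product between the ensemble members' output distributions, takes a few lines to compute. A minimal NumPy sketch (the (M, N, K) array layout is my own convention, not the paper's code):

```python
import numpy as np

def t_similarity(probs: np.ndarray) -> np.ndarray:
    """Average pairwise inner product between the M classifiers'
    predicted distributions.

    probs: array of shape (M, N, K): M ensemble members, N samples,
           K classes; each probs[m, i] lies in the probability simplex.
    Returns: array of shape (N,) with s_T(x) in [0, 1] per sample.
    """
    M = probs.shape[0]
    # Sum over all ordered pairs (m, m'), then remove the diagonal m == m'.
    total = np.einsum('mnk,pnk->n', probs, probs)
    diag = np.einsum('mnk,mnk->n', probs, probs)
    return (total - diag) / (M * (M - 1))
```

For example, M identical one-hot predictions give s_T = 1 (full agreement), while two heads placing all mass on different classes give s_T = 0 (maximal disagreement), matching the stated bounds.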

  3. High‐level pseudocode
    The three pseudo‐labeling policies differ only in their selection rule ψ:

    • PL_θ: select all x with s_T(x) ≥ θ.
    • CSTA_Δ: at iteration t, choose θ_t as a suitable quantile of {s_T(x)} so as to enforce |X_pl| ≈ Δ·|X_u|.
    • MSTA: pick class‐specific thresholds by minimizing a transductive‐error bound.
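The wrapper loop and the first two selection policies can be sketched as follows. The helper signatures (fit, predict, confidence) are hypothetical placeholders, and MSTA's bound-minimizing class-specific thresholds are omitted; the loop is agnostic to whether confidence is softmax max-probability or T-similarity.

```python
import numpy as np

def select_fixed(scores, theta=0.8):
    """PL_theta: keep every point whose confidence score >= theta."""
    return scores >= theta

def select_quantile(scores, delta=0.4):
    """CSTA_delta: set theta_t to a quantile of the current scores so that
    roughly a delta-fraction of the unlabeled pool is selected."""
    theta_t = np.quantile(scores, 1.0 - delta)
    return scores >= theta_t

def self_training(X_l, y_l, X_u, fit, predict, confidence,
                  select=select_fixed, max_rounds=10):
    """Generic wrapper self-training loop (illustrative sketch).

    fit(X, y) trains the model on the current labeled set; predict(X)
    returns hard labels from the prediction head; confidence(X) returns
    one score per point.
    """
    for _ in range(max_rounds):
        if len(X_u) == 0:
            break
        fit(X_l, y_l)
        mask = select(confidence(X_u))
        if not mask.any():
            break
        # Move confident points (with their pseudo-labels) into the labeled set.
        X_l = np.concatenate([X_l, X_u[mask]])
        y_l = np.concatenate([y_l, predict(X_u[mask])])
        X_u = X_u[~mask]
    fit(X_l, y_l)
    return X_l, y_l
```

Swapping `select=select_quantile` reproduces the curriculum behavior: the effective threshold adapts each round to the current score distribution instead of staying fixed.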
  4. Theoretical analysis (binary case, linear heads)
    • Objective (Problem (P)): a least‐squares (LS-SVM‐style) loss on the labeled data plus the γ-weighted diversity term on the unlabeled data, optimized jointly over the ensemble weights W=(\omega_1,\dots,\omega_M) (Problem (P) in the paper).
    • Assumption A: a regularity condition on each head m (as stated in the paper) ensuring the problem is well posed.
    • Proposition 4.3 (loss properties): under Assumption A, the objective L(W) is strictly convex and coercive → unique global minimizer W*.
    • Stationary‐point equation ∇L(W)=0 reduces to a linear system in W (Proposition 4.4).

  • Theorem 4.5 (lower bound on diversity): let W* be the solution of Problem (P). Under an additional margin condition (as stated in the paper), the ensemble diversity at W* is lower‐bounded by a quantity that grows with the margins achieved on the labeled data. In particular, high diversity goes hand in hand with large margins on labeled data.

  • Corollary 4.6 (role of representation): the diversity lower bound scales with \lambda_{\min}(\mathbf{X}_\ell^\top\mathbf{X}_\ell), the smallest eigenvalue of the labeled‐feature Gram matrix. Thus, spreading labeled features evenly (large \lambda_{\min}) boosts diversity.

  5. Experimental setup

    • Datasets (13 SSL benchmarks):
      • Biological: Cod-RNA, DNA, Protein, Splice
      • Vision: COIL-20, Digits, MNIST
      • Tabular: DryBean, Mushrooms, Phishing, Rice, Svmguide1
      • Time series: HAR
    • Labeling regimes:
      • IID: random class‐balanced sampling.
      • SSB: per‐class selection biased along the first principal component (bias‐strength hyperparameter tuned per dataset).
    • Architecture & training:
      • A 3-layer MLP feature extractor f_\theta.
      • Prediction head h_{\rm pred} plus an ensemble of M=5 linear heads.
      • Optimizer: Adam, 5 epochs × 100 iterations per epoch.
      • Diversity strength γ; cross-entropy loss on the labeled (supervised) data, LS-SVM-style loss on the ensemble.
    • Baselines:
      • ERM (supervised only)
      • PL_{θ=0.8} (fixed threshold 0.8)
      • CSTA_{Δ=0.4} (curriculum)
      • MSTA (transductive‐bound). Each policy is run with softmax confidence vs. T-similarity.
    • Metrics: test accuracy (%), calibration (Expected Calibration Error).
  6. Quantitative results & ablations
    • Failure of softmax under SSB: all four self-training schemes drop by up to 30 points vs IID; e.g. on Mushrooms softmax policies even underperform ERM.
    • T-similarity gains under SSB:
      • PL_{θ=0.8}: +8/13 datasets
      • CSTA_{Δ=0.4}, MSTA: +11/13 datasets
      • Improvements up to ~18 points in worst‐case SSB.
    • Under IID, T-similarity performs on par with softmax (no degradation).
    • Confidence‐distribution plots: T-similarity concentrates high confidence on correct predictions, low on errors; softmax remains overconfident for both.
    • Calibration (ECE) vs γ: imposing diversity (γ>0) steadily improves calibration in both IID & SSB.
    • Ablations:
      • Pseudo‐label threshold θ∈{0.7,0.8,0.9,0.95}: T-sim robust across θ, softmax more brittle under SSB.
      • Labeled set size n_ℓ∈[20…2000]: T-sim outperforms softmax for small n_ℓ under SSB, matches softmax as n_ℓ grows.
      • Diversity weight γ∈{0,0.5,1,1.5,2}: any γ>0 helps under SSB; little harm under IID.
      • Ensemble size M∈{2,5,10}: M=5 is a good compromise; gains persist for larger M but with diminishing returns.
  7. Practical recommendations & insights
    • Plug‐and‐play: replace softmax‐confidence by T-similarity in any wrapper self-training.
    • Modest overhead: only M=5 linear heads on top of fixed features.
    • Choose γ>0 (even γ=0.5) to promote diversity.
    • Ensemble size M≃5–10 suffices.
    • Under distribution shift (SSB / covariate shift), T-similarity hugely stabilizes pseudo‐label quality.
    • Ensure the learned feature space has spread (large \lambda_{\min}(\mathbf{X}_\ell^\top\mathbf{X}_\ell)) to maximize ensemble coverage (insight from Corollary 4.6), akin to encouraging uniform embeddings.
    • Connections to SCD: by explicitly encouraging classifier disagreement (negative‐correlation loss) on unlabeled data, T-similarity generalizes the SCD idea from two classifiers to a multi‐member ensemble, yielding a well-calibrated confidence measure for pseudo-labeling.

References to all equations and theorems are as numbered in the source paper.
