Log Expected Empirical Prediction (LEEP)
- LEEP is a transferability metric that quantifies the expected log-likelihood a predictor, derived from a source model's outputs, assigns to target labels.
- It computes empirical joint, marginal, and conditional distributions via a single forward pass, ensuring efficient evaluation of source-target pairings.
- High LEEP scores correlate with faster convergence and better transfer performance, guiding model selection and fine-tuning decisions.
The Log Expected Empirical Prediction (LEEP) is a transferability metric for assessing how effectively the learned representations of a source classifier can be transferred to a target task without requiring any retraining. LEEP quantifies the expected log-likelihood that an empirically constructed predictor—based on the source model’s outputs and the target labels—would assign to the target labels. It is utilized for model selection, source-target pairing in transfer learning, and estimating the likely accuracy and convergence speed of downstream transfer without incurring the cost of model fine-tuning. LEEP has also been adapted under distinct mathematical formulations in both transfer learning and small area estimation. The following sections detail its formal definition, computational workflow, theoretical properties, empirical evaluation, comparative analysis, practical guidelines, and limitations.
1. Formal Definition
Given a pre-trained source classifier $\theta$ that outputs categorical probabilities over the source label set $\mathcal{Z}$, and a labeled target dataset $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with $y_i \in \mathcal{Y}$, define:
- Dummy-label prediction: $\theta(x_i)_z$, the predicted probability of source ("dummy") label $z \in \mathcal{Z}$ for input $x_i$.
- Empirical joint over $(y, z)$: $\hat{P}(y, z) = \frac{1}{n} \sum_{i:\, y_i = y} \theta(x_i)_z$.
- Empirical marginal over $z$: $\hat{P}(z) = \sum_{y \in \mathcal{Y}} \hat{P}(y, z) = \frac{1}{n} \sum_{i=1}^{n} \theta(x_i)_z$.
- Empirical conditional of $y$ given $z$: $\hat{P}(y \mid z) = \hat{P}(y, z) / \hat{P}(z)$.
The Expected Empirical Predictor (EEP) for $y$ given $x$ is $p(y \mid x) = \sum_{z \in \mathcal{Z}} \hat{P}(y \mid z)\, \theta(x)_z$.
The LEEP score is the average log-likelihood of the EEP on $\mathcal{D}$: $T(\theta, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \log \Big( \sum_{z \in \mathcal{Z}} \hat{P}(y_i \mid z)\, \theta(x_i)_z \Big)$.
LEEP values reside in $(-\infty, 0)$, with higher (less negative) values corresponding to greater expected transferability (Nguyen et al., 2020, Wong et al., 2022).
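As a worked illustration with hypothetical numbers: let $\mathcal{Z} = \{a, b\}$, $\mathcal{Y} = \{0, 1\}$, and $n = 2$ with $y_1 = 0$, $\theta(x_1) = (0.9, 0.1)$ and $y_2 = 1$, $\theta(x_2) = (0.2, 0.8)$. The joint is $\hat{P}(0, a) = 0.45$, $\hat{P}(0, b) = 0.05$, $\hat{P}(1, a) = 0.10$, $\hat{P}(1, b) = 0.40$, so the conditionals include $\hat{P}(0 \mid a) = 0.45/0.55 \approx 0.818$, $\hat{P}(0 \mid b) = 0.05/0.45 \approx 0.111$, and $\hat{P}(1 \mid b) = 0.40/0.45 \approx 0.889$. The EEP then assigns probability $0.818 \cdot 0.9 + 0.111 \cdot 0.1 \approx 0.747$ to $y_1$, and likewise $\approx 0.747$ to $y_2$, giving $T(\theta, \mathcal{D}) = \log 0.747 \approx -0.291$.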
2. Algorithmic Computation
Computing LEEP for a given source model and target set involves the following steps:
- Forward pass all $x_i$ through $\theta$ to obtain $\theta(x_i)_z$ for all $z \in \mathcal{Z}$.
- Compute the empirical joint for each pair $(y, z)$: $\hat{P}(y, z) = \frac{1}{n} \sum_{i:\, y_i = y} \theta(x_i)_z$.
- Compute marginals/conditionals: $\hat{P}(z) = \sum_{y} \hat{P}(y, z)$, $\hat{P}(y \mid z) = \hat{P}(y, z) / \hat{P}(z)$.
- Compute per-instance scores: $s_i = \log \sum_{z \in \mathcal{Z}} \hat{P}(y_i \mid z)\, \theta(x_i)_z$.
- Aggregate: $T(\theta, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} s_i$.
This process involves only a single forward pass of the target data through the source model, and additional arithmetic for joint distributions (Nguyen et al., 2020, Wong et al., 2022).
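A minimal NumPy sketch of these steps (the `leep` function name and input conventions are illustrative, not from the cited papers):

```python
import numpy as np

def leep(probs: np.ndarray, labels: np.ndarray) -> float:
    """LEEP score from source-model softmax outputs and target labels.

    probs:  (n, |Z|) array, row i holding theta(x_i) over the source label set.
    labels: (n,) integer array of target labels in {0, ..., |Y|-1}; every
            target class is assumed to appear at least once.
    """
    n, _ = probs.shape
    num_target = int(labels.max()) + 1

    # Empirical joint P_hat(y, z) = (1/n) * sum over {i : y_i = y} of theta(x_i)_z.
    joint = np.stack([probs[labels == y].sum(axis=0) for y in range(num_target)]) / n

    # Marginal P_hat(z) and conditional P_hat(y | z) = P_hat(y, z) / P_hat(z).
    marginal = joint.sum(axis=0)      # shape (|Z|,)
    conditional = joint / marginal    # broadcasts over rows; shape (|Y|, |Z|)

    # EEP: p(y | x_i) = sum_z P_hat(y | z) * theta(x_i)_z, then average log-likelihood.
    eep = probs @ conditional.T       # shape (n, |Y|)
    return float(np.mean(np.log(eep[np.arange(n), labels])))

if __name__ == "__main__":
    # Toy numbers from the worked example in Section 1: prints approximately -0.291.
    print(leep(np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([0, 1])))
```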
3. Theoretical Properties and Bounds
LEEP is theoretically bounded between a single-label assignment baseline and the optimal (oracle) retrained classifier head:
- Upper bound: If $\theta = (w, h)$ with frozen feature extractor $w$ and head $h$, and $k^*$ is the maximum-likelihood head retrained on $\mathcal{D}$ with $w$ fixed, then $T(\theta, \mathcal{D}) \le \frac{1}{n} \sum_{i=1}^{n} \log p(y_i \mid x_i; w, k^*)$.
- Lower bound: Let $z_i = \arg\max_{z \in \mathcal{Z}} \theta(x_i)_z$ (the "hard" dummy label), and define the Negative Conditional Entropy (NCE) from the empirical joint $\bar{P}(y, z)$ of the observed pairs $(y_i, z_i)$: $\mathrm{NCE}(Y \mid Z) = \sum_{y, z} \bar{P}(y, z) \log \bar{P}(y \mid z)$. Then $T(\theta, \mathcal{D}) \ge \mathrm{NCE}(Y \mid Z) + \frac{1}{n} \sum_{i=1}^{n} \log \theta(x_i)_{z_i}$.
These theoretical results demonstrate that LEEP interpolates between “hard” assignment metrics and the best possible retrained head log-likelihood on the target (Nguyen et al., 2020).
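The NCE quantity appearing in the lower bound can be computed directly from the hard dummy labels; a minimal sketch follows (the `nce` helper name is illustrative and complements the `leep` sketch above):

```python
import numpy as np

def nce(probs: np.ndarray, labels: np.ndarray) -> float:
    """Negative Conditional Entropy -H(Y | Z) over hard dummy labels
    z_i = argmax_z theta(x_i)_z. Same input conventions as leep() above.
    """
    n = probs.shape[0]
    z_hard = probs.argmax(axis=1)    # "hard" dummy labels z_i

    # Empirical joint P_bar(y, z) of the observed (y_i, z_i) pairs.
    joint = np.zeros((int(labels.max()) + 1, probs.shape[1]))
    np.add.at(joint, (labels, z_hard), 1.0 / n)

    p_z = joint.sum(axis=0)          # marginal P_bar(z)
    with np.errstate(divide="ignore", invalid="ignore"):
        # log P_bar(y | z); cells with no mass contribute nothing to the sum.
        log_cond = np.where(joint > 0, np.log(joint / p_z), 0.0)
    return float((joint * log_cond).sum())
```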
4. Empirical Performance and Correlation with Transfer Success
LEEP performance has been evaluated in settings including large-scale visual transfer (ImageNet → CIFAR-100), small-data and imbalanced regimes, meta-transfer (CNAPs), and RF domain adaptation:
- Large data: For ImageNet→CIFAR-100 and CIFAR-10→CIFAR-100 head retraining, LEEP shows strong positive Pearson correlation with transfer accuracy.
- Few-shot/imbalanced: On small target sets (e.g., $5$ classes with only a few examples per class), LEEP correlates positively with transfer performance and remains statistically significant under label noise or class imbalance.
- Meta-transfer (CNAPs): On 200 random 5-way, 50-shot CIFAR-100 tasks, LEEP correlates positively with meta-transfer accuracy.
- RF domain adaptation: Across SNR/FO-shifted domains, LEEP correlates with head-retrain accuracy (up to $0.82$), agrees closely with LogME (correlations up to $0.90$), and selects a near-optimal source model in over 80% of cases (Nguyen et al., 2020, Wong et al., 2022).
- Convergence: Models in higher-LEEP bins converge faster and surpass scratch-trained accuracy with fewer epochs.
5. Comparative Analysis with Baselines
LEEP has been compared against Negative Conditional Entropy (NCE) [Tran et al., ICCV’19] and H score [Bao et al., ICIP’19]:
- LEEP matches or exceeds NCE's Pearson correlation in most settings, with up to 30% relative improvement in some cases (e.g., one setting improves from $0.715$ to $0.798$).
- H score frequently fails to produce statistically significant correlations (in 11 of 23 cases) and never surpasses LEEP on large-data benchmarks.
- In the representative ImageNet→CIFAR-100 head-retraining setting, LEEP attains a higher Pearson correlation than the H score's $0.924$ (Nguyen et al., 2020).
6. Practical Guidelines for Application
- Efficiency: LEEP requires only one forward pass per (source model, target data) pair.
- Minimal data: Robust to small, imbalanced, or noisy target sets, provided there are at least several examples per class to reliably estimate the empirical distributions $\hat{P}(y, z)$ and $\hat{P}(y \mid z)$.
- Utilities:
- Rank source models for transfer (model zoo selection); see the ranking sketch at the end of this section.
- Screen source-target pairings for joint/multi-task grouping.
- Guide decisions on fine-tuning necessity and anticipate convergence rates.
- Domain transferability: LEEP is not symmetric; transfer from hard→easy yields higher scores than the reverse.
- RF applications: In scenarios such as modulation classification, LEEP aligns with domain proximity (SNR/FO), and can guide rapid source selection without retraining (Nguyen et al., 2020, Wong et al., 2022).
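A brief usage sketch for the model-zoo ranking case, reusing the hypothetical `leep` function from Section 2 (all names are placeholders, not an official API):

```python
import numpy as np

def rank_source_models(
    model_outputs: dict[str, np.ndarray], labels: np.ndarray
) -> list[tuple[str, float]]:
    """Rank candidate source models by LEEP on one labeled target set, best first.

    model_outputs maps each model's name to its (n, |Z|) softmax outputs on the
    target inputs; leep() is the sketch from Section 2.
    """
    scores = {name: leep(probs, labels) for name, probs in model_outputs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```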
7. Limitations and Considerations
- Data sparsity: With very few examples per target class, the empirical estimates $\hat{P}(y, z)$ become unreliable, raising the variance of LEEP scores.
- Source model dependency: LEEP accuracy presumes a reasonably well-trained source classifier.
- Feature usage: LEEP operates on softmax outputs; it does not explicitly exploit intermediate feature activations.
- Architectural effects: It may not capture nuanced behaviors when fine-tuning is highly architecture- or hyperparameter-sensitive.
- Scope: Requires softmax-based source models and compatible input spaces; it does not generalize to non-classification tasks such as regression without modification (Nguyen et al., 2020, Wong et al., 2022).
LEEP stands as a theoretically grounded, computationally efficient, and empirically validated metric for assessing source model transferability across numerous domains and regimes. Its design, tight theoretical bounds, and high empirical correlation with actual transfer performance make it a practical decision metric in both research and applied transfer learning settings (Nguyen et al., 2020, Wong et al., 2022).