MTE: Model Transferability Estimation
- MTE is a framework that estimates the transfer potential of pre-trained models using proxy scores derived from target data and model features.
- It employs both static and dynamic metrics—such as H-score, LEEP, LogME, and OTCE—to assess feature quality, class separability, and adaptation potential.
- Applications span vision, language, speech, and medical imaging, reducing computational costs by selecting models with high predicted downstream performance.
Model Transferability Estimation (MTE) quantifies the prospective utility of pre-trained models for new target tasks without performing exhaustive fine-tuning. It enables efficient model selection from increasingly large model zoos by assigning each candidate a transferability score that correlates with expected downstream performance. MTE is essential in modern transfer learning pipelines for vision, language, speech, and other modalities. This article delineates the formal setting, algorithmic developments, evaluation protocols, and current challenges in MTE, with emphasis on frameworks and findings from recent literature.
1. Formal Definitions and Problem Statement
Let $\mathcal{M} = \{\phi_1, \ldots, \phi_M\}$ denote a set of pre-trained source models. Given a target dataset $\mathcal{D}_T = \{(x_i, y_i)\}_{i=1}^{n}$ for a downstream task (e.g., classification, regression, segmentation), the performance of model $\phi_m$ on $\mathcal{D}_T$ after transfer (fine-tuning or adaptation) is denoted $P_m$. As fine-tuning every $\phi_m$ is computationally infeasible at scale, Model Transferability Estimation aims to compute, for each $\phi_m$, a transferability score $S_m$ using only the pre-trained model and a small sample from $\mathcal{D}_T$. The goal is that $S_m$ should be strongly correlated (in rank and, ideally, magnitude) with $P_m$, formalized as maximizing a correlation metric such as Pearson's $r$, Kendall's $\tau$, or weighted Kendall's $\tau_w$ between $\{S_m\}$ and $\{P_m\}$ (Ding et al., 23 Feb 2024).
Two main regimes are distinguished:
- Source-Free MTE: Only the pre-trained model and the target dataset are available; source training data is inaccessible. This is the dominant regime in current practice.
- Source-Dependent MTE: Both the pre-trained model and the corresponding source data are available, allowing joint source-target statistics (Ding et al., 23 Feb 2024).
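The correlation objective above can be made concrete with a few lines of SciPy. The sketch below assumes transferability scores and post-fine-tuning accuracies are already available for a small candidate pool; all numbers are illustrative placeholders.

```python
import numpy as np
from scipy.stats import pearsonr, kendalltau, weightedtau

# Hypothetical transferability scores S_m and ground-truth transfer accuracies P_m
# for five candidate models (values are illustrative only).
scores = np.array([0.62, 0.55, 0.71, 0.48, 0.66])      # S_m from some MTE metric
accuracies = np.array([0.83, 0.79, 0.88, 0.74, 0.85])  # P_m after full fine-tuning

r, _ = pearsonr(scores, accuracies)         # linear (magnitude) agreement
tau, _ = kendalltau(scores, accuracies)     # rank agreement
tau_w, _ = weightedtau(scores, accuracies)  # rank agreement, emphasizing top-ranked models

print(f"Pearson r = {r:.3f}, Kendall tau = {tau:.3f}, weighted tau = {tau_w:.3f}")
```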
2. Principled Methodologies and Representative Metrics
MTE methods are bifurcated into static and dynamic approaches and further classified by whether they exploit source data.
2.1 Static Source-Free MTE
These methods estimate transferability from the structure of the features or logits produced by the pre-trained model on the target data.
- H-score [bao2019information]: $\mathcal{H}(f) = \operatorname{tr}\big(\operatorname{cov}(f)^{-1}\,\operatorname{cov}(\mathbb{E}[f \mid y])\big)$, where $\operatorname{cov}(f)$ is the total feature covariance and $\operatorname{cov}(\mathbb{E}[f \mid y])$ is the covariance of the class-conditional feature means. Shrinkage-based H-score improves stability and correlational power in high-dimensional regimes (Ibrahim et al., 2021); a minimal sketch appears at the end of this subsection.
- LEEP [nguyen2020leep]: $\mathrm{LEEP} = \frac{1}{n}\sum_{i=1}^{n} \log\big(\sum_{z} \hat{P}(y_i \mid z)\,\theta(x_i)_z\big)$, where $\theta(x_i)_z$ is the source head's probability for source (dummy) class $z$ and $\hat{P}(y \mid z)$ is the empirical conditional distribution of target labels given source predictions. It treats source head outputs as soft cluster assignments and assesses their alignment with target labels.
- LogME [you2021logme]: $\mathrm{LogME} = \frac{1}{n}\max_{\alpha,\beta} \log p(\mathbf{y} \mid F, \alpha, \beta)$, the maximized marginal log-likelihood (evidence) of a Bayesian linear regression head fit between features $F$ and target labels $\mathbf{y}$. Demonstrated robustness across modalities and scenarios (Ding et al., 23 Feb 2024).
- TransRate (Huang et al., 2021): $\mathrm{TrR}(Z, Y) = R(Z, \epsilon) - R(Z \mid Y, \epsilon)$, a mutual-information proxy that contrasts the coding rate of target features $Z$ with their class-conditional coding rate, computed efficiently from target features and labels.
- GBC [pandy2022bhattacharyya]: Measures the average Bhattacharyya distance between class-conditional feature distributions, quantifying overlap and thus transfer difficulty.
Static metrics additionally include variants focused on class separability (Separation Index, RankMe), or similarity to random features or idealized target geometry (Singh et al., 10 Feb 2025, Guo, 1 May 2024).
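A minimal sketch of the shrinkage H-score mentioned above, assuming target features have already been extracted with the frozen backbone; the function name is illustrative, and Ledoit-Wolf shrinkage stands in for whichever shrinkage estimator a particular implementation adopts.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def shrinkage_h_score(features, labels):
    """H-score with a Ledoit-Wolf shrinkage estimate of the feature covariance.

    features: (n, d) array of target features from the frozen model.
    labels:   (n,)   array of target class labels.
    """
    features = features - features.mean(axis=0, keepdims=True)
    cov_f = LedoitWolf().fit(features).covariance_  # shrunk total feature covariance
    # Covariance of class-conditional feature means (global mean is zero after centering).
    class_means = np.zeros_like(features)
    for c in np.unique(labels):
        idx = labels == c
        class_means[idx] = features[idx].mean(axis=0)
    cov_mu = class_means.T @ class_means / len(features)
    # tr(cov_f^{-1} cov_mu); higher values indicate more class-discriminative features.
    return float(np.trace(np.linalg.solve(cov_f, cov_mu)))
```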
2.2 Dynamic Source-Free MTE
- Energy models: Evaluate the “in-distribution” likelihood or energy of target data under the pre-trained model (e.g., ETran [gholami2023etran], PED [li2023exploring]).
- Linear Proxy Frameworks: Simulate adaptation by fitting lightweight heads; a generic linear-probe sketch follows this list.
  - LogME, T-LogME, and SFDA extend evidence estimates or Fisher discriminant analysis to encode both label fit and feature adequacy (Singh et al., 10 Feb 2025, Fouquet et al., 2023).
- Implicit Transferability Modeling (ITM) (Zheng et al., 27 Oct 2025) models latent adaptation dynamics via closed-form embedding updates over pseudo-clusters, capturing the evolution of embeddings in a fine-tuning proxy.
- Model/Task Vectorization: Encode both model architecture and task summaries in a shared space (e.g., SynLearn, Model Spider [zhang2023model]).
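As a concrete illustration of the linear-proxy idea referenced above, the sketch below scores a model by the held-out accuracy of a lightweight linear head fit on its frozen target features; it is a generic stand-in for this family, not a re-implementation of LogME, T-LogME, or SFDA.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_probe_proxy(features, labels, seed=0):
    """Generic linear-proxy transferability score: held-out accuracy of a
    lightweight linear head trained on frozen target features.
    Higher scores suggest easier adaptation of the backbone."""
    X_tr, X_va, y_tr, y_va = train_test_split(
        features, labels, test_size=0.3, random_state=seed, stratify=labels)
    head = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return head.score(X_va, y_va)
```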
2.3 Source-Dependent MTE and Optimal Transport
When source data is available, MTE can exploit joint distributions:
- OTCE (Optimal Transport Conditional Entropy) (Tan et al., 2021, Tan et al., 2021): $\mathrm{OTCE} = \lambda_1 W_D + \lambda_2 H(Y_T \mid Y_S) + b$, combining the domain difference $W_D$ (the entropic optimal-transport cost) with the task difference $H(Y_T \mid Y_S)$ (the conditional entropy of target labels given source labels under the coupling), after computing an optimal coupling between source and target pixel/embedding distributions via entropic OT; a simplified sketch follows this list.
- JC-NCE (Tan et al., 2021): Refines OTCE by building correspondences on both feature and label distribution distances, then measuring negative conditional entropy over the OT coupling—substantially improving performance under cross-domain shifts.
- Ensemble-Based Metrics: Extending single-source analysis, ensemble-specific extensions of LEEP (e.g., SoftIoU-EEP, E-LEEP) allow selection of sets of source models optimizing ensemble transfer (Agostinelli et al., 2021).
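A minimal OTCE-style sketch, assuming source features and labels (Xs, Ys), target features and labels (Xt, Yt), and the POT library (`ot`); the fixed weights and sign convention are simplifications, since the published method fits its coefficients on held-out transfer results.

```python
import numpy as np
import ot  # Python Optimal Transport (POT)

def otce_style_score(Xs, Ys, Xt, Yt, reg=0.1, lam1=1.0, lam2=1.0):
    """OTCE-style proxy: entropic OT cost between source/target features plus
    the conditional entropy of target labels given source labels under the
    OT coupling. lam1/lam2 are illustrative placeholders."""
    a = np.full(len(Xs), 1.0 / len(Xs))  # uniform source marginal
    b = np.full(len(Xt), 1.0 / len(Xt))  # uniform target marginal
    M = ot.dist(Xs, Xt)                  # squared Euclidean cost matrix
    P = ot.sinkhorn(a, b, M, reg)        # entropic OT coupling
    W_D = float(np.sum(P * M))           # domain difference (transport cost)

    # Joint label distribution induced by the coupling, then H(Y_t | Y_s).
    src_classes, tgt_classes = np.unique(Ys), np.unique(Yt)
    joint = np.zeros((len(src_classes), len(tgt_classes)))
    for i, cs in enumerate(src_classes):
        for j, ct in enumerate(tgt_classes):
            joint[i, j] = P[np.ix_(Ys == cs, Yt == ct)].sum()
    p_src = joint.sum(axis=1, keepdims=True)
    cond = np.divide(joint, p_src, out=np.zeros_like(joint), where=p_src > 0)
    log_cond = np.log(cond, out=np.zeros_like(cond), where=cond > 0)
    H_cond = -float(np.sum(joint * log_cond))

    # Lower domain and task difference -> easier transfer, so negate for a
    # "higher is better" transferability score.
    return -(lam1 * W_D + lam2 * H_cond)
```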
3. Specialized MTE: Dense Prediction, Regression, and Speech
MTE frameworks have been adapted to structured-output and other modalities beyond image classification.
- Medical Image Segmentation:
The CC-FV (Class Consistency-Feature Variety) framework scores a segmentation backbone by combining intra-class feature compactness (via Wasserstein distances) and global feature diversity (hyperspherical energy), using multi-scale decoder fusions (Yang et al., 2023). It delivers higher Kendall's $\tau$ and Pearson $r$ than LEEP, LogME, GBC, and TransRate for 3D CT anatomical segmentation.
- Semantic Segmentation and Object Detection:
Both OTCE and TLogME have been verified on semantic segmentation and detection, introducing domain-specific adaptations for pixel- or ROI-level features and integrating regression head evidence (Fouquet et al., 2023).
- Regression Tasks:
Negative regularized mean-squared error of a ridge regression head on frozen source features, termed Linear MSE and Label MSE, provides efficient and theoretically justified transferability predictors with strong performance gains over LogME and TransRate (Nguyen et al., 2023); a brief ridge-regression sketch follows this list.
- Speech Models:
Score-based frameworks using Bayesian evidence (LogME) and optimal transport (sliced Wasserstein distance) compute layer-wise or model-wise transferability in pre-trained speech models (supervised and self-supervised), showing high rank correlation to WER and phoneme error rate after adaptation (Chen et al., 2023).
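For the regression setting above, the sketch below illustrates a Linear-MSE-style score as the negative cross-validated mean-squared error of a ridge head on frozen features; the cross-validation protocol and ridge strength are illustrative choices rather than the exact procedure of Nguyen et al. (2023).

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def linear_mse_score(features, targets, alpha=1.0):
    """Regression transferability proxy: negative mean-squared error of a ridge
    head on frozen source features (neg_mean_squared_error is already
    'higher is better', so the fold average serves directly as the score)."""
    return cross_val_score(Ridge(alpha=alpha), features, targets,
                           scoring="neg_mean_squared_error", cv=5).mean()
```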
4. Evaluation Protocols, Benchmarking, and Comparative Insights
4.1 Metrics of Evaluation
Transferability estimators are principally evaluated by their rank-order correlation (Kendall's $\tau$, often weighted as $\tau_w$ for top-k alignment), Pearson's $r$, and sometimes top-k accuracy matching against the ground-truth ranking (post fine-tuning) (Ding et al., 23 Feb 2024, Singh et al., 7 Oct 2025). Additional desiderata include numeric fidelity (score differences reflecting accuracy differences), robustness to model pool composition, and computational efficiency.
4.2 Benchmark Pitfalls and Guidelines
Recent meta-analyses reveal flaws in “static leaderboard” benchmarks:
- Overly broad model pools dominated by the largest architectures bias results in favor of trivial rankers ("pick the biggest model").
- Low rank dispersion and excessive model overlap inflate performance metrics, masking the inability of metrics to capture real task specificity (Singh et al., 7 Oct 2025).
- Fidelity—how well score gaps reflect ground-truth accuracy disparities—must be directly examined, in addition to rank-based metrics.
Best practices for future benchmarks include:
- Curating compute-matched, architecture-diverse model zoos.
- Ensuring dataset “headroom” (far from accuracy saturation) and substantial task diversity.
- Enforcing rank dispersion across both models and datasets.
- Full release of code, precomputed scores, and accuracy matrices for reproducibility (Singh et al., 7 Oct 2025).
5. Empirical Findings, Recent Advances, and Meta-Learning
5.1 Large-Scale Comparative Studies
- Stability: Aggregate analyses over >700,000 transfer experiments demonstrate that LogME is the most stable for dataset-selection in segmentation, N-LEEP dominates for architecture selection in classification, and GBC excels for ranking target subtasks—yet no metric is uniformly superior (Agostinelli et al., 2022).
- Medical and Surgical Applications: In surgical phase recognition, LogME with minimum-per-video aggregation robustly aligns with fine-tuning accuracy when the model pool is diverse; otherwise, discriminative power degrades (Singh et al., 22 Aug 2025).
- Robustness via Perturbation: Feature-space perturbation (the Spread–Attract framework), which intentionally increases intra-class variability and blurs inter-class boundaries, systematically enhances the robustness and accuracy of any MTE metric by penalizing non-robust feature geometries (Khoba et al., 23 Feb 2025).
5.2 Recent Algorithmic Developments
- Kernel and Simplicity-Based Scores: Kite combines centered kernel alignment to the ideal label kernel and to random features, delivering state-of-the-art correlation at minimal computational cost (Guo, 1 May 2024); a simplified alignment sketch follows this list. Occam's model metrics (INT, Concept Variance) directly quantify embedding "regularity" and outperform baselines in diverse classification tasks (Singh et al., 10 Feb 2025).
- Neural Collapse–Inspired Methods: FaCe aggregates variance-collapse and class-fairness terms to measure how pre-adapted a model’s features are for neural collapse, yielding strong results for image, segmentation, and text classification (Ding et al., 2023).
- Implicit Evolution Models: ITM learns latent variables governing embedding evolution during fine-tuning, using divide-and-conquer variational updates. This approach generalizes across CNN, ViT, and both supervised and self-supervised pretraining, yielding superior ranking accuracy at moderate compute cost (Zheng et al., 27 Oct 2025).
- Meta-Learning for Metric Selection: MetaRank encodes both dataset and metric textual descriptions into a shared semantic space and learns a meta-predictor to select the MTE metric most likely to perform best for a given new task, outperforming any single fixed metric (Liu et al., 26 Nov 2025).
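The kernel-alignment idea behind Kite can be sketched with linear centered kernel alignment (CKA); the combination rule below, a simple difference between label alignment and random-feature alignment, is an assumption for illustration and does not reproduce the published Kite score.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment (CKA) between two representations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

def kite_style_score(features, labels, seed=0):
    """Illustrative Kite-style proxy: alignment of target features with an
    ideal one-hot label representation, contrasted with alignment to random
    features. Assumes integer labels in {0, ..., C-1}."""
    rng = np.random.default_rng(seed)
    onehot = np.eye(int(labels.max()) + 1)[labels]      # "ideal" label representation
    random_feats = rng.standard_normal(features.shape)  # random-feature baseline
    return linear_cka(features, onehot) - linear_cka(features, random_feats)
```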
6. Limitations, Open Challenges, and Future Directions
Despite substantial progress, major open problems remain:
- Robustness to Experimental Choices: MTE scores are sensitive to conceptual and practical factors—feature normalization, target class numbers, data subsampling, and fine-tuning hyperparameters can all reverse metric rankings (Agostinelli et al., 2022, Singh et al., 7 Oct 2025).
- Unified Benchmarks: Absence of large, standardized, multi-modal, multi-architecture benchmarks limits definitive comparisons and hinders progress (Ding et al., 23 Feb 2024, Singh et al., 7 Oct 2025).
- Beyond Supervised Classification: Few metrics handle transfer for detection, regression, sequence tagging, or when minimal/no labels are available for the target.
- Scaling to Foundation Models: The applicability of current MTE techniques to extremely large models across vision, text, audio, and multi-modal regimes (e.g., CLIP, LLMs) is largely unexplored; efficient, label-free proxies for such models are a pressing need (Ding et al., 23 Feb 2024).
- Unsupervised and Weak-Label Regimes: Recent metrics such as RankMe and variants of OTCE point toward progress in label-free MTE, but further generalization is needed.
7. Practical Recommendations and Summary Table
Practitioners should consider the following:
| Scenario | Recommended MTE Metric(s) | Notes |
|---|---|---|
| Vision classification | LogME, H-score (shrinkage), Occam (INT/CV), Kite | For large model zoos; INT/CV and Kite offer speed |
| Medical image segmentation | CC-FV, OTCE (if source data accessible) | Wasserstein & feature diversity essential |
| Dense prediction | TLogME (object detection), OTCE (segmentation) | Incorporate regression head when possible |
| Regression | Linear MSE, Label MSE (Nguyen et al., 2023) | Ridge regression on frozen source features |
| Ensembles | SoftIoU-EEP, E-LEEP (Agostinelli et al., 2021) | Ensemble-specific extensions outperform singles |
| Model zoo with many tasks | LogME, N-LEEP, GBC (task dependent) | No one-size-fits-all; benchmark to validate |
| Metric selection | MetaRank (Liu et al., 26 Nov 2025) | Task-aware meta-ranking of metrics |
A strategic approach is to use multiple metrics, emphasize benchmark and model-pool diversity, validate top choices by partial fine-tuning, and—where possible—incorporate feature perturbation or complexity/simplicity regularization. Future MTE frameworks are expected to jointly address scalability, robustness, modality generality, and effective unsupervised guidance.
References:
Survey: "Which Model to Transfer? A Survey on Transferability Estimation" (Ding et al., 23 Feb 2024) Medical Segmentation: "Pick the Best Pre-trained Model: Towards Transferability Estimation for Medical Image Segmentation" (Yang et al., 2023) Segmentation OTCE: "Transferability Estimation for Semantic Segmentation Task" (Tan et al., 2021) Shrinkage H-score/LogME: "Newer is not always better: Rethinking transferability metrics, their peculiarities, stability and performance" (Ibrahim et al., 2021) JC-NCE: "Practical Transferability Estimation for Image Classification Tasks" (Tan et al., 2021) Occam's Model: "Occam's model: Selecting simpler representations for better transferability estimation" (Singh et al., 10 Feb 2025) Benchmark critique: "How NOT to benchmark your SITE metric: Beyond Static Leaderboards and Towards Realistic Evaluation" (Singh et al., 7 Oct 2025) Kite: "KITE: A Kernel-based Improved Transferability Estimation Method" (Guo, 1 May 2024) Neural Collapse/FaCe: "Unleashing the power of Neural Collapse for Transferability Estimation" (Ding et al., 2023) Speech models: "How to Estimate Model Transferability of Pre-Trained Speech Models?" (Chen et al., 2023) TransRate: "Frustratingly Easy Transferability Estimation" (Huang et al., 2021) Feature Perturbation: "Feature Space Perturbation: A Panacea to Enhanced Transferability Estimation" (Khoba et al., 23 Feb 2025) Large-scale stability: "How stable are Transferability Metrics evaluations?" (Agostinelli et al., 2022) MetaRank: "MetaRank: Task-Aware Metric Selection for Model Transferability Estimation" (Liu et al., 26 Nov 2025) Implicit modeling: "Implicit Modeling for Transferability Estimation of Vision Foundation Models" (Zheng et al., 27 Oct 2025) Regression: "Simple Transferability Estimation for Regression Tasks" (Nguyen et al., 2023).