
Task2Vec Embeddings

Updated 17 February 2026
  • The paper introduces Task2Vec, which leverages Fisher Information estimates from a fixed pre-trained probe network to create vector representations of visual tasks.
  • It computes a diagonal, filter-wise approximation of the Fisher Information Matrix, enabling efficient quantification of task similarity for meta-learning and expert model selection.
  • Empirical results show that Task2Vec embeddings capture semantic relationships and facilitate near-oracle expert selection with reduced computational cost and data requirements.

Task2Vec provides a fixed-dimensional, vectorial embedding of visual classification tasks by leveraging estimates of the diagonal Fisher Information Matrix (FIM) of a pre-trained convolutional network, termed the probe network. This embedding enables quantification of task similarity, facilitates meta-learning applications such as model selection, and is label-space-invariant, making it independent of explicit class semantics. Empirical analysis demonstrates that Task2Vec embeddings strongly reflect semantic and taxonomic relations between tasks and enable efficient expert selection in transfer learning contexts, matching or closely approximating oracle performance at a fraction of the computational cost (Achille et al., 2019).

1. Formal Definition of Task2Vec Embedding

A "task" $T$ is specified by a labeled dataset $D = \{ (x_i, y_i) \}_{i=1}^n$. The core of Task2Vec is the use of a fixed, pre-trained "probe network" $\phi_\theta : X \rightarrow Z$, commonly a ResNet-34 or DenseNet-121 model trained on ImageNet. The network parameters $\theta$ remain fixed, and only a new linear or MLP head is trained on $D$ with a cross-entropy loss $\ell(\phi_\theta(x), y) = -\log p(y \mid \phi_\theta(x))$.

Task2Vec constructs an embedding using the Fisher Information Matrix (FIM) computed with respect to the feature-extractor parameters $\theta$:

$$F = \mathbb{E}_{(x, y) \sim D} \big[ g(x, y)\, g(x, y)^T \big], \qquad g(x, y) = \nabla_\theta\, \ell(\phi_\theta(x), y).$$

A diagonal, filter-wise approximation of $F$ is employed:

  • Only diagonal entries $F_{jj}$ are retained.
  • For convolutional filters $\theta = \{ \theta_f \}_{f=1}^F$, each filter's diagonal block is averaged,

$$e_f = \operatorname{mean}_{j \in \text{filter } f} F_{jj}.$$

The embedding is the $F$-dimensional vector

$$E(T) = [e_1, e_2, \ldots, e_F]^T.$$
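As a toy illustration of the construction above, the diagonal FIM of a small softmax model can be estimated from squared per-sample gradients and averaged "filter-wise". Here a fixed linear map stands in for the frozen probe, and each of its rows plays the role of one conv filter; all shapes and names are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: W is the frozen "probe" parameters theta (rows = "filters"),
# V is the task-specific classifier head trained on the task.
d, m, c, n = 5, 6, 3, 200        # input dim, feature dim, classes, samples
W = rng.normal(size=(m, d))      # frozen probe parameters theta
V = rng.normal(size=(c, m))      # task-specific head
X = rng.normal(size=(n, d))
Y = rng.integers(0, c, size=n)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

# Empirical diagonal FIM: F_jj = E[(d loss / d theta_j)^2]
F_diag = np.zeros_like(W)
for x, y in zip(X, Y):
    p = softmax(V @ (W @ x))
    p[y] -= 1.0                  # gradient of cross-entropy w.r.t. logits
    g = np.outer(V.T @ p, x)     # gradient w.r.t. W (same shape as W)
    F_diag += g ** 2
F_diag /= n

# Filter-wise average: e_f = mean of each filter's diagonal entries
embedding = F_diag.mean(axis=1)  # one coefficient per "filter" (row of W)
print(embedding.shape)           # -> (6,)
```

The embedding length depends only on the probe's filter count, never on the task's label space, which is what makes embeddings of different tasks directly comparable.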

To robustly estimate $F$, a variational approach is used: parameter uncertainty is modeled with a Gaussian posterior $\mathcal{N}(\hat\theta, \Lambda)$, yielding a surrogate loss

$$L(\hat\theta; \Lambda) = \mathbb{E}_{w \sim \mathcal{N}(\hat\theta, \Lambda)}\big[H_{p_w, \hat p}\big] + \beta \cdot \mathrm{KL}\big[\mathcal{N}(0, \Lambda) \,\|\, \mathcal{N}(0, \lambda^2 I)\big],$$

with a closed-form optimality condition showing that $\Lambda$ tracks $F$ up to a regularization term.

2. Practical Computation Protocol

To operationalize Task2Vec:

  • The probe network $\phi_\theta$ is fixed (e.g., pre-trained ResNet-34; DenseNet-121 yields similar performance; VGG-13 is suboptimal).
  • For a new task $T$, a new classifier head is attached and trained for 2 epochs (Adam optimizer, learning rate $10^{-4}$, weight decay $5 \cdot 10^{-4}$).
  • The head and the variational diagonal variances $\Lambda$ are then jointly optimized by minimizing the surrogate loss $L(\hat\theta; \Lambda)$ for a few additional epochs (learning rates: $\sim 10^{-2}$ for $\Lambda$, $10^{-4}$ for head parameters).
  • Training uses mini-batches (e.g., size 64) with class-balanced sampling.
  • On each mini-batch, per-sample parameter gradients $g_i$ with respect to $\hat\theta$ are computed, and the variances $\Lambda_{jj}$ are updated using the Stochastic Gradient Variational Bayes (SGVB) estimator.
  • After optimization, the filter-wise embedding coefficients are recovered as

$$e_f \approx (\beta / 2n) \cdot \Lambda_f - (\beta \lambda^2 / 2n).$$

The overall scale of embeddings can vary, so normalization is applied during comparison.

The entire embedding process requires roughly one head-training pass and one SGVB pass per task (under one GPU-hour per task).

3. Quantifying Task Similarity

Given two tasks $T_a$ and $T_b$ with embeddings $E_a$ and $E_b$:

  • Symmetric distance (semantic similarity): normalize elementwise,

$$N_a = E_a / (E_a + E_b), \qquad N_b = E_b / (E_a + E_b),$$

and define

$$d_{\text{sym}}(T_a, T_b) = 1 - \frac{\langle N_a, N_b \rangle}{\|N_a\|\,\|N_b\|}.$$

  • Asymmetric distance (transfer/model selection): define a "trivial" task $T_0$ whose embedding is proportional to the prior ($E(T_0) \propto \lambda^2 I$), then

$$d_{\text{asym}}(T_a \to T_b) = d_{\text{sym}}(T_a, T_b) - \alpha\, d_{\text{sym}}(T_a, T_0),$$

with $\alpha \approx 0.15$ (for ResNet-34), rewarding more complex source tasks.
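Both distances can be sketched in a few lines, assuming embeddings are stored as nonnegative NumPy vectors (function names and data are illustrative):

```python
import numpy as np

def d_sym(e_a, e_b):
    """Symmetric distance: cosine distance between elementwise-normalized embeddings."""
    n_a = e_a / (e_a + e_b)                      # elementwise normalization
    n_b = e_b / (e_a + e_b)
    cos = n_a @ n_b / (np.linalg.norm(n_a) * np.linalg.norm(n_b))
    return 1.0 - cos

def d_asym(e_src, e_dst, e0, alpha=0.15):
    """Asymmetric distance d_asym(src -> dst) with trivial-task embedding e0."""
    # subtracting the source's distance to the trivial task rewards
    # more complex (less trivial) source tasks
    return d_sym(e_src, e_dst) - alpha * d_sym(e_src, e0)

rng = np.random.default_rng(1)
e_a = rng.uniform(0.1, 1.0, size=64)
e_b = rng.uniform(0.1, 1.0, size=64)
e0 = np.full(64, 0.5)                            # trivial task: constant prior
print(round(d_sym(e_a, e_a), 6))                 # identical tasks -> 0.0
```

The elementwise normalization makes the distance invariant to the overall scale of the embeddings, consistent with the normalization noted in Section 2.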

Empirical results show that $d_{\text{sym}}$ correlates strongly with semantic and taxonomic distances—e.g., bird order tasks cluster by canonical taxonomy; visual-semantic similarity is preserved for fine-grained and attribute-based datasets.

4. Meta-Learning and Expert Model Selection

A principal application of Task2Vec is "zero-shot" selection of pretrained expert models for novel tasks. Given a library $\{ m_j \}_{j=1}^k$ of feature extractors, each associated with a training task $T_j$ and an embedding $E_j$, one selects the expert that minimizes the asymmetric distance to the new task's embedding.

  • Zero-shot Task2Vec selection: Compute $E(T)$ for the new task, score each expert with $s_j = -d_{\text{asym}}(T_j \to T)$ (the expert's task is the source argument, so experts trained on richer, less trivial tasks are favored), and select $j^* = \arg\max_j s_j$.
  • Model2vec (co-embedding): Learn a per-expert bias $b_j$ so the "model embedding" is $M_j = E(T_j) + b_j$; then rank experts by

$$d^{(i)} = \big[ d_{\text{asym}}(M_1 \to T^{(i)}), \ldots, d_{\text{asym}}(M_k \to T^{(i)}) \big]$$

and apply a softmax with a temperature parameter over $-d^{(i)}_j$. The bias parameters and temperature are trained to minimize cross-entropy against empirical expert performances.

This enables rapid, data-efficient matching of new tasks to highly compatible pretrained models without exhaustive retraining.
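The zero-shot selection rule can be sketched as follows, with the expert's task embedding as the source argument of the asymmetric distance. The embeddings below are synthetic and the function names illustrative.

```python
import numpy as np

def d_sym(e_a, e_b):
    """Symmetric Task2Vec distance (cosine on elementwise-normalized vectors)."""
    n_a, n_b = e_a / (e_a + e_b), e_b / (e_a + e_b)
    return 1.0 - n_a @ n_b / (np.linalg.norm(n_a) * np.linalg.norm(n_b))

def select_expert(task_emb, expert_embs, trivial_emb, alpha=0.15):
    """Index of the expert minimizing d_asym(expert -> task)."""
    # the alpha term depends on the expert (the source task), so richer
    # experts are preferred over near-trivial ones
    scores = [d_sym(m, task_emb) - alpha * d_sym(m, trivial_emb)
              for m in expert_embs]
    return int(np.argmin(scores))

task = np.array([1.0, 0.1, 1.0, 0.1])
trivial = np.full(4, 0.5)
experts = [task.copy(), np.array([0.1, 1.0, 0.1, 1.0])]
print(select_expert(task, experts, trivial))     # -> 0 (expert 0 matches the task)
```

Because the scoring needs only precomputed embeddings, selection cost is a handful of vector operations per expert rather than a fine-tuning run per expert.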

5. Experimental Findings and Empirical Properties

The evaluation of Task2Vec employed 1,460 fine-grained classification tasks from iNaturalist, CUB-200, iMaterialist, and DeepFashion. The expert library comprised 156 ResNet-34 models fine-tuned on specific tasks.

Key empirical results include:

  • Semantic alignment: $d_{\text{sym}}$ closely tracks average ultrametric taxonomic distance—tasks from the same taxon cluster together. t-SNE projections illustrate that CUB and iNaturalist bird tasks embed near one another; fashion attribute tasks group by visual semantics (e.g., "jeans" neighboring "denim").
  • Difficulty metric: The $\ell_1$ norm $\|E(T)\|_1$ of the Task2Vec embedding correlates with task difficulty, measured as the misclassification error of the best expert.
  • Model selection: Task2Vec enables expert selection that yields error rates within $10\%$ relative error of the optimal (oracle) choice, while outperforming chance, ImageNet-only, and other baseline selectors for both homogeneous (e.g., iNat+CUB) and mixed tasks.
  • Data efficiency: With only 500 samples per task, Task2Vec-based expert selection followed by frozen-head training surpasses ImageNet+linear and ImageNet+fine-tuning approaches; this performance gap widens at lower data limits.
  • Probe architecture dependence: ResNet-34 and DenseNet-121 yield sub-$10\%$ relative error in model selection, while VGG-13 suffers considerably higher error ($+38\%$).

These results confirm the validity and utility of Task2Vec for rapid meta-learning and few-shot adaptation scenarios (Achille et al., 2019).

6. Distinguishing Properties and Limitations

Task2Vec embeddings are of fixed length, reflecting only the probe network structure and not the task label cardinality or semantics. This invariance enables fair comparison between disparate tasks and facilitates meta-learning workflows. The embedding cost is low (sublinear in the number of experts $k$), eliminating the need for $O(k)$ retraining and evaluation.

A limitation is the dependence on the choice of probe architecture; networks such as ResNet-34 and DenseNet-121 perform robustly but models with inferior representation quality (e.g., VGG-13) degrade similarity matching performance. The approach presumes availability of a representative probe network pretrained on a task distribution related to the downstream tasks. A plausible implication is that performance may be impacted if the probe network is poorly aligned with the class of target tasks.

7. Broader Implications and Applications

Task2Vec provides a mechanism for automated reasoning about task structure in deep learning, with immediate applications for transfer, curriculum, and continual learning where matching target tasks to prior trained models is essential. Its capacity to provide a label-independent, computationally efficient, and geometry-preserving task embedding makes it a fundamental tool in meta-learning pipelines, potentially informing future strategies for scalable neural architecture search, dataset-centric transfer evaluation, and automated machine learning frameworks (Achille et al., 2019).
