
Task2Vec Embeddings

Updated 17 February 2026
  • The paper introduces Task2Vec, which leverages Fisher Information estimates from a fixed pre-trained probe network to create vector representations of visual tasks.
  • It computes a diagonal, filter-wise approximation of the Fisher Information Matrix, enabling efficient quantification of task similarity for meta-learning and expert model selection.
  • Empirical results show that Task2Vec embeddings capture semantic relationships and facilitate near-oracle expert selection with reduced computational cost and data requirements.

Task2Vec provides a fixed-dimensional, vectorial embedding of visual classification tasks by leveraging estimates of the diagonal Fisher Information Matrix (FIM) of a pre-trained convolutional network, termed the probe network. This embedding enables quantification of task similarity, facilitates meta-learning applications such as model selection, and is label-space-invariant, making it independent of explicit class semantics. Empirical analysis demonstrates that Task2Vec embeddings strongly reflect semantic and taxonomic relations between tasks and enable efficient expert selection in transfer learning contexts, matching or closely approximating oracle performance at a fraction of the computational cost (Achille et al., 2019).

1. Formal Definition of Task2Vec Embedding

A "task" $T$ is specified by a labeled dataset $D = \{ (x_i, y_i) \}_{i=1}^n$. The core of Task2Vec is the use of a fixed, pre-trained "probe network" $\phi_\theta : X \rightarrow Z$, commonly a ResNet-34 or DenseNet-121 model trained on ImageNet. The network parameters $\theta$ remain fixed, and only a new linear or MLP head is trained on $D$ with a cross-entropy loss $\ell(\phi_\theta(x), y) = -\log p(y \mid \phi_\theta(x))$.

Task2Vec constructs an embedding using the Fisher Information Matrix (FIM) computed with respect to the feature-extractor parameters $\theta$:

$$F = \mathbb{E}_{(x, y) \sim D} \big[ g(x, y)\, g(x, y)^T \big], \qquad g(x, y) = \nabla_\theta\, \ell(\phi_\theta(x), y).$$

A diagonal, filter-wise approximation of $F$ is employed:

  • Only diagonal entries $F_{jj}$ are retained.
  • For convolutional filters $\theta = \{ \theta_f \}_{f=1}^F$, each filter's diagonal block is averaged,

$$e_f = \operatorname{mean}_{j \in \text{filter } f} F_{jj}.$$

The embedding is the $F$-dimensional vector

$$E(T) = [e_1, e_2, \ldots, e_F]^T.$$
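As a toy illustration of the construction above, the diagonal FIM of a small softmax model can be estimated from squared per-sample gradients and averaged "filter-wise". Here a fixed linear map stands in for the frozen probe, and each of its rows plays the role of one conv filter; all shapes and names are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: W is the frozen "probe" parameters theta (rows = "filters"),
# V is the task-specific classifier head trained on the task.
d, m, c, n = 5, 6, 3, 200        # input dim, feature dim, classes, samples
W = rng.normal(size=(m, d))      # frozen probe parameters theta
V = rng.normal(size=(c, m))      # task-specific head
X = rng.normal(size=(n, d))
Y = rng.integers(0, c, size=n)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

# Empirical diagonal FIM: F_jj = E[(d loss / d theta_j)^2]
F_diag = np.zeros_like(W)
for x, y in zip(X, Y):
    p = softmax(V @ (W @ x))
    p[y] -= 1.0                  # gradient of cross-entropy w.r.t. logits
    g = np.outer(V.T @ p, x)     # gradient w.r.t. W (same shape as W)
    F_diag += g ** 2
F_diag /= n

# Filter-wise average: e_f = mean of each filter's diagonal entries
embedding = F_diag.mean(axis=1)  # one coefficient per "filter" (row of W)
print(embedding.shape)           # -> (6,)
```

The embedding length depends only on the probe's filter count, never on the task's label space, which is what makes embeddings of different tasks directly comparable.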

To robustly estimate $F$, a variational approach is used: parameter uncertainty is modeled with a Gaussian posterior $\mathcal{N}(\hat\theta, \Lambda)$, yielding a surrogate loss

$$L(\hat\theta; \Lambda) = \mathbb{E}_{w \sim \mathcal{N}(\hat\theta, \Lambda)}\big[H_{p_w, \hat p}\big] + \beta \cdot \mathrm{KL}\big[\mathcal{N}(0, \Lambda) \,\|\, \mathcal{N}(0, \lambda^2 I)\big],$$

with a closed-form optimality condition showing that $\Lambda$ tracks $F$ up to a regularization term.

2. Practical Computation Protocol

To operationalize Task2Vec:

  • The probe network $\phi_\theta$ is fixed (e.g., pre-trained ResNet-34; DenseNet-121 yields similar performance; VGG-13 is suboptimal).
  • For a new task $T$, a new classifier head is attached and trained for 2 epochs (Adam optimizer, learning rate $10^{-4}$, weight decay $5 \cdot 10^{-4}$).
  • The head and the variational diagonal variances $\Lambda$ are then jointly optimized by minimizing the surrogate loss $L(\hat\theta; \Lambda)$ for a few additional epochs (learning rates: $\sim 10^{-2}$ for $\Lambda$, $10^{-4}$ for head parameters).
  • Training uses mini-batches (e.g., size 64) with class-balanced sampling.
  • On each mini-batch, per-sample parameter gradients $g_i$ with respect to $\hat\theta$ are computed, and the variances $\Lambda_{jj}$ are updated using the Stochastic Gradient Variational Bayes (SGVB) estimator.
  • After optimization, the filter-wise embedding coefficients are recovered as

$$e_f \approx (\beta / 2n) \cdot \Lambda_f - (\beta \lambda^2 / 2n).$$

The overall scale of embeddings can vary, so normalization is applied during comparison.

The entire embedding process requires roughly one head-training pass and one SGVB pass per task (under one GPU-hour per task).

3. Quantifying Task Similarity

Given two tasks $T_a$ and $T_b$ with embeddings $E_a$ and $E_b$:

  • Symmetric distance (semantic similarity): normalize elementwise,

$$N_a = E_a / (E_a + E_b), \qquad N_b = E_b / (E_a + E_b),$$

and define

$$d_{\text{sym}}(T_a, T_b) = 1 - \frac{\langle N_a, N_b \rangle}{\|N_a\|\,\|N_b\|}.$$

  • Asymmetric distance (transfer/model selection): define a "trivial" task $T_0$ whose embedding is proportional to the prior ($E(T_0) \propto \lambda^2 I$), then

$$d_{\text{asym}}(T_a \to T_b) = d_{\text{sym}}(T_a, T_b) - \alpha\, d_{\text{sym}}(T_a, T_0),$$

with $\alpha \approx 0.15$ (for ResNet-34), rewarding more complex source tasks.
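Both distances can be sketched in a few lines, assuming embeddings are stored as nonnegative NumPy vectors (function names and data are illustrative):

```python
import numpy as np

def d_sym(e_a, e_b):
    """Symmetric distance: cosine distance between elementwise-normalized embeddings."""
    n_a = e_a / (e_a + e_b)                      # elementwise normalization
    n_b = e_b / (e_a + e_b)
    cos = n_a @ n_b / (np.linalg.norm(n_a) * np.linalg.norm(n_b))
    return 1.0 - cos

def d_asym(e_src, e_dst, e0, alpha=0.15):
    """Asymmetric distance d_asym(src -> dst) with trivial-task embedding e0."""
    # subtracting the source's distance to the trivial task rewards
    # more complex (less trivial) source tasks
    return d_sym(e_src, e_dst) - alpha * d_sym(e_src, e0)

rng = np.random.default_rng(1)
e_a = rng.uniform(0.1, 1.0, size=64)
e_b = rng.uniform(0.1, 1.0, size=64)
e0 = np.full(64, 0.5)                            # trivial task: constant prior
print(round(d_sym(e_a, e_a), 6))                 # identical tasks -> 0.0
```

The elementwise normalization makes the distance invariant to the overall scale of the embeddings, consistent with the normalization noted in Section 2.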

Empirical results show that $d_{\text{sym}}$ correlates strongly with semantic and taxonomic distances—e.g., bird order tasks cluster by canonical taxonomy; visual-semantic similarity is preserved for fine-grained and attribute-based datasets.

4. Meta-Learning and Expert Model Selection

A principal application of Task2Vec is "zero-shot" selection of pretrained expert models for novel tasks. Given a library $\{ m_j \}_{j=1}^k$ of feature extractors, each associated with a training task $T_j$ and an embedding $E_j$, one selects the expert that minimizes the asymmetric distance to the new task's embedding.

  • Zero-shot Task2Vec selection: Compute $E(T)$ for the new task, score each expert with $s_j = -d_{\text{asym}}(T_j \to T)$ (the expert's task is the source argument, so experts trained on richer, less trivial tasks are favored), and select $j^* = \arg\max_j s_j$.
  • Model2vec (co-embedding): Learn a per-expert bias $b_j$ so the "model embedding" is $M_j = E(T_j) + b_j$; then rank experts by

$$d^{(i)} = \big[ d_{\text{asym}}(M_1 \to T^{(i)}), \ldots, d_{\text{asym}}(M_k \to T^{(i)}) \big]$$

and apply a softmax with a temperature parameter over $-d^{(i)}_j$. The bias parameters and temperature are trained to minimize cross-entropy against empirical expert performances.

This enables rapid, data-efficient matching of new tasks to highly compatible pretrained models without exhaustive retraining.
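The zero-shot selection rule can be sketched as follows, with the expert's task embedding as the source argument of the asymmetric distance. The embeddings below are synthetic and the function names illustrative.

```python
import numpy as np

def d_sym(e_a, e_b):
    """Symmetric Task2Vec distance (cosine on elementwise-normalized vectors)."""
    n_a, n_b = e_a / (e_a + e_b), e_b / (e_a + e_b)
    return 1.0 - n_a @ n_b / (np.linalg.norm(n_a) * np.linalg.norm(n_b))

def select_expert(task_emb, expert_embs, trivial_emb, alpha=0.15):
    """Index of the expert minimizing d_asym(expert -> task)."""
    # the alpha term depends on the expert (the source task), so richer
    # experts are preferred over near-trivial ones
    scores = [d_sym(m, task_emb) - alpha * d_sym(m, trivial_emb)
              for m in expert_embs]
    return int(np.argmin(scores))

task = np.array([1.0, 0.1, 1.0, 0.1])
trivial = np.full(4, 0.5)
experts = [task.copy(), np.array([0.1, 1.0, 0.1, 1.0])]
print(select_expert(task, experts, trivial))     # -> 0 (expert 0 matches the task)
```

Because the scoring needs only precomputed embeddings, selection cost is a handful of vector operations per expert rather than a fine-tuning run per expert.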

5. Experimental Findings and Empirical Properties

The evaluation of Task2Vec employed 1,460 fine-grained classification tasks from iNaturalist, CUB-200, iMaterialist, and DeepFashion. The expert library comprised 156 ResNet-34 models fine-tuned on specific tasks.

Key empirical results include:

  • Semantic alignment: $d_{\text{sym}}$ closely tracks average ultrametric taxonomic distance—tasks from the same taxon cluster together. t-SNE projections illustrate that CUB and iNaturalist bird tasks embed near one another; fashion attribute tasks group by visual semantics (e.g., "jeans" neighboring "denim").
  • Difficulty metric: The $\ell_1$ norm $\|E(T)\|_1$ of the Task2Vec embedding correlates with task difficulty, measured as the misclassification error of the best expert.
  • Model selection: Task2Vec enables expert selection that yields error rates within $10\%$ relative error of the optimal (oracle) choice, while outperforming chance, ImageNet-only, and other baseline selectors for both homogeneous (e.g., iNat+CUB) and mixed tasks.
  • Data efficiency: With only 500 samples per task, Task2Vec-based expert selection followed by frozen-head training surpasses ImageNet+linear and ImageNet+fine-tuning approaches; this performance gap widens at lower data limits.
  • Probe architecture dependence: ResNet-34 and DenseNet-121 yield sub-$10\%$ relative error in model selection, while VGG-13 suffers considerably higher error ($+38\%$).

These results confirm the validity and utility of Task2Vec for rapid meta-learning and few-shot adaptation scenarios (Achille et al., 2019).

6. Distinguishing Properties and Limitations

Task2Vec embeddings are of fixed length, reflecting only the probe network structure and not the task label cardinality or semantics. This invariance enables fair comparison between disparate tasks and facilitates meta-learning workflows. The embedding cost is low (sublinear in the number of experts $k$), eliminating the need for $O(k)$ retraining and evaluation.

A limitation is the dependence on the choice of probe architecture; networks such as ResNet-34 and DenseNet-121 perform robustly but models with inferior representation quality (e.g., VGG-13) degrade similarity matching performance. The approach presumes availability of a representative probe network pretrained on a task distribution related to the downstream tasks. A plausible implication is that performance may be impacted if the probe network is poorly aligned with the class of target tasks.

7. Broader Implications and Applications

Task2Vec provides a mechanism for automated reasoning about task structure in deep learning, with immediate applications for transfer, curriculum, and continual learning where matching target tasks to prior trained models is essential. Its capacity to provide a label-independent, computationally efficient, and geometry-preserving task embedding makes it a fundamental tool in meta-learning pipelines, potentially informing future strategies for scalable neural architecture search, dataset-centric transfer evaluation, and automated machine learning frameworks (Achille et al., 2019).
