Multitask Representation Learning
- Multitask representation learning is an approach that jointly learns shared features from multiple tasks, improving sample efficiency, prediction accuracy, and generalization.
- It employs techniques such as alternating minimization, trace-norm regularization, and self-supervised objectives to recover low-dimensional structures across diverse tasks.
- Practical applications have demonstrated significant gains in sample complexity, regret minimization, and transfer performance in areas like bandits, reinforcement learning, and multimodal fusion.
Multitask representation learning (MRL) refers to algorithms and frameworks in which representations—typically feature maps, neural embeddings, or latent projections—are learned jointly across multiple tasks, with the aim of leveraging shared structure to improve sample efficiency, prediction accuracy, and generalization. When the tasks share an underlying low-dimensional representation but differ in their task-specific objectives or outputs, MRL exploits this structure to identify representations that transfer across tasks and domains. MRL is foundational in contexts ranging from contextual bandits (Lin et al., 2 Oct 2024) and linear/nonlinear MDPs (Lu et al., 1 Mar 2025, Lu et al., 2022, Cheng et al., 2022, Lu et al., 2021), to multi-task supervised learning, multimodal fusion, and deep metric/subspace learning.
1. Formal Problem Setups and Representation Assumptions
The canonical setting for multitask representation learning involves M tasks, each with its own data and objective. The tasks may be supervised (classification, regression), unsupervised (self-supervised), or sequential decision problems (contextual bandits, reinforcement learning). The critical assumption is that task-specific mappings (parameters, policies, value functions, regression vectors, etc.) are not independent, but instead can be factorized through a shared representation or feature extractor. Typical formalizations include:
- Shared Linear Subspace: For contextual bandit tasks each defined by parameters θ_1, …, θ_M ∈ R^d, form the matrix Θ = [θ_1, …, θ_M] and posit rank(Θ) = k ≪ d; i.e., there exist B ∈ R^{d×k} and w_1, …, w_M ∈ R^k with θ_m = B w_m (Lin et al., 2 Oct 2024, Cella et al., 2022).
- General Nonlinear Representation Classes: Each task's function is modeled as (approximately) linear in φ(·), the shared feature map drawn from a class Φ, but φ itself can be nonlinear (e.g. a neural network) (Lu et al., 1 Mar 2025, Lu et al., 2022, Collins et al., 2023).
- Low-Rank MDPs: Transition kernels and reward functions in RL can be factorized via a shared feature map φ(s, a) and task-dependent coefficients, so value functions are (approximately) linear in φ (Cheng et al., 2022, Lu et al., 2021).
- Hard Parameter Sharing in Supervised MTL: A shared encoder feeds into task-specific heads (Shin et al., 25 Sep 2024), enforcing a common representation across tasks.
MRL algorithms enforce or exploit these shared representations explicitly (e.g. via trace-norm/low-rank regularization, co-clustering, alternating minimization) or implicitly (e.g. via hard parameter sharing, auxiliary self-supervised objectives, meta-learning routines).
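The shared-linear-subspace formalization above can be made concrete with synthetic data. The following is a minimal sketch (the dimensions and NumPy construction are illustrative, not taken from any cited paper): M task parameter vectors are generated from a common k-dimensional subspace, so the stacked parameter matrix has rank k rather than full rank.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, M = 50, 3, 20  # ambient dimension, shared rank, number of tasks

# Shared feature subspace: orthonormal columns of B span a k-dim subspace of R^d.
B, _ = np.linalg.qr(rng.standard_normal((d, k)))

# Task-specific low-dimensional coefficients w_m; task parameters theta_m = B w_m.
W = rng.standard_normal((k, M))
Theta = B @ W  # d x M matrix stacking all task parameters

# The stacked parameter matrix has rank k << min(d, M).
assert np.linalg.matrix_rank(Theta) == k
```

Every MRL formalization in this section is a variant of this picture: the columns of Theta are coupled only through the low-dimensional factor B, which is exactly what trace-norm penalties and alternating minimization try to recover.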
2. Algorithms and Methodologies for Shared Representation Recovery
MRL techniques span several methodologies, distinguished by how the shared representation is recovered and exploited.
- Alternating Minimization and Spectral Initialization: For linear contextual bandits, alternating projected gradient descent and closed-form estimation over the feature subspace (the column span of B) and task coefficients (the w_m) robustly recover the shared low-rank representation (Lin et al., 2 Oct 2024).
- Trace-Norm Bandit Algorithms: In stochastic linear bandits, trace-norm regularization penalizes the rank of the task-parameter matrix Θ, enforcing low-dimensional structure without explicit knowledge of the rank (Cella et al., 2022).
- Functional Confidence Sets (GFUCB): The Generalized Functional Upper Confidence Bound algorithm constructs confidence balls jointly over the shared representation class and all tasks, enabling sample-efficient exploration in general nonlinear settings (Lu et al., 1 Mar 2025, Lu et al., 2022).
- Dummy Gradient Norm Regularization (DGR): Regularizes the norm of the gradient of task losses w.r.t. randomly initialized dummy predictors, incentivizing the encoder to yield more universal representations (Shin et al., 25 Sep 2024).
- Co-Clustering Frameworks: Factorization of task weight matrix clusters both features and tasks, yielding a latent co-clustered subspace optimal for multitask generalization (Murugesan et al., 2017).
- Hierarchical Multi-Modal/Multi-Level Fusion: Stacking attention or fusion modules, sometimes at different network depths for different tasks, facilitates hierarchical shared representations in vision-language and multimodal setups (Nguyen et al., 2018, Weng et al., 2019).
- Bootstrapped Latent Prediction: Self-supervised multitask RL methods predict future latent embeddings and use reversibility to ground representation learning in dynamics, yielding architectures like PBL (Guo et al., 2020).
- Active Source Task Sampling: Empirically and theoretically, allocating samples from source tasks according to estimated relevance can yield sample complexity gains by a factor of the number of sources, surpassing uniform sampling (Chen et al., 2022).
These methodologies may be combined with task-specific regularizers, scheduled or curriculum-based training, and targeted transfer via context, metadata, or hierarchical attention.
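The alternating-minimization recipe can be sketched for the linear model: with the subspace fixed, each task reduces to a k-dimensional regression; with the coefficients fixed, the subspace is re-estimated by a joint least squares over all tasks. This is a simplified alternating-least-squares sketch, not the exact AltGDMin updates of Lin et al.; it uses a noiseless, data-rich regime in which the spectral initialization already succeeds and the alternation merely maintains it (the alternation matters when per-task data is scarce or noisy).

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, M, n = 30, 2, 15, 40   # ambient dim, rank, tasks, samples per task

B_true, _ = np.linalg.qr(rng.standard_normal((d, k)))
W_true = rng.standard_normal((k, M))
X = rng.standard_normal((M, n, d))
y = np.einsum('mnd,dm->mn', X, B_true @ W_true)   # y[m] = X[m] @ theta_m

# Spectral initialization: top-k left singular vectors of the stacked
# per-task least-squares estimates.
Theta0 = np.stack([np.linalg.lstsq(X[m], y[m], rcond=None)[0]
                   for m in range(M)], axis=1)
B_hat = np.linalg.svd(Theta0, full_matrices=False)[0][:, :k]

for _ in range(5):
    # w-step: with the subspace fixed, each task is a k-dim regression.
    W_hat = np.stack([np.linalg.lstsq(X[m] @ B_hat, y[m], rcond=None)[0]
                      for m in range(M)], axis=1)
    # B-step: joint least squares over all tasks for vec(B) (row-major),
    # using kron(X_m, w_m^T) as the per-task design block.
    A = np.concatenate([np.kron(X[m], W_hat[:, m][None, :]) for m in range(M)])
    b = np.linalg.lstsq(A, y.ravel(), rcond=None)[0].reshape(d, k)
    B_hat, _ = np.linalg.qr(b)   # re-orthonormalize the subspace estimate

# Distance between the estimated and true subspaces.
err = np.linalg.norm((np.eye(d) - B_hat @ B_hat.T) @ B_true)
```

The key design point is that the B-step pools information from all M tasks, which is why the subspace can be estimated even when each individual task would be data-starved.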
3. Theoretical Guarantees and Sample Complexity Results
Recent work offers rigorous guarantees for MRL under various regimes.
- Linear Bandit Regret: For M tasks of dimension d sharing a rank-k subspace, multitask initialization and AltGDMin achieve regret scaling with the intrinsic dimension k rather than the ambient dimension d, outperforming single-task OFUL (whose regret scales with d) whenever k ≪ d (Lin et al., 2 Oct 2024). Similarly, trace-norm methods match minimax lower bounds (Cella et al., 2022).
- General Function Classes: The GFUCB framework yields regret bounds for bandits and linear MDPs of
providing quantified sample efficiency gains over independent runs, for any representation class with bounded eluder dimension and covering number (Lu et al., 1 Mar 2025, Lu et al., 2022).
- Transfer to New Tasks: Transfer regret for a new task using a learned shared representation scales with a mismatch term capturing how far the learned representation is from covering the new task; transfer succeeds as this mismatch vanishes with more source tasks (Lu et al., 1 Mar 2025).
- Multitask RL Downstream Improvements: In generative linear MDPs, pretraining a shared representation on source samples reduces the number of samples required for downstream tasks by a factor proportional to the square of the representation class's Gaussian width (Lu et al., 2021). In low-rank MDPs, the REFUEL algorithm demonstrates sample complexity reductions in both upstream model learning and downstream policy optimization, with downstream suboptimality dominated by upstream representation error plus terms that vanish as the number of downstream samples grows (Cheng et al., 2022).
- Active Source Sampling: Adaptive sample allocation that prioritizes sources by estimated relevance yields a gain in source sample complexity by a factor of the number of source tasks over uniform allocation, with provable risk bounds in linear models and empirical gains with CNNs (Chen et al., 2022).
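The origin of these sample-complexity gains can be seen in a toy comparison (hypothetical dimensions; this assumes the shared subspace has already been learned): with a k-dimensional representation in hand, a new task needs only on the order of k samples, whereas estimating the d-dimensional parameter directly from n < d samples is underdetermined.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, n = 50, 3, 5          # ambient dim, shared rank, samples for the new task

B, _ = np.linalg.qr(rng.standard_normal((d, k)))   # learned shared subspace
theta_new = B @ rng.standard_normal(k)             # new task lies in the subspace

X = rng.standard_normal((n, d))
y = X @ theta_new                                  # noiseless observations

# With the representation: an n x k regression, solvable since n >= k.
w_hat = np.linalg.lstsq(X @ B, y, rcond=None)[0]
err_shared = np.linalg.norm(B @ w_hat - theta_new)

# Without it: the n x d system is underdetermined; lstsq returns the
# minimum-norm solution, which misses every component of theta_new
# outside the n-dimensional row space of X.
theta_direct = np.linalg.lstsq(X, y, rcond=None)[0]
err_direct = np.linalg.norm(theta_direct - theta_new)
```

Running this, `err_shared` is at machine precision while `err_direct` is large, which is the per-task version of the d-versus-k dependence in the regret and sample-complexity bounds above.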
4. Empirical Validation and Benchmark Comparisons
MRL frameworks consistently report improved efficiency and accuracy across benchmarks in synthetic, vision, NLP, and RL domains.
- Bandits and Linear RL: AltGDMin and trace-norm algorithms rapidly converge and reduce regret in synthetic and MNIST-based contextual-bandit tasks (Lin et al., 2 Oct 2024, Cella et al., 2022).
- Nonlinear RL: GFUCB empirically achieves lower regret and better transfer with neural-network representations, outpacing separate single-task learning (Lu et al., 1 Mar 2025).
- Supervised Hard Sharing: DGR provides statistically significant improvements in representation quality and multi-task prediction across UTKFace, NYUv2, CityScapes, and Pascal, with robust gains as the number of dummy decoders grows and across classifier types (Shin et al., 25 Sep 2024).
- Multimodal Fusion: Multitask frameworks in pathology metadata prediction improve average AUC-ROC by 16.48% (TCGA) and 9.05% (TTH) relative to single-modal or single-task base models (Weng et al., 2019).
- Vision-Language Hierarchical MTL: Dense co-attention architectures deliver significant gains on image-caption retrieval (Flickr30k, MSCOCO), VQA (VQA 2.0), and visual grounding (Flickr30k Entities), especially in the hardest metrics (R@1) (Nguyen et al., 2018).
- Reinforcement Learning: Multitask shared-value-function learning (MT-FQI/MT-PI) yields faster convergence (roughly 50–100 iterations) and higher policy returns with fewer samples versus single-task methods (Borsa et al., 2016). Composable context-based representations with metadata (CARE) achieve state-of-the-art sample efficiency on Meta-World multitask robotic control (Sodhani et al., 2021).
5. Extensions, Limitations, and Open Directions
MRL theory and practice continue to evolve across several current frontiers:
- General Function Classes: Extending theoretical guarantees to deep nonlinear representation classes with tight complexity bounds remains challenging; existing results depend on potentially loose eluder dimensions and covering numbers (Lu et al., 1 Mar 2025, Lu et al., 2022, Collins et al., 2023).
- Online and Nonstationary Contexts: Adaptation to drifting or adversarial task families, nonstationary environments, and online subspace tracking is largely open (Qin et al., 2022).
- Auxiliary and Self-Supervised Regularization: The integration of auxiliary decoders (DGR) and self-supervised objectives further enhances universality of representations but may require convexity and careful hyperparameter selection (Shin et al., 25 Sep 2024).
- Active Source Selection: Sample allocation across heterogeneous source tasks is central for scalability; empirical active sampling shows robust gains, but theory is developed only for linear settings (Chen et al., 2022).
- Multimodal and Hierarchical Fusion: Structuring representations for multimodal data (e.g., text, image, structured features) and hierarchical multi-task networks (e.g., attaching heads at multiple network depths) is effective but hyperparameter-sensitive (Weng et al., 2019, Nguyen et al., 2018).
- Transfer and Zero-Shot Generalization: MRL often improves transfer to novel tasks, with the achievable precision depending on how well the learned representation covers the new task and on metadata context integration (Sodhani et al., 2021).
- Computational Oracles and Optimization: Some optimal algorithms depend on intractable oracles or require block coordinate descent and non-convex minimization (e.g., GFUCB); efficient approximations remain to be developed.
6. Practical Guidelines and Implications
The accumulated evidence from theory and empirical benchmarks suggests several guidelines for practitioners:
- Applicable Regimes: MRL provides the most benefit when tasks share a low-dimensional representation and per-task data is limited relative to the ambient dimension, and when representations are transferable across tasks or domains (Maurer et al., 2015).
- Choice of Regularization and Model Architecture: For hard parameter sharing, regularize for universality (DGR, gradient norm); for subspace structure, apply trace-norm penalties or co-clustering. For nonlinear/deep formulations, use functional confidence balls and optimize empirical risk jointly.
- Transfer and Sampling: For transfer, ensure the learned feature map adequately spans the relevant subspace for the target task; design sampling so that even the least-activated feature directions are sufficiently covered (Lu et al., 2021).
- Hyperparameter Tuning: The number of shared encoders, epoch lengths, batch sizes, and regularization weights profoundly impact learning efficiency; select via cross-validation on held-out or proxy domains (Shin et al., 25 Sep 2024, Nguyen et al., 2018).
- Interpretability and Modularity: Hierarchical and composable representations facilitate interpretability, cluster semantically related tasks, and allow for context-sensitive adaptation (Sodhani et al., 2021, Weng et al., 2019).
- Limitations: Fundamental constraints persist: nonconvex local minima, reliance on linear or mild nonlinearity, dependence on adequate source-task diversity, and tuning of representation complexity.
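The hard-parameter-sharing guideline above can be sketched as a shared encoder feeding task-specific heads. The following is a minimal NumPy forward pass; the layer sizes and the two hypothetical task heads (a 10-way classifier and a scalar regressor) are illustrative, not drawn from any cited architecture.

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_shared = 32, 8        # input dim, shared-representation dim

# Shared encoder parameters (a single linear layer + ReLU for illustration).
W_enc = rng.standard_normal((d_shared, d_in)) * 0.1
b_enc = np.zeros(d_shared)

# Task-specific heads: e.g. a 10-way classifier and a scalar regressor.
W_cls = rng.standard_normal((10, d_shared)) * 0.1
W_reg = rng.standard_normal((1, d_shared)) * 0.1

def encode(x):
    """Shared representation reused by every task head."""
    return np.maximum(W_enc @ x + b_enc, 0.0)

def forward(x):
    z = encode(x)              # computed once, shared across all tasks
    logits = W_cls @ z         # task 1: classification scores
    value = (W_reg @ z)[0]     # task 2: regression output
    return logits, value

logits, value = forward(rng.standard_normal(d_in))
```

Because every head consumes the same `z`, gradients from all task losses flow into the encoder, which is the mechanism (and the regularization target of methods like DGR) by which hard sharing shapes a universal representation.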
In summary, multitask representation learning acts as a statistically and computationally principled foundation for extracting reusable, efficient, and generalizable features across diverse tasks, modalities, and decision processes—offering quantifiable advantages in sample complexity, regret minimization, and transfer, while revealing ongoing challenges in handling nonlinearity, nonstationarity, and scale.