Multi-Task Reinforcement Learning

Updated 7 May 2026

Multi-task reinforcement learning is a paradigm that learns policies across multiple tasks simultaneously by exploiting shared state-action spaces and task-specific dynamics.
It leverages structural similarities among tasks with techniques like context representations, modular routing, and successor features to improve sample efficiency and avoid negative transfer.
Recent advances show scalable architectures achieving high success rates in benchmarks, emphasizing explicit policy guidance and normalization to balance diverse objectives.

Multi-task reinforcement learning (MTRL) denotes the study and development of reinforcement learning (RL) algorithms that aim to efficiently acquire policies or value functions across multiple tasks simultaneously. In MTRL, tasks are often modeled as Markov decision processes (MDPs) with either shared or partially shared state and action spaces but potentially differing reward functions and transition dynamics. MTRL seeks to exploit structural similarities among tasks to improve sample efficiency, avoid negative transfer, and support rapid adaptation or generalization to new or composite tasks.

1. Problem Formulations and Core Challenges

Let $\mathbb{T} = \{1, 2, \dots, N\}$ denote a set of $N$ tasks, each defined by an MDP $(S, A, P_i, R_i, \gamma)$ sharing state and action spaces $(S, A)$ but with task-specific transitions $P_i$ and rewards $R_i$. The multi-task RL objective is to maximize the average discounted return across all tasks:

$\max_{\{\pi_i\}_{i=1}^N}\;J(\{\pi_i\})\;=\;\frac1N\sum_{i=1}^N\mathbb E_{\tau\sim\pi_i,P_i}\left[\sum_{t=0}^{\infty}\gamma^tR_i(s_t,a_t)\right].$
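
As a minimal sketch of how this objective can be estimated from sampled trajectories, the Python snippet below averages discounted returns within each task and then across tasks. The function names and the nested-list layout of `rollouts` are assumptions made for illustration, and finite trajectories stand in for the infinite-horizon sum.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma^t * r_t for one finite reward sequence."""
    total = 0.0
    # Accumulate backwards so each step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def multitask_objective(rollouts: Sequence[Sequence[Sequence[float]]],
                        gamma: float) -> float:
    """Monte Carlo estimate of J: average over tasks of the mean return.

    rollouts[i] holds the reward sequences collected on task i
    (hypothetical layout chosen for this example).
    """
    per_task_means = []
    for task_rollouts in rollouts:
        returns = [discounted_return(r, gamma) for r in task_rollouts]
        per_task_means.append(sum(returns) / len(returns))
    # Uniform 1/N weighting over tasks, as in the objective above.
    return sum(per_task_means) / len(per_task_means)
```

Because tasks are weighted uniformly, a task whose rewards are numerically large can dominate this average, which is one motivation for the balancing techniques listed among the challenges.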

Challenges that emerge in this setting include:

Distribution mismatch and transfer: Tasks may vary significantly in their transition or reward dynamics, leading to risks of both positive and negative transfer.
Policy coupling versus specialization: Optimal policies in multi-task settings may require intrinsic stochasticity (unlike the single-task case), and naïve parameter sharing can lead to catastrophic interference (Zeng et al., 2020).
Scalability and sample efficiency: Explicitly leveraging structure—shared representations, successor features, or model-based priors—becomes critical as the number of tasks increases (Borsa et al., 2016, Landolfi et al., 2019).
Credit assignment and balancing: Reward scale mismatches and task saliency can bias updates, necessitating techniques for normalization or adaptive balancing (Hessel et al., 2018).
Task curriculum and transfer scheduling: The sequence and selection of tasks can affect convergence and the avoidance of negative transfer (Huang et al., 2022).
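
To make the reward-scale issue concrete, here is a small sketch of per-task return normalization using running statistics. It is a deliberately simplified stand-in for the normalization and adaptive balancing schemes cited above, not a reproduction of any specific published method; the class name and interface are hypothetical.

```python
import math


class PerTaskNormalizer:
    """Tracks a running mean and variance of returns separately per task."""

    def __init__(self, num_tasks: int, eps: float = 1e-8):
        self.count = [0] * num_tasks
        self.mean = [0.0] * num_tasks
        self.m2 = [0.0] * num_tasks  # running sum of squared deviations
        self.eps = eps

    def update(self, task: int, value: float) -> None:
        # Welford's online update for mean and variance.
        self.count[task] += 1
        delta = value - self.mean[task]
        self.mean[task] += delta / self.count[task]
        self.m2[task] += delta * (value - self.mean[task])

    def std(self, task: int) -> float:
        # Fall back to unit scale until at least two samples are seen.
        if self.count[task] < 2:
            return 1.0
        return math.sqrt(self.m2[task] / self.count[task])

    def normalize(self, task: int, value: float) -> float:
        return (value - self.mean[task]) / (self.std(task) + self.eps)
```

With this in place, two tasks whose returns differ by a factor of 100 produce targets on a comparable scale, so neither dominates a shared gradient update purely because of its reward units.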

2. Parameter Sharing and Architectural Approaches

A dominant design in MTRL employs network architectures or policy parametrizations that share a significant subset of parameters across tasks. Methods in this family include:

Context-based representations: Policies are conditioned on a context variable (task descriptor or embedding), as in CARE, which mixes reusable encoders via context-driven attention, leading to significant robustness and zero-shot generalization on Meta-World MT10/MT50 (Sodhani et al., 2021).
Routing and modularization: Soft modularization uses differentiable routing networks to combine modules layer-wise, enabling task-dependent parameter reuse and reducing direct interference (Yang et al., 2020). Empirical evaluations indicate that soft modular networks learn up to $2\times$ to $4\times$ faster than monolithic architectures in multi-task manipulation.
Explicit attention and sub-networks: Attentive multi-task RL leverages attention to group state representations into different sub-networks on a per-task and per-state basis, avoiding negative transfer when tasks conflict while maximizing positive transfer when possible (Bram et al., 2019).
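
The idea of mixing reusable encoders with context-driven attention can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the CARE implementation: each encoder is a single `tanh` layer, attention scores are a linear function of a task embedding, and all names and shapes are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mix_encoders(state, task_embedding, encoders, attention):
    """Combine K encoder outputs using task-dependent attention weights.

    encoders:  K weight matrices, each of shape (feat_dim, state_dim)
    attention: one matrix of shape (K, ctx_dim) that scores each encoder
    """
    # Each encoder produces its own feature vector for the same state.
    features = [[math.tanh(v) for v in matvec(w, state)] for w in encoders]
    # The task embedding alone decides how much each encoder contributes.
    alpha = softmax(matvec(attention, task_embedding))
    feat_dim = len(features[0])
    mixed = [sum(a * f[j] for a, f in zip(alpha, features))
             for j in range(feat_dim)]
    return mixed, alpha
```

Tasks with similar embeddings receive similar mixtures and therefore share encoders, while dissimilar tasks can place their weight on different encoders, which is the intuition behind reduced interference.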

A key empirical finding is that simply scaling up shared networks can, when paired with sufficient task diversity, outpace more complex MTRL-specific architectures. Success rates on the Meta-World MT10 and MT50 benchmarks increase from approximately $73\%$ (2M parameters) to roughly $90\%$ (128M parameters) for a simple feed-forward (FF) SAC baseline (McLean et al., 7 Mar 2025).

Architecture	Sample Efficiency (MT10/MT50)	Final Success (%)	Scaling Limits
Shared FF (small)	Lower	63–73	Good up to about 10 tasks
Soft Modularization	Higher	61 (MT50), fast	Good up to 50 tasks
CARE (context representations)	Very High	84 (MT10), 54 (MT50)	Metadata and encoder tuning
Naïve Scaling (FF, large)	Competitive/Best	Up to 90	Requires large task diversity

3. Explicit Multi-Policy and Guidance Approaches

Beyond implicit sharing, more advanced frameworks explicitly treat the selection and combination of multiple policies:

Cross-task policy guidance (CTPG): This augments parameter-sharing approaches with a learned "guide policy" for each task that, at a fixed interval of steps, selects which task's control policy should act, thus directly providing expert trajectories for less-mastered tasks (He et al., 9 Jul 2025). Two gating mechanisms are included: a policy-filter gate, which filters out unhelpful advisors, and a guide-block gate, which blocks advice for mastered tasks based on entropy-temperature statistics.
Successor-feature-based value/policy composition: Successor features (SFs) enable decomposition of task-specific value functions and policies by their expected visitation of reward-relevant features. By composing SFs (e.g., through GPI or MSF approaches), new tasks with linearly parameterized rewards can be solved without retraining (Liu et al., 2023).
Model-based transfer and planning quasi-metrics: Approaches using learned dynamics models or distances (as in PQM) achieve rapid adaptation to new tasks either through virtual-policy warmup or through generalizable distance-based aiming (Landolfi et al., 2019, Micheli et al., 2020).
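
The composition of successor features through generalized policy improvement (GPI) can be illustrated as follows. The sketch assumes the conventional setup in which rewards are linear in features, $r(s,a) = \phi(s,a)^\top w$, so that each stored policy's action value is $Q(s,a) = \psi(s,a)^\top w$ with $\psi$ its successor features. The function names and the nested-list layout are hypothetical, and the MSF variant mentioned above is not covered.

```python
def q_values(psi, w):
    """Q[a] = psi[a] . w for one policy at the current state.

    psi[a] is that policy's successor-feature vector for action a.
    """
    return [sum(p * wi for p, wi in zip(psi_a, w)) for psi_a in psi]


def gpi_action(psis, w):
    """Pick the action maximizing the best Q-value over all stored policies.

    psis[i][a] is the successor-feature vector of policy i for action a,
    evaluated at the current state; w is the new task's reward weights.
    """
    per_policy_q = [q_values(psi, w) for psi in psis]
    num_actions = len(per_policy_q[0])
    # For each action, keep the most optimistic estimate across policies.
    best_q = [max(q[a] for q in per_policy_q) for a in range(num_actions)]
    best_action = max(range(num_actions), key=best_q.__getitem__)
    return best_action, best_q
```

Only the weight vector `w` changes when a new task arrives, so the stored successor features are reused without any further policy training.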

Empirical results indicate that explicit guidance, composition, and model-based regularization yield both higher sample efficiency and improved final returns, and can accelerate transfer relative to naïve fine-tuning (He et al., 9 Jul 2025, Liu et al., 2023, Landolfi et al., 2019).

4. Multi-Objective and Constrained Multi-task RL

Increasingly, research in MTRL focuses on balancing multiple task objectives, with formal constraints or multi-criteria weighting:

Lagrangian-based action correction (TSAC): Policies are decomposed into a shared component (for dense rewards) and a corrective component (for sparse, goal-oriented rewards); a virtual budget with Lagrangian relaxation balances these objectives, producing state-of-the-art performance on Meta-World benchmarks (Feng et al., 2024).
Direct constrained optimization: Primal-dual natural policy gradient and actor-critic algorithms optimize the average return subject to minimum performance constraints per task, with sample-complexity guarantees that match those of single-task RL (Zeng et al., 2024).
Asymmetric curriculum-based MTRL: CAMRL combines a transfer matrix regularization, differentiable ranking losses, adaptive mode switching between curriculum and single-task learning, and automatic loss weighting; these mechanisms combine to reduce negative transfer and accelerate multi-domain convergence (Huang et al., 2022).
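
The constrained formulation can be made concrete with a Lagrangian sketch. Assume the problem is to maximize the average return $\frac1N\sum_i J_i$ subject to $J_i \ge b_i$ for every task, with one multiplier $\lambda_i \ge 0$ per constraint. The snippet shows only the dual update and the resulting per-task weights; the policy (primal) update, the thresholds $b_i$, and all function names are assumptions for illustration rather than the cited algorithms.

```python
def dual_step(multipliers, task_returns, thresholds, step_size):
    """Projected gradient step on the Lagrange multipliers.

    The multiplier for task i grows while J_i is below its threshold b_i
    and shrinks toward zero once the constraint is satisfied.
    """
    return [max(0.0, lam - step_size * (ret - b))
            for lam, ret, b in zip(multipliers, task_returns, thresholds)]


def task_weights(multipliers):
    """Weight on each task's return in the Lagrangian: 1/N + lambda_i.

    A primal step would ascend the weighted sum of per-task policy gradients.
    """
    n = len(multipliers)
    return [1.0 / n + lam for lam in multipliers]
```

Tasks that violate their minimum-performance constraint therefore receive more weight in the next policy update, while tasks that comfortably satisfy it fall back to the uniform $1/N$ weighting.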

5. Theoretical Foundations and Convergence Properties

Theoretical advances in MTRL provide formal convergence, efficiency, and generalization guarantees under various assumptions:

For shared-representation and context-based models, convex regularization (e.g., group sparsity) and alternating minimization in joint feature/weight space yield convergence to stationary points with uniform approximation guarantees on per-task value errors (Borsa et al., 2016).
Decentralized policy gradient methods come with provable convergence rates to stationarity, with PL-type guarantees under discounted occupancy-matching conditions (Zeng et al., 2020).
Cross-learning in RKHS provides policy proximity guarantees and convergence of projected policy gradient to near-optimality under smoothness and bounded-variance assumptions (Cervino et al., 2020).
Model-transfer and optimistic multi-task bandit/MDP techniques yield gap-dependent and gap-independent collective regret bounds, demonstrating potential efficiency gains in suboptimal state-action pairs (Zhang et al., 2021).
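
As a rough illustration of the group-sparsity regularization mentioned above (a generic form, not necessarily the exact objective of the cited work), suppose each task $i$ has a linear value-weight vector $w_i \in \mathbb{R}^d$ over $d$ shared features, stacked as the columns of $W \in \mathbb{R}^{d\times N}$, with a per-task fitting loss $\ell_i$ and a regularization weight $\mu > 0$. All of these symbols are assumptions introduced here. A standard $\ell_{2,1}$ penalty then reads:

$\min_{W}\;\sum_{i=1}^{N}\ell_i(w_i)\;+\;\mu\sum_{j=1}^{d}\lVert W_{j,:}\rVert_2.$

Each row $W_{j,:}$ collects the weights that all $N$ tasks place on feature $j$, so the penalty tends to zero out entire rows and leaves a small set of features shared across tasks.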

6. Empirical Advances: Large-scale and Embodied MTRL

Recent advances expand MTRL to high-dimensional continuous control, vision-based domains, and real-world robotics:

Generalizable visuomotor agents: Large-scale MTRL in the Minecraft domain, using cross-view goal specifications and automated task synthesis, achieves higher success rates and strong zero-shot generalization across 3D environments and bare-metal robots (Cai et al., 31 Jul 2025).
Quadrotors and continuous control: Multi-critic architectures leveraging shared platform dynamics support high-speed racing, fast stabilization, and velocity tracking in both simulated and real quadrotors, with better sample efficiency than single-task approaches (Xing et al., 2024).
Lifelong/Sequential MTRL: By storing and filtering all past transitions, labeling them with new-task rewards, and annealing their inclusion as the current task is mastered, robotic systems roughly halve sample budgets in sequential curriculum learning without catastrophic forgetting (Xie et al., 2021).
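
The replay-reuse recipe in the last item can be sketched as three small steps: filter and relabel stored transitions with the new task's reward, shrink the share of old data as the new task is mastered, and mix the two sources in each batch. The linear annealing schedule, the `keep` predicate, and every name below are assumptions chosen for illustration, not the cited system's actual design.

```python
import random


def relabel(transitions, reward_fn, keep=lambda s, a, s_next: True):
    """Filter stored (s, a, s_next) tuples and attach new-task rewards."""
    return [(s, a, reward_fn(s, a, s_next), s_next)
            for (s, a, s_next) in transitions
            if keep(s, a, s_next)]


def old_data_fraction(success_rate, start=0.5):
    """Linearly anneal the share of past-task data as success approaches 1."""
    clipped = min(max(success_rate, 0.0), 1.0)
    return start * (1.0 - clipped)


def sample_batch(old_data, new_data, batch_size, success_rate, rng):
    """Draw a batch mixing relabeled old data with current-task data.

    new_data is assumed to be non-empty.
    """
    n_old = min(len(old_data),
                round(batch_size * old_data_fraction(success_rate)))
    n_new = batch_size - n_old
    batch = rng.sample(old_data, n_old)
    batch += [rng.choice(new_data) for _ in range(n_new)]
    return batch
```

Early in training on a new task, up to half of each batch comes from relabeled past experience. Once the success rate approaches one, batches consist only of current-task data.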

7. Outlook and Open Problems

MTRL is now established as a primary paradigm for scalable RL in domains where skill composition, sample efficiency, and transfer are paramount. Despite these advances, significant questions remain:

Automatic discovery of transferable structure: How to autonomously identify and exploit latent task hierarchies, options, or sub-task representations without domain priors.
Mitigating interference and negative transfer: Although sophisticated gating, normalization, and curriculum strategies address some aspects, discovering mechanistically flexible parameter sharing remains an open challenge.
Scalability and diversity regularization: As agents scale to hundreds of tasks, maintaining plasticity and avoiding capacity bottlenecks become crucial, as observed explicitly in neuron-activation analyses (McLean et al., 7 Mar 2025).
Unsupervised and continual task curricula: Frameworks flexible enough to accommodate lifelong, open-ended task arrival in nonstationary environments are under active development.
Integration with large-scale simulation and real-world robotics: Systems such as CVGS in Minecraft (Cai et al., 31 Jul 2025) and MTRL for quadrotors (Xing et al., 2024) illustrate the pathway to general-purpose, embodied intelligence.

In sum, MTRL has evolved from simple parameter-sharing towards architectures and optimization paradigms that are structured, adaptive, and empirically validated at large scale. The field is characterized by an interplay between theory (generalization, sample efficiency), system design (scalable distributed training, automatic normalization, gating), and empirical progress in real-world and high-dimensional domains. Continuing research will further unify these aspects, forging the path to generalization and lifelong learning in complex, multi-task environments.