Scheduled Multi-Task Learning (SML)

Updated 23 May 2026

Scheduled Multi-task Learning (SML) is a dynamic framework that modulates task contributions via explicit or adaptive schedules to optimize a main task.
It leverages approaches like adaptive sampling, gradient-based scheduling, and meta-learning to selectively incorporate auxiliary tasks and reduce negative interference.
Empirical results show SML yields significant improvements in areas such as neural translation and vision benchmarks by strategically balancing task emphasis.

Scheduled Multi-task Learning (SML) is a family of methodologies for multi-task learning (MTL) in which the order, frequency, and relative emphasis of different tasks are dynamically controlled according to an explicit or implicit schedule. SML frameworks are designed to maximize performance on the main task by leveraging auxiliary tasks selectively, through task weighting, adaptive sampling, sequential group updates, or meta-learning policies. This approach stands in contrast to conventional multi-task learning, where task contributions are fixed or determined via static weighting, and it draws together ideas from curriculum learning, meta-learning, and gradient-based task selection.

1. Core Principles and Definitions

The central objective in SML is to learn a shared representation or parameterization $\theta$ (or multi-headed variants with some task-specific parameters) such that the main task’s loss $L_{\text{main}}(\theta)$ is minimized, with strategic incorporation of auxiliary task losses $\{L_k(\theta)\}_{k=1}^K$. The key innovation in SML frameworks lies in dynamically modulating each task’s contribution via a learned or pre-defined schedule, adaptive mixing coefficients, or meta-policies, so that task interference is mitigated and positive transfer is promoted.

The most general form of the instantaneous SML objective can be written as:

$L_{\text{total}}(\theta; t) = L_{\text{main}}(\theta) + \sum_{k=1}^K \lambda_k(t) L_k(\theta)$

where $\lambda_k(t)$ are (possibly time-varying or data-dependent) scheduling weights, aligning the optimization trajectory to favor the main task as dictated by an explicit schedule, adaptive policy, or meta-scheduler (Liang et al., 2022, Kiperwasser et al., 2018, Jean et al., 2019, Wu et al., 2020).
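As a minimal sketch of the objective above, the following Python snippet evaluates $L_{\text{total}}(\theta; t)$ from already-computed scalar losses. The function name `scheduled_total_loss` and the two example schedules (a constant weight and a linear decay over 100 steps) are illustrative assumptions, not taken from any of the cited papers.

```python
from typing import Callable, Sequence


def scheduled_total_loss(
    main_loss: float,
    aux_losses: Sequence[float],
    schedules: Sequence[Callable[[int], float]],
    step: int,
) -> float:
    """Return L_main + sum_k lambda_k(t) * L_k for training step t."""
    if len(aux_losses) != len(schedules):
        raise ValueError("need exactly one schedule per auxiliary task")
    weighted_aux = sum(lam(step) * loss for lam, loss in zip(schedules, aux_losses))
    return main_loss + weighted_aux


# Hypothetical schedules: a constant weight and a linear decay to zero over 100 steps.
def constant(step: int) -> float:
    return 0.5


def linear_decay(step: int) -> float:
    return max(0.0, 1.0 - step / 100)


# Same losses, evaluated early and late in training.
early = scheduled_total_loss(2.0, [1.0, 4.0], [constant, linear_decay], step=0)
late = scheduled_total_loss(2.0, [1.0, 4.0], [constant, linear_decay], step=100)
```

With these toy values, the second auxiliary loss contributes fully at step 0 and not at all at step 100, while the main loss is always counted with weight 1.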

2. Scheduling Strategies

SML encompasses a broad spectrum of scheduling strategies:

Static schedules: The contribution of each task is fixed throughout training, e.g. constant mixing ratios or round-robin task selection (Kiperwasser et al., 2018).
Pre-determined dynamic schedules: Schedules shift over time following a hand-designed policy (e.g., exponentially increasing weight on the main task or sigmoidal ramp-up) (Kiperwasser et al., 2018).
Gradient-based scheduling (gradient surgery): At every step, one projects each auxiliary task’s gradient $g_k$ onto the main task’s gradient $g_{\text{main}}$ to retain only the portion aligned (or apply a regularizer if anti-aligned), followed by global time-dependent scaling (Liang et al., 2022).
Adaptive sampling and meta-scheduling: Task selection or weighting is adjusted online based on relative task progress, validation metrics, or a learnable policy—often formulated as a bandit, reinforcement learning, or bi-level optimization problem (Wu et al., 2020, Sharma et al., 2017, Jean et al., 2019).
Affinity-based grouping and sequential updates: Tasks are partitioned into affinity clusters and groups are updated sequentially within each batch, reducing negative transfer and enabling more effective adaptation of task-specific parameters (Jeong et al., 2025).

A summary of representative SML scheduling strategies:

Approach	Scheduling Mechanism	Key Reference
Constant, Exponential, Sigmoid	Fixed or monotonically varying coefficients	(Kiperwasser et al., 2018)
Gradient surgery + time decay	Projection, decay schedule for $\alpha(t)$	(Liang et al., 2022)
Adaptive sampling, RL meta-policy	Bandit, policy-gradients, active task selection	(Sharma et al., 2017, Wu et al., 2020)
Affinity-based grouping	Online affinity metrics + sequential updates	(Jeong et al., 2025)

3. Algorithmic Realizations

Curriculum and Schedule-based Weighting

A general approach is to specify time-varying weights for each task based on an explicit curriculum. For example:

Exponential schedule: $\alpha(t) = 1 - \exp(-\gamma t)$,
Sigmoid schedule: $\alpha(t) = 1/(1 + \exp(-\gamma t))$,

where $\alpha(t)$ defines the main task’s proportion and $1 - \alpha(t)$ is spread over auxiliaries (Kiperwasser et al., 2018). Used in NMT, such schedules progress from heavy auxiliary focus (linguistic structure) to almost pure translation over epochs, embodying a neural curriculum.
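The two schedules can be sketched in a few lines of Python. This is an illustration under stated assumptions: $\alpha(t)$ is read as the probability of drawing a main-task batch, the remaining probability is split uniformly over auxiliary tasks (the source only says it is "spread over auxiliaries"), and the names `exponential_schedule`, `sigmoid_schedule`, and `sample_task` are hypothetical.

```python
import math
import random
from typing import Sequence


def exponential_schedule(t: float, gamma: float) -> float:
    """alpha(t) = 1 - exp(-gamma * t); starts at 0 and approaches 1."""
    return 1.0 - math.exp(-gamma * t)


def sigmoid_schedule(t: float, gamma: float) -> float:
    """alpha(t) = 1 / (1 + exp(-gamma * t)); starts at 0.5 and approaches 1."""
    return 1.0 / (1.0 + math.exp(-gamma * t))


def sample_task(alpha: float, aux_tasks: Sequence[str], rng: random.Random) -> str:
    """Draw the main task with probability alpha, else a uniformly chosen auxiliary.

    The uniform split over auxiliaries is an assumption made for this sketch.
    """
    if rng.random() < alpha:
        return "main"
    return rng.choice(list(aux_tasks))


# Example: how often the main task is drawn early versus late in training.
rng = random.Random(0)
aux = ["parsing", "tagging"]  # hypothetical auxiliary task names
early_draws = [sample_task(exponential_schedule(0.1, 1.0), aux, rng) for _ in range(1000)]
late_draws = [sample_task(exponential_schedule(5.0, 1.0), aux, rng) for _ in range(1000)]
```

Under the exponential schedule, training starts almost entirely on auxiliary tasks, whereas the sigmoid form as written already gives the main task half of the batches at $t = 0$.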

Gradient-based Scheduling

In the context of neural chat translation, SML uses a three-stage pipeline: (1) generic pre-training, (2) in-domain pre-training with all auxiliary tasks, (3) fine-tuning with continued scheduling. In Stages 2 and 3, when computing the gradient, each auxiliary gradient $g_k$ is projected onto $g_{\text{main}}$, scaled by a global $\alpha(t)$ which decays from 1 to 0:

$g(t) = g_{\text{main}} + \alpha(t) \sum_{k=1}^K \frac{g_k \cdot g_{\text{main}}}{\|g_{\text{main}}\|^2}\, g_{\text{main}}$

This ensures that only the auxiliary task contributions aligned with the main task are accumulated, and the regularizer effect is retained for misaligned tasks. $\alpha(t)$ is linearly decayed within each stage (Liang et al., 2022).
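The projection-and-decay step described above can be sketched on plain Python lists. This is a simplified illustration, not the implementation of Liang et al. (2022): it assumes a signed projection of each auxiliary gradient onto the main gradient (so an anti-aligned auxiliary gradient shrinks the update along the main direction, which is one reading of the "regularizer" effect), and the helper names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def project_onto(g_aux: Sequence[float], g_main: Sequence[float]) -> Vector:
    """Component of g_aux along g_main (signed); zero vector if g_main is zero."""
    denom = dot(g_main, g_main)
    if denom == 0.0:
        return [0.0] * len(g_main)
    coeff = dot(g_aux, g_main) / denom
    return [coeff * x for x in g_main]


def linear_decay(step: int, total_steps: int) -> float:
    """alpha(t) decaying linearly from 1 to 0 within a stage."""
    return max(0.0, 1.0 - step / total_steps)


def scheduled_gradient(
    g_main: Sequence[float], aux_grads: Sequence[Sequence[float]], alpha: float
) -> Vector:
    """g_main plus alpha times the sum of projected auxiliary gradients."""
    combined = list(g_main)
    for g_k in aux_grads:
        projected = project_onto(g_k, g_main)
        combined = [c + alpha * p for c, p in zip(combined, projected)]
    return combined


g_main = [1.0, 0.0]
aux = [[2.0, 3.0], [-0.5, 1.0]]  # one aligned and one anti-aligned auxiliary gradient
start_of_stage = scheduled_gradient(g_main, aux, linear_decay(0, 10))
end_of_stage = scheduled_gradient(g_main, aux, linear_decay(10, 10))
```

At the start of a stage the auxiliary projections contribute fully; by the end of the stage the update reduces to the main-task gradient alone. Components of the auxiliary gradients orthogonal to $g_{\text{main}}$ never enter the update.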

Affinity-based Grouping and Sequential Updates

SML by selective group update divides tasks into dynamic clusters based on online affinity metrics such as Proximal Inter-Task Affinity (PIA). After each mini-step for a group, the resulting reduction in loss for each task outside that group is measured, and groups are adaptively merged or split. Within a batch, the groups are updated one after another, so that interference is reduced and task-specific information is better preserved:

Batches of tasks with strong positive affinity are sequentially co-updated.
Hyperparameters: an affinity decay rate and a learning rate, with theoretical conditions ensuring convergence to Pareto-stationary points (Jeong et al., 2025).
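The idea of measuring inter-task affinity and then updating groups in sequence can be illustrated on toy quadratic tasks. This sketch does not reproduce PIA as defined by Jeong et al.; it assumes a simple lookahead proxy (the relative loss reduction on a task after one gradient step on another group), omits the merge/split logic, and uses hypothetical names throughout.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Params = List[float]
Task = Tuple[Callable[[Params], float], Callable[[Params], Params]]


def quadratic_task(center: Sequence[float]) -> Task:
    """Toy task (an assumption): loss = ||theta - center||^2 with its exact gradient."""
    def loss(theta: Params) -> float:
        return sum((t - c) ** 2 for t, c in zip(theta, center))

    def grad(theta: Params) -> Params:
        return [2.0 * (t - c) for t, c in zip(theta, center)]

    return loss, grad


def group_gradient(theta: Params, group: Sequence[str], tasks: Dict[str, Task]) -> Params:
    """Sum the gradients of every task in the group."""
    grads = [tasks[name][1](theta) for name in group]
    return [sum(parts) for parts in zip(*grads)]


def sgd_step(theta: Params, grad: Params, lr: float) -> Params:
    return [t - lr * g for t, g in zip(theta, grad)]


def affinity_after_step(
    theta: Params, group: Sequence[str], tasks: Dict[str, Task], lr: float
) -> Dict[str, float]:
    """Relative loss reduction on each task outside `group` after one step on the group."""
    lookahead = sgd_step(theta, group_gradient(theta, group, tasks), lr)
    affinities = {}
    for name, (loss, _) in tasks.items():
        if name in group:
            continue
        before = loss(theta)
        affinities[name] = 0.0 if before == 0.0 else 1.0 - loss(lookahead) / before
    return affinities


def sequential_group_update(
    theta: Params, groups: Sequence[Sequence[str]], tasks: Dict[str, Task], lr: float
) -> Params:
    """Update the shared parameters one group at a time instead of jointly."""
    for group in groups:
        theta = sgd_step(theta, group_gradient(theta, group, tasks), lr)
    return theta


tasks = {
    "a": quadratic_task([1.0, 0.0]),
    "b": quadratic_task([1.0, 0.2]),   # close to "a": expected positive affinity
    "c": quadratic_task([-1.0, 0.0]),  # opposite to "a": expected negative affinity
}
affinity = affinity_after_step([0.0, 0.0], ["a"], tasks, lr=0.1)
updated = sequential_group_update([0.0, 0.0], [["a", "b"], ["c"]], tasks, lr=0.1)
```

In this toy setting a step on task "a" lowers the loss of "b" and raises the loss of "c", which is the kind of signal that would place "a" and "b" in one group and "c" in another.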

Adaptive and Learnable Scheduling

Meta-scheduling or adaptive SML uses task performance signals (e.g., validation BLEU or loss) to adapt task weights or select the next training task:

Adaptive sampling: The sampling probability for each task at a checkpoint is computed from that task’s score relative to a baseline (Jean et al., 2019).
Active RL/bandit scheduling: Task selection is cast as a contextual bandit or full RL problem, with meta-controllers that sample tasks based on observed or inferred gaps to target performance, uncertainty bonuses (UCB), or explicit rewards maximizing multitask performance (Sharma et al., 2017, Wu et al., 2020).
Self-paced MTL: Tasks and instances are prioritized jointly by a self-paced regularizer, beginning with “easier” tasks/instances and gradually moving to harder ones, with group sparsity controlling task-level prioritization (Li et al., 2016).
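A performance-driven sampler of the kind listed above can be sketched as follows. The specific rule here, a softmax over each task's relative shortfall against a baseline score, is an assumption chosen for illustration and is not the formula of Jean et al. (2019) or Sharma et al. (2017); `sampling_probabilities` and the `temperature` parameter are hypothetical names.

```python
import math
from typing import Dict


def sampling_probabilities(
    scores: Dict[str, float],
    baselines: Dict[str, float],
    temperature: float = 0.5,
) -> Dict[str, float]:
    """Give more sampling mass to tasks that lag further behind their baseline.

    Assumed rule: p_k proportional to exp(gap_k / temperature),
    where gap_k = 1 - score_k / baseline_k (higher score means better).
    """
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    gaps = {}
    for task, score in scores.items():
        baseline = baselines[task]
        if baseline <= 0.0:
            raise ValueError("baseline scores must be positive")
        gaps[task] = 1.0 - score / baseline
    # Subtract the maximum gap before exponentiating for numerical stability.
    max_gap = max(gaps.values())
    weights = {task: math.exp((gap - max_gap) / temperature) for task, gap in gaps.items()}
    total = sum(weights.values())
    return {task: weight / total for task, weight in weights.items()}


# Hypothetical validation scores at a checkpoint: task "a" is far below its
# baseline, task "b" has already reached it.
probs = sampling_probabilities({"a": 20.0, "b": 30.0}, {"a": 40.0, "b": 30.0})
```

Recomputing these probabilities at every checkpoint lets the sampler shift attention toward lagging tasks and away from tasks that have already matched their baseline. A bandit or RL scheduler replaces this fixed rule with a learned one.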

4. Empirical Findings and Impact

The cited studies report improvements from SML methods over conventional MTL and single-task learning across diverse application domains:

Neural Machine Translation: Pre-determined schedules (especially the exponential schedule) yield gains of +0.7 BLEU over strong NMT baselines, and gains of +1.5–2 BLEU when augmented with in-domain pre-training and gradient-based scheduling for chat translation (Liang et al., 2022, Kiperwasser et al., 2018).
Vision Multi-task Benchmarks: Affinity-grouped SML methods achieve up to 80% relative improvement on the aggregate multi-task performance metric compared to a joint-gradient baseline and prior state-of-the-art MTL optimizers (Jeong et al., 2025).
Sequence Learning with Temporally Correlated Tasks: Bi-level and RL-based SML outperform uniform and curriculum transfer on both simultaneous translation and time-series (forecasting) benchmarks, yielding up to +3 BLEU and +0.05 RankIC relative gains (Wu et al., 2020).
Atari Reinforcement Learning: Active sampling SML mechanisms double the multitask normalized mean reward compared to uniform sampling baselines, with robustness across 6–21 task setups (Sharma et al., 2017).
Robustness and Ablations: Ablation studies confirm the importance of scheduler dynamics, group affinity, inverse-projection regularization, and time-dependent weighting schedules. Removing adaptive scheduling uniformly decreases main task performance and exacerbates negative transfer (Liang et al., 2022, Jeong et al., 2025, Jean et al., 2019).

5. Theoretical Frameworks and Analysis

Several SML variants provide explicit theoretical guarantees:

Gradient-Alignment and Loss Reduction: Affinity-based sequential updates provably yield better gradient alignment and lower main-task loss compared to joint updates, for convex losses and small learning rates (Jeong et al., 2025).
Convergence: Under standard Lipschitz-gradient assumptions, sequential group updates converge to Pareto-stationary points across tasks (Jeong et al., 2025).
Meta-learned Schedules: Bi-level optimization is formalized with the outer loop targeting main-task validation loss and the inner loop adapting the parameters via scheduled task sampling, using REINFORCE gradients to train the scheduler policy (Wu et al., 2020).
Self-paced Regimes: SPMTL’s group sparse regularizer ensures “easy” tasks/instances are learned first, with convergence guaranteed by block coordinate descent (Li et al., 2016).

6. Limitations and Open Issues

Computational Overhead: Gradient-based scheduling and meta-learning approaches require multiple backward passes or extra forward passes, although group-based SML can reduce how memory and compute grow with the number of tasks (Jeong et al., 2025).
Schedule Design and Hyperparameter Sensitivity: Hand-tuned schedules can outperform learned schedules in specific settings, but meta-learned schedules generalize better as the task count and task diversity grow (Jean et al., 2019).
Sensitivity to Task Definition: Effectiveness relies on auxiliary tasks being positively correlated with the main task. Poorly chosen or noisy auxiliary tasks can induce negative transfer even with SML (Jeong et al., 2025).
Scalability: Grouping strategies and meta-policies scale to the task counts studied so far but may require adaptation for extreme multitask settings or for very large task-specific parameter sets.

7. Connections and Future Directions

SML is closely related to curriculum learning, multi-objective optimization, adaptive loss weighting, and meta-learning. Recent work situates SML as unifying pre-training, standard MTL, and fine-tuning within a single procedural framework (Kiperwasser et al., 2018, Liang et al., 2022). A plausible implication is that further integration with LLM pre-training, more expressive meta-schedulers (e.g., transformer-based), and automatic curriculum discovery will continue to expand the domain of applicability.

References:

"Scheduled Multi-task Learning for Neural Chat Translation" (Liang et al., 2022)
"Selective Task Group Updates for Multi-Task Optimization" (Jeong et al., 2025)
"Adaptive Scheduling for Multi-Task Learning" (Jean et al., 2019)
"Scheduled Multi-Task Learning: From Syntax to Translation" (Kiperwasser et al., 2018)
"Temporally Correlated Task Scheduling for Sequence Learning" (Wu et al., 2020)
"Learning to Multi-Task by Active Sampling" (Sharma et al., 2017)
"Self-Paced Multi-Task Learning" (Li et al., 2016)