
MTL as Multi-Objective Optimization

Updated 12 January 2026
  • Multi-task learning can be cast as multi-objective optimization: each task's loss is treated as a separate objective, and solutions are characterized by Pareto optimality rather than a single global optimum.
  • Algorithmic approaches such as scalarization, dynamic weighting, and gradient-based methods efficiently balance performance across tasks.
  • Modern techniques leverage hypernetworks and constrained optimization to navigate high-dimensional, nonconvex loss landscapes.

Multi-task learning (MTL) as multi-objective optimization (MOO) is a formalism in which each task’s loss in a multi-task system is treated as a distinct objective, and solutions are sought that reflect optimal trade-offs—typically captured via notions of Pareto optimality—from the theory of vector optimization. This perspective has enabled the rigorous development of algorithmic, theoretical, and practical frameworks for deep learning architectures, leading to improved performance, better interpretability of trade-offs among tasks, and advances in optimization methodology for high-dimensional, stochastic, and nonconvex systems.

1. Mathematical Foundations of MTL as Multi-Objective Optimization

Classically, multi-task learning considers the joint minimization of per-task losses. Formally, for a parameter vector $\theta \in \mathbb{R}^d$ and $T$ tasks with individual continuous, differentiable losses $L_t(\theta)$, the MTL problem is the vector-valued objective:

$$\min_{\theta \in \mathbb{R}^d} \; \mathbf{L}(\theta) = \big(L_1(\theta), \dots, L_T(\theta)\big)^\top$$

This is a multi-objective optimization problem (MOP): in general, no single $\theta^*$ simultaneously minimizes all $L_t$ unless the per-task minima coincide. The solution concept underpinning this framework is Pareto optimality, where a solution $\theta^*$ is Pareto-optimal if no other $\theta$ achieves $\mathbf{L}(\theta) \le \mathbf{L}(\theta^*)$ coordinatewise with at least one strict inequality. The set of non-dominated solutions is the Pareto set, and its image in loss space is the Pareto front (Sener et al., 2018, Lin et al., 2019, Peitz et al., 2024).
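The dominance relation defining the Pareto set can be checked directly; a minimal sketch in plain Python (function names are illustrative):

```python
def dominates(a, b):
    """True if loss vector a Pareto-dominates b: a <= b coordinatewise
    with at least one strict inequality."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Return the non-dominated subset of a list of loss vectors."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

front = pareto_front([(1, 3), (2, 2), (3, 1), (2, 3)])  # (2, 3) is dominated by (1, 3)
```

In loss space this filters out any candidate that another candidate beats on every task at once; what remains is (an empirical approximation of) the Pareto front.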

The first-order necessary condition for $\theta^*$ to be Pareto-optimal is the existence of convex-combination coefficients $\alpha \in \Delta^{T-1}$ such that:

$$\sum_{t=1}^{T} \alpha_t \nabla_\theta L_t(\theta^*) = 0, \qquad \alpha_t \ge 0, \quad \sum_{t=1}^{T} \alpha_t = 1$$

Under mild regularity, any stationary point of a scalarization of $\mathbf{L}$ with some weights $\alpha$ is Pareto-critical.
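For $T = 2$, the min-norm problem underlying these conditions (and the MGDA subproblem discussed below) has a closed form; a sketch assuming plain NumPy gradient vectors:

```python
import numpy as np

def min_norm_2task(g1, g2):
    """Closed-form minimizer of ||a*g1 + (1-a)*g2||^2 over a in [0, 1]:
    the two-task case of the MGDA quadratic subproblem."""
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:              # gradients coincide: any weighting works
        return 0.5
    a = float((g2 - g1) @ g2 / denom)
    return min(max(a, 0.0), 1.0)  # clip to the simplex

g1 = np.array([1.0, 0.0])
g2 = np.array([0.0, 1.0])
a = min_norm_2task(g1, g2)        # 0.5 for orthogonal, equal-norm gradients
d = a * g1 + (1 - a) * g2         # common descent direction
```

If `d` is nonzero, stepping along `-d` decreases both losses; if the minimum norm is zero, the Pareto-stationarity condition above is satisfied.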

2. Principal Algorithmic Approaches

Algorithmic strategies for MTL as MOO fall into several categories (Peitz et al., 2024, Sener et al., 2018, Lin et al., 2019):

| Approach | Description | Trade-off / Scope |
|---|---|---|
| Scalarization ("decide-then-optimize") | Optimize weighted sums $\min_\theta \sum_t w_t L_t(\theta)$ for fixed $w$ | Recovers only convex segments of the front; requires multiple runs for full coverage |
| Dynamic weighting | Vary $w$ online (e.g., GradNorm, DWA) to adapt to relative task progress | Heuristic adaptation, not fully grounded in Pareto theory |
| Gradient-based MOO (e.g., MGDA) | Solve a small QP for $\alpha$ at each step to obtain a Pareto-stationary descent direction | Can reach nonconvex and boundary fronts; efficient for moderate $T$ |
| Pareto-front approximation ("optimize-then-decide") | Solve many scalarized subproblems for diverse $w$; population or evolutionary algorithms | Covers the front densely but at higher compute cost |

Several variants extend these strategies with additional regularizers (PCGrad, CAGrad), differential constraint handling, or Pareto front parameterization via hypernetworks (Lin et al., 2020, Gupta et al., 2021, Peitz et al., 2024).
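As an illustration of a conflict-aware update, PCGrad projects one task's gradient off the conflicting component of another's; a minimal two-task sketch (not the full multi-task algorithm, which shuffles over task pairs):

```python
import numpy as np

def pcgrad_pair(g_i, g_j):
    """If g_i conflicts with g_j (negative inner product), project g_i
    onto the normal plane of g_j; otherwise return it unchanged."""
    dot = g_i @ g_j
    if dot < 0:
        g_i = g_i - (dot / (g_j @ g_j)) * g_j
    return g_i

g1 = np.array([1.0, 1.0])
g2 = np.array([-1.0, 0.5])
g1_proj = pcgrad_pair(g1, g2)   # conflicting component removed: g1_proj @ g2 == 0
```

After projection, following `g1_proj` no longer increases task 2's loss to first order, which is the sense in which the update is "conflict-aware."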

3. Modern Techniques and Pareto Front Characterizations

a. Decomposed Scalarization and Joint Optimization

Recent works frame MTL as a collection of unconstrained, preference-weighted subproblems. Instead of independent optimization, iterative parameter transfer among subproblems accelerates convergence and improves coverage:

  • Multi-Task Learning with Multi-Task Optimization (MT²O) introduces a mixing matrix in the update of each parameter slot across subproblems, leading to provably faster joint convergence with respect to the spectral radius of the global update operator (Bai et al., 2024).
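The mixing idea can be illustrated schematically: each preference-weighted subproblem takes its own gradient step, and a row-stochastic mixing matrix then transfers progress across subproblems. This is an illustrative sketch, not the paper's exact update rule:

```python
import numpy as np

def mixed_step(thetas, grads, M, lr=0.1):
    """One illustrative MT2O-style iteration: independent gradient steps on
    K subproblems (rows of thetas), followed by mixing via a row-stochastic
    matrix M that transfers parameters across subproblems."""
    stepped = thetas - lr * grads
    return M @ stepped

K, d = 3, 4                                   # 3 subproblems, 4 parameters each
rng = np.random.default_rng(0)
thetas = rng.normal(size=(K, d))
grads = rng.normal(size=(K, d))
M = np.full((K, K), 0.1) + 0.7 * np.eye(K)    # rows sum to 1: mild cross-transfer
thetas = mixed_step(thetas, grads, M)
```

The spectral radius of the combined update operator (gradient step composed with `M`) is what governs the joint convergence rate in the analysis cited above.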

b. Continuous Pareto Manifold Approximation

Controllable Pareto MTL and low-rank Pareto-manifold frameworks leverage hypernetworks to parameterize the mapping from trade-off preferences $\mathbf{p} \in \Delta^{T-1}$ to network weights, allowing real-time inference and dense exploration of the Pareto front using a single, manageable network:

  • Controllable Pareto Multi-Task Learning (CPMTL) constructs a hypernetwork generator $g_\phi$ such that $\theta_{\mathbf{p}} = g_\phi(\mathbf{p})$ for any chosen $\mathbf{p}$ (Lin et al., 2020).
  • Efficient Pareto Manifold Learning with Low-Rank Structure uses a low-rank adaptation on top of a common shared core, scaling to large $T$ and reducing parameter cost (Chen et al., 2024).
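The hypernetwork idea reduces, in its simplest form, to a learned map from the preference simplex to a flat weight vector of the target network. A minimal linear sketch (the map $g_\phi$ here is a random linear layer purely for illustration; real hypernetworks are trained MLPs):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_weights = 3, 10
W = rng.normal(size=(n_weights, T))   # stands in for trained hypernetwork params

def hyper(p):
    """Map a preference vector p on the simplex to target-network weights."""
    p = np.asarray(p, dtype=float)
    assert np.all(p >= 0) and abs(p.sum() - 1.0) < 1e-8   # p must lie in the simplex
    return W @ p

theta_a = hyper([1.0, 0.0, 0.0])      # weights specializing toward task 1
theta_b = hyper([1/3, 1/3, 1/3])      # weights for a balanced trade-off
```

At deployment time, changing `p` changes the generated weights instantly, which is what enables real-time navigation of the (approximate) Pareto front without retraining.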

c. Constrained, Priority-Aware, and RL-Based MOO

  • Prioritized Multi-Task Learning with Lagrangian Differential Multiplier Methods formalizes a sequence of constrained subproblems, optimizing lower-priority tasks without degrading higher-priority ones using Lagrange multipliers (Cheng et al., 2024).
  • RL-based MOO frames the encoder learning (e.g., in semantic broadcast communications) as a constrained multi-objective problem solved via PPO with adaptive, multi-gradient aggregation for task weighting and alternating optimization across tri-level loops (Lu et al., 28 Apr 2025).
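The priority-constrained formulation can be illustrated with dual ascent on a scalar toy problem: minimize a low-priority loss subject to a bound on a high-priority one (all functions here are illustrative, not from the cited work):

```python
# Toy problem: minimize (x - 2)^2 subject to x^2 <= 1 (high-priority constraint).
# Differential-multiplier methods alternate gradient descent on the primal
# variable with gradient ascent on the Lagrange multiplier.
x, lam = 0.0, 0.0
for _ in range(2000):
    grad_x = 2 * (x - 2) + lam * 2 * x          # gradient of the Lagrangian in x
    x -= 0.01 * grad_x
    lam = max(0.0, lam + 0.01 * (x * x - 1))    # ascend on constraint violation
# x settles near the constrained optimum x = 1, with lam near its KKT value 1
```

The multiplier grows only while the high-priority constraint is violated, so the low-priority objective is pursued exactly as far as the priority budget allows; this is the mechanism that replaces manual task-weight sweeps.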

d. Pareto Set Exploration

Continuous Pareto front exploration via tangent-space characterization and Hessian-free Newton/Krylov methods enables the local expansion of the Pareto set and the construction of piecewise-linear approximations to the front, accounting for high-dimensional curvature (Ma et al., 2020).

4. Practical Considerations and Theoretical Insights

a. Loss Landscape and Overparameterization

In deep MTL, loss landscapes are highly nonconvex, and overparameterization can trivialize the trade-off structure: with sufficient capacity and appropriate architectural partitioning (disjoint task heads), all tasks can be simultaneously minimized, leading to a Pareto front that collapses to a single optimum (Ruchte et al., 2021). Genuine Pareto trade-offs only emerge under capacity constraints or explicit parameter sharing.

b. Convergence Guarantees

Most gradient-based MOO methods have convergence guarantees to Pareto-stationary points under mild smoothness and convexity assumptions. For instance, iterative methods such as SDMGrad and MGDA converge to $\epsilon$-Pareto-stationarity at rates determined by step sizes, the complexity of the QP, and the number of tasks (Xiao et al., 2023, Peitz et al., 2024). Methods leveraging explicit preference vectors and transfer (e.g., MT²O) secure a faster contraction compared to fully independent optimization (Bai et al., 2024).

c. Algorithmic Efficiency

  • For moderate task numbers ($T \lesssim 20$), explicit QP-based or hypernetwork-based multi-objective solvers are tractable.
  • For large $T$, scalable approaches include low-rank manifold representations, objective sampling, and stochastic optimization (Chen et al., 2024, Xiao et al., 2023).
  • Evolutionary and surrogate-model methods (e.g., MOEA/WST) offer efficient Pareto-front coverage where gradient computation is expensive, introducing Wasserstein-metric diversity in the solution set (Ponti, 2021).

5. Extensions, Controversies, and Limitations

a. Aligned vs. Conflicting Objectives

Recent studies (AMOO) highlight scenarios where task objectives are "aligned," i.e., share an exact or approximate minimizer. In these regimes, using weighted GD with adaptive curvature-aligned weights leads to superior convergence compared to trade-off-oriented methods, and elaborate Pareto methods are not required (Efroni et al., 19 Feb 2025). Conversely, failure to account for objective alignment can result in unnecessary complexity and missed acceleration opportunities.

b. Empirical Front Collapse

Empirical studies show that in typical MTL benchmarks (e.g., MultiMNIST, FashionMNIST overlays), with sufficient model capacity, methods attempting to reconstruct the Pareto front (MGDA, PHN, etc.) produce nearly degenerate (collapsed) fronts, corroborating the theoretical prediction that capacity—not inherent task conflict—governs the trade-off landscape (Ruchte et al., 2021).

c. Sustainability and Sparsification

Multi-objective formalisms naturally extend to regularization and network compression, where sparsity is introduced as an additional objective. Modified Chebyshev scalarizations with augmented Lagrangian solvers yield sparse, yet high-performing, multi-task networks. Adaptive tying of parameter blocks can further reduce model complexity without significant loss of accuracy (Hotegni et al., 2023).
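The (modified) Chebyshev scalarization referenced here takes the worst weighted deviation from an ideal point, plus a small augmentation term; a minimal sketch (the augmentation coefficient and reference point are illustrative):

```python
def chebyshev(losses, weights, ideal, rho=1e-3):
    """Augmented Chebyshev scalarization:
    max_t w_t (L_t - z_t) + rho * sum_t w_t (L_t - z_t),
    where z is an ideal (utopia) point in loss space."""
    devs = [w * (l - z) for l, w, z in zip(losses, weights, ideal)]
    return max(devs) + rho * sum(devs)

s = chebyshev(losses=[0.8, 0.3], weights=[0.5, 0.5], ideal=[0.1, 0.1])
```

Unlike a plain weighted sum, minimizing this scalarization over a grid of weights can recover points on nonconvex segments of the Pareto front, which is why it pairs naturally with the augmented-Lagrangian sparsity objectives described above.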

d. Prioritization and Constrained Trade-off Enforcement

Advanced frameworks provide explicit mechanisms to encode strict priority orders among tasks (no degradation allowed for high-priority losses), formulated as constrained optimization solved via differential Lagrangian multipliers, thereby automating task weight adaptation and eliminating manual hyperparameter sweeps (Cheng et al., 2024).

6. Taxonomy, Applications, and Practical Recommendations

A comprehensive taxonomy covers:

  • Scalarization methods: single compromise solution, grid sweeps.
  • Pareto-based/gradient-based methods: explicit stationarity enforcement, conflict-aware updates.
  • Pareto front approximation: multi-run, population-based, continuation/hypernetwork parameterization.
  • Constraint handling: sequential, Lagrangian, bi- or tri-level formulations.

Applications span computer vision (semantic segmentation, depth estimation, multi-label classification), natural language processing, reinforcement learning (multi-reward RLHF, multi-task control), recommender systems, and resource-aware neural architecture design (Sener et al., 2018, Bai et al., 2024, Peitz et al., 2024, Lu et al., 28 Apr 2025).

7. Outlook and Open Problems

Key challenges remain in scaling MOO-based MTL to extremely large $T$, designing benchmarks with genuinely conflicting tasks, and unifying theory for stochastic, nonconvex objectives. Further research is warranted on principled integration with continual learning, constrained resource settings, and feedback-rich domains where priorities must adapt online. The Pareto formalism offers a principled substrate for these advances, but the interplay between model capacity, task heterogeneity, and optimization geometry remains an open field of inquiry (Peitz et al., 2024, Ruchte et al., 2021).
