Multi-task Loss Function
- Multi-task loss functions are composite objectives that integrate weighted task-specific losses to enable joint optimization across diverse tasks.
- They use strategies like uncertainty-based and gradient-norm weighting to balance loss scales and resolve optimization conflicts.
- Advanced methods incorporate cross-task regularizers and alignment terms, leading to enhanced model stability in vision, NLP, and recommender systems.
A multi-task loss function is a composite objective used in multi-task learning (MTL) to jointly optimize several task-specific losses within a shared model architecture. This loss function is central to modern deep learning approaches that address multiple, possibly heterogeneous objectives simultaneously by combining their respective error signals into a single differentiable training objective. Multi-task loss functions are essential for enabling knowledge transfer across tasks, balancing performance, and resolving conflicts in joint optimization settings.
1. General Formulation and Variants
The canonical multi-task loss function aggregates per-task losses $\mathcal{L}_k(\theta)$ for tasks $k = 1, \dots, K$ into a scalar training objective via a weighting scheme:

$$\mathcal{L}(\theta) = \sum_{k=1}^{K} w_k \, \mathcal{L}_k(\theta)$$

where $w_k$ is the weight for task $k$ and $\theta$ represents the shared model parameters. The choice of $w_k$ profoundly impacts the training dynamics. Several notable formulations appear in the literature:
- Equal weighting: $w_k = 1$ for all $k$
- Uncertainty-based weighting: weights are inversely proportional to aleatoric task uncertainty, commonly set as $w_k = 1/(2\sigma_k^2)$ with a learned $\sigma_k$ (Kirchdorfer et al., 2024, Silva et al., 2020)
- Gradient-norm-based schemes: $w_k \propto 1/\lVert \nabla_\theta \mathcal{L}_k \rVert$ to equalize per-task influence (Crawshaw et al., 2021)
- Geometric mean: $\mathcal{L} = \big(\prod_{k=1}^{K} \mathcal{L}_k\big)^{1/K}$, providing inherent scale-invariance and balanced gradient allocation (Chennupati et al., 2019)
- Pairwise or cross-consistency loss: additional terms may directly tie outputs across tasks through ranking or alignment constraints (Durmus et al., 2024, Nakano et al., 2021)
For heterogeneous tasks (e.g., classification + regression), extra normalization or explicit balancing is often required to prevent domination by a single head due to raw loss scale.
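The scalarized objective and the scale-domination problem can be illustrated with a minimal sketch in plain Python; the loss values and per-task scales below are placeholder numbers, not results from any cited paper:

```python
# Weighted scalarization of per-task losses: L = sum_k w_k * L_k.
# Values are illustrative placeholders for a classification + regression pair.

def multitask_loss(task_losses, weights):
    """Combine per-task losses into a single scalar objective."""
    assert len(task_losses) == len(weights)
    return sum(w * l for w, l in zip(weights, task_losses))

# Raw losses on very different scales: cross-entropy vs. squared error.
losses = {"classification": 0.7, "regression": 142.0}

# Equal weighting lets the regression head dominate the objective...
equal = multitask_loss(list(losses.values()), [1.0, 1.0])

# ...whereas normalizing each loss by its (running) scale restores balance.
scales = {"classification": 0.7, "regression": 142.0}  # e.g. initial-loss estimates
normalized = multitask_loss(
    [losses[k] / scales[k] for k in losses], [1.0, 1.0]
)

print(equal)       # 142.7 -- dominated by the regression term
print(normalized)  # 2.0  -- each task contributes equally
```

The normalization step here stands in for the "extra normalization or explicit balancing" mentioned above; adaptive schemes in the next section learn such scales rather than fixing them.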
2. Task Weighting and Dynamic Strategies
Choosing or learning the weights $w_k$ is a central challenge. Several adaptive and theoretically grounded strategies have been proposed:
- Homoscedastic uncertainty weighting: model each task's intrinsic noise via a learned variance $\sigma_k^2$, yielding the joint objective
$$\mathcal{L}(\theta, \sigma) = \sum_{k=1}^{K} \frac{1}{2\sigma_k^2}\,\mathcal{L}_k(\theta) + \log \sigma_k$$
This scheme was formalized in (Kirchdorfer et al., 2024, Silva et al., 2020), and is widely deployed in vision, NLP, and recommender systems.
- Analytical uncertainty weighting with softmax (UW-SO): analytically solve for the optimal $\sigma_k$, then softmax-normalize the inverse losses for stability, i.e. $w_k = \operatorname{softmax}_k\!\big(1/(T\,\mathcal{L}_k)\big)$ with temperature $T$
- Scaled Loss Approximate Weighting (SLAW): approximate per-task gradient magnitude by an exponentially averaged loss standard deviation, assigning higher weights to flatter-loss tasks:
$$w_k = \frac{K/s_k}{\sum_{j=1}^{K} 1/s_j}$$
where $s_k$ is an exponential moving estimate of the standard deviation of $\mathcal{L}_k$ (Crawshaw et al., 2021)
- HydaLearn: for primary–auxiliary task pairs, recompute the weights $w_k$ at every batch to maximize the primary-task metric gain, based on a simulated single-step gradient improvement (Verboven et al., 2020)
- FairGrad: compute weights that optimize a chosen $\alpha$-fairness utility of the directional loss decrease, yielding allocations that interpolate between equal, proportional, and max-min weighting (Ban et al., 2024)
- LDC-MTL (BiLB4MTL): bilevel optimization over weight vectors to minimize pairwise loss discrepancies after a coarse normalization; achieves Pareto-stationary solutions with only $O(1)$ time and memory per step (Xiao et al., 12 Feb 2025)
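The homoscedastic uncertainty objective above can be sketched in a few lines of plain Python. Here the log-variances are fixed example values rather than learned parameters (in practice they are optimized jointly with $\theta$ by gradient descent), and the loss numbers are illustrative:

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Homoscedastic uncertainty weighting:
    L = sum_k  L_k / (2 * sigma_k^2) + log(sigma_k),
    parameterized via s_k = log(sigma_k^2) for numerical stability."""
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        precision = math.exp(-s)                   # 1 / sigma_k^2
        total += 0.5 * precision * loss + 0.5 * s  # log(sigma_k) = s / 2
    return total

# Two tasks with mismatched raw scales; s_k would be learned in practice.
losses = [0.7, 142.0]
log_vars = [0.0, 5.0]  # sigma^2 = 1 and ~148: the noisy task is down-weighted

print(round(uncertainty_weighted_loss(losses, log_vars), 3))  # 3.328
```

The $+\log \sigma_k$ term acts as a regularizer: without it, the model could trivially shrink every task's contribution by inflating its predicted uncertainty.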
3. Specialized Loss Structures and Consistency Terms
Beyond scalarization, modern multi-task losses frequently incorporate additional cross-task regularizers or contrastive terms:
- Alignment and Cross-Task Consistency: Explicitly regularize the outputs of one task to be recoverable from another via small auxiliary networks, enforcing output consistency or cycle-consistency losses, e.g. (Nakano et al., 2021)
- Pairwise Ranking Loss: for cascaded or causally linked tasks (e.g., CTR and CVR prediction), penalize margin violations between predictions of the primary and derived tasks as an auxiliary loss, e.g. a hinge term of the form $\max(0,\, \hat{p}_{\text{derived}} - \hat{p}_{\text{primary}} + m)$, since a derived probability cannot exceed the primary one
- Triplet or contrastive losses as auxiliary objectives (e.g., triplet loss between title and description embeddings alongside tagging loss) (Siskind et al., 2021)
- Cycle-consistent or feature-based perceptual losses: In vision, perceptual distances in multi-task-trained feature encoders are used as regularizers for image-to-image models (Zhu et al., 2023)
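The pairwise ranking idea for cascaded tasks can be sketched as follows; the function name, the margin default, and the probability values are illustrative, not taken from a specific paper:

```python
def pairwise_ranking_loss(p_primary, p_derived, margin=0.0):
    """Hinge penalty for violations of the ordering p_derived <= p_primary,
    as arises for cascaded tasks where the derived event (e.g. click AND
    convert) is a subset of the primary event (click)."""
    return max(0.0, p_derived - p_primary + margin)

# Consistent prediction pair: no penalty.
print(pairwise_ranking_loss(p_primary=0.30, p_derived=0.10))  # 0.0

# Inconsistent pair: the derived probability exceeds the primary one.
print(round(pairwise_ranking_loss(p_primary=0.05, p_derived=0.20), 4))  # 0.15
```

Such a term is added to the scalarized objective with its own weight, nudging the heads toward mutually consistent outputs without changing either task's primary loss.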
4. Optimization Implications and Practical Considerations
Multi-task loss landscape complexity, optimization stability, and convergence behavior are distinct challenges:
- Loss scale and imbalance: Strongly varying raw loss values (classification vs regression) can lead to poor minima if not normalized or adaptively weighted (Chennupati et al., 2019, Kirchdorfer et al., 2024).
- Gradient conflicts: conflicting (negatively aligned) or near-orthogonal per-task gradients may prevent progress on some tasks; gradient-manipulation and discrepancy-minimizing schemes resolve such issues (Ban et al., 2024, Xiao et al., 12 Feb 2025).
- Sample composition and drift: Dynamic weighting per mini-batch (e.g., in HydaLearn) addresses batch-wise variance in task relevance or informativeness (Verboven et al., 2020).
- Computational scalability: O(1) methods like SLAW or BiLB4MTL scale to high task counts, while gradient-based schemes like MGDA do not (Crawshaw et al., 2021, Xiao et al., 12 Feb 2025).
Empirical results consistently show that adaptive or balanced weighting—via uncertainty, geometric mean, or discrepancy-control—yields superior average performance and less per-task variance compared to naive loss summation. Choice of normalization scheme has significant empirical effects (Xiao et al., 12 Feb 2025, Kirchdorfer et al., 2024).
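Gradient conflict, mentioned in the list above, is commonly diagnosed by the sign of the inner product (or cosine) between per-task gradients; a minimal sketch with toy 2-D gradient vectors in plain Python:

```python
import math

def cosine_similarity(g1, g2):
    """Cosine of the angle between two per-task gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm = math.sqrt(sum(a * a for a in g1)) * math.sqrt(sum(b * b for b in g2))
    return dot / norm

# Toy per-task gradients with respect to the shared parameters.
g_task_a = [1.0, 0.0]
g_task_b = [-1.0, 0.2]   # points mostly opposite to task A's gradient

cos = cosine_similarity(g_task_a, g_task_b)
print(cos < 0)  # True: a negative cosine signals a gradient conflict
```

Gradient-manipulation methods use exactly this kind of check to decide when to project, clip, or reweight a task's gradient before the shared update.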
5. Domain-Specific Instantiations
Multi-task loss functions are instantiated differently across domains:
- Vision: Pixel-wise or semantic losses are combined (e.g., cross-entropy for segmentation, Huber/MSE for depth, geometric mean or softmax-weighted scalarization) (Chennupati et al., 2019, Kirchdorfer et al., 2024)
- Audio: Multi-task networks for event detection/localization combine weighted sigmoid cross-entropy, regression losses for event boundary localization, and possibly auxiliary regression on source position (Phan et al., 2017, Phan et al., 2020)
- NLP: Combined token-level tagging and supplementary contrastive losses (e.g., triplet losses over document pairs) (Siskind et al., 2021)
- Recommender Systems: Parallel classification (CTR, CVR) and regression (order volume) objectives are tied with custom total-probability and sequence-aware loss expressions, with uncertainty-weighted multi-heads (Jiang et al., 2020)
- Robotics: Multi-output CNNs for grasp quality, angle, width, and auxiliary depth, sometimes with spatially-masked loss terms to focus on salient regions (Prew et al., 2020)
6. Empirical Evaluation and Impact
A substantial literature demonstrates that carefully constructed and dynamically weighted multi-task losses outperform naive or constant-weight schemes on a variety of benchmarks, yielding both higher mean performance and narrower inter-task variance:
| Method | Domains | Construction | Scalability | Notable Metric Gains |
|---|---|---|---|---|
| UW, UW-O, UW-SO | Vision, Recommender | Uncertainty-weighted sum / softmax | O(1) per batch | Consistently best or SoTA on CelebA, Cityscapes (Kirchdorfer et al., 2024, Jiang et al., 2020) |
| SLAW | Vision, Drug, Regression | Loss std-weighted sum | O(1) per batch | Uniform performance as K→100+ (Crawshaw et al., 2021) |
| GLS | Vision | Geometric mean | O(1) per batch | +23% segmentation mIoU (Chennupati et al., 2019) |
| LDC-MTL (BiLB4MTL) | Vision, Chemistry | Bilevel, loss-discrepancy focus | O(1) per batch | Superior Δm% and runtime (Xiao et al., 12 Feb 2025) |
| HydaLearn | Mortality, Mortgage | Dynamic, per-batch metric-gain | O(1), needs 3x grad | Outperforms static, GradNorm (Verboven et al., 2020) |
| FairGrad | Vision, RL | α-fairness, gradient-level | O(K), does not scale | Best rank, Δm% on multi-task (Ban et al., 2024) |
| Cross-task Consistency | Vision | Auxiliary XTC/ALIGN losses | O(1) per batch | Best mIoU/rel. error (Nakano et al., 2021) |
Experimental results also highlight practical nuances such as diminishing gains for advanced weighting in high-capacity models and the necessity for joint hyperparameter tuning (learning rate, normalization, temperature parameters) (Kirchdorfer et al., 2024, Xiao et al., 12 Feb 2025).
7. Limitations and Operational Guidelines
No universal weighting or structuring strategy is optimal for all MTL problems. Careful empirical validation is required to select between static, uncertainty-driven, geometric, or fairness-based criteria. Limitations of current practices include sensitivity to initial weighting, difficulties with highly imbalanced or noisy tasks, and scalability issues for gradient-based balancing in high task-count regimes. Batch-level dynamic methods and bilevel schemes address some but not all of these challenges.
Key recommendations include:
- Normalize heterogeneous losses prior to weighting
- Prefer adaptive or theoretically grounded weighting schemes (UW, SLAW, LDC-MTL) for robust balancing
- Whenever inter-task information is essential, add explicit alignment, consistency, or ranking regularizers
- Tune per-task weights, normalization, or softmax temperatures with respect to validation performance metrics suited to application goals
- Validate not only aggregate performance but also per-task degradation or negative transfer (Silva et al., 2020)
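The first recommendation (normalize heterogeneous losses before weighting) can be sketched with a simple running-mean normalizer; the class name and the decay constant are illustrative choices, not a published recipe:

```python
class LossNormalizer:
    """Divide each raw task loss by an exponential moving average of its
    own magnitude, so heterogeneous losses reach comparable scale before
    the task weights w_k are applied."""

    def __init__(self, num_tasks, decay=0.9):
        self.decay = decay
        self.ema = [None] * num_tasks

    def normalize(self, raw_losses):
        out = []
        for k, loss in enumerate(raw_losses):
            if self.ema[k] is None:
                self.ema[k] = loss  # initialize EMA with the first observation
            else:
                self.ema[k] = self.decay * self.ema[k] + (1 - self.decay) * loss
            out.append(loss / self.ema[k])
        return out

norm = LossNormalizer(num_tasks=2)
print(norm.normalize([0.7, 142.0]))  # first step: each loss normalized to 1.0
```

After normalization, whichever weighting scheme is chosen (static, uncertainty-based, or discrepancy-minimizing) operates on losses of comparable magnitude, which is the precondition most of the adaptive methods surveyed here assume.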
In sum, the multi-task loss function is a foundational technical construct in modern MTL, with a growing body of research providing rigorous strategies for dynamic weighting, inter-task regularization, and empirically robust optimization (Kirchdorfer et al., 2024, Chennupati et al., 2019, Xiao et al., 12 Feb 2025, Nakano et al., 2021, Siskind et al., 2021, Jiang et al., 2020).