Multi-Task Learning as Multi-Objective Optimization
Abstract
The paper discusses an innovative approach to multi-task learning (MTL) by framing it as a multi-objective optimization problem. Traditional methods in MTL typically optimize a proxy objective: a weighted sum of task-specific losses. This proxy is only valid when the tasks do not compete, which is rarely the case in practice. The authors instead propose gradient-based algorithms from the multi-objective optimization literature to find Pareto optimal solutions, directly addressing the conflicts between tasks that are inherent to MTL. Their contribution includes an efficient algorithm that optimizes an upper bound of the multi-objective loss, remaining scalable even for large-scale learning problems. Empirical evaluations on datasets for digit classification, scene understanding, and multi-label classification demonstrate the superiority of the proposed method over existing MTL approaches.
Introduction
The idea of MTL has roots in Stein's paradox, which highlights the benefits of joint estimation across seemingly independent tasks when they share a data-generating process. In MTL, where multiple tasks are solved concurrently, the inductive biases from one task can beneficially inform others. However, the prevailing practice in MTL relies on hard or soft parameter sharing, which poses challenges when tasks compete over shared model capacity. The authors argue for a shift from the weighted-sum approach to a multi-objective optimization framework, aiming for Pareto optimality: solutions that cannot be improved on any task without degrading performance on another.
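The notion of Pareto dominance that underlies this framing can be made concrete with a small sketch. The function below (a hypothetical helper, not from the paper) checks whether one solution's per-task loss vector dominates another's, assuming lower losses are better:

```python
import numpy as np

def dominates(losses_a, losses_b):
    """Return True if solution A Pareto-dominates solution B:
    A is no worse on every task and strictly better on at least one."""
    a, b = np.asarray(losses_a), np.asarray(losses_b)
    return bool(np.all(a <= b) and np.any(a < b))

# Toy example with two tasks (lower loss is better):
print(dominates([0.2, 0.3], [0.4, 0.3]))  # True: A dominates B
print(dominates([0.2, 0.5], [0.4, 0.3]))  # False: a genuine trade-off
```

A Pareto optimal solution is simply one that no other reachable solution dominates; in competing-task settings there is a whole front of such solutions rather than a single optimum.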
Proposed Methodology
The authors propose leveraging the multiple-gradient descent algorithm (MGDA) to navigate the multi-objective landscape. MGDA uses the Karush-Kuhn-Tucker (KKT) conditions to find, at each step, either a common descent direction that improves all tasks simultaneously or a point satisfying the first-order conditions for Pareto optimality. Recognizing the computational cost of MGDA in the high-dimensional parameter spaces typical of deep networks, the authors present a Frank-Wolfe-based solver to address this scaling challenge.
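At the core of MGDA is a min-norm problem: find the convex combination of task gradients with the smallest norm. For two tasks this has a closed form, which the following sketch illustrates (variable names are illustrative, not from the paper's code):

```python
import numpy as np

def min_norm_coeff(g1, g2):
    """Closed-form solution of min_a ||a*g1 + (1-a)*g2||^2 over a in [0, 1].
    The resulting convex combination is a common descent direction for both
    tasks, or (near) zero at a Pareto-stationary point."""
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:
        return 0.5  # gradients coincide; any combination works
    a = float((g2 - g1) @ g2) / denom
    return min(1.0, max(0.0, a))  # clip to the simplex

# Symmetric orthogonal gradients yield equal weighting:
g1 = np.array([1.0, 0.0])
g2 = np.array([0.0, 1.0])
a = min_norm_coeff(g1, g2)           # 0.5
direction = a * g1 + (1 - a) * g2    # shared descent direction [0.5, 0.5]
```

For more than two tasks this two-point solution becomes the line-search subroutine inside a Frank-Wolfe loop over the probability simplex, which is what makes the approach tractable for many tasks.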
To further enhance efficiency, the authors introduce an upper-bound approximation of the multi-objective loss for the encoder-decoder architectures prevalent in deep networks. The approximation replaces gradients with respect to the shared parameters by gradients with respect to the shared representation, so the task-specific gradients needed by the solver can be obtained in a single backward pass, substantially reducing the computational overhead. A theoretical analysis guarantees that optimizing this upper bound still yields a Pareto optimal solution, provided certain Jacobian matrices are non-singular, which is a realistic assumption in practice.
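The structure of this upper-bound trick can be sketched with a toy linear encoder and two linear decoders (a minimal illustration under simplified assumptions, not the paper's implementation): per-task gradients are taken with respect to the shared representation `z`, the min-norm weights are computed over those cheap gradients, and only the resulting weighted loss is back-propagated through the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for an encoder-decoder MTL model:
W = rng.normal(size=(4, 6))   # shared (encoder) parameters
V1 = rng.normal(size=4)       # task-1 decoder
V2 = rng.normal(size=4)       # task-2 decoder
x = rng.normal(size=6)

z = W @ x                     # shared representation
# Gradients of each toy loss (V_i . z) w.r.t. z: decoder-level only,
# no backward pass through the encoder is needed here.
gz1, gz2 = V1, V2

# Min-norm weights over the representation-level gradients (two-task closed form).
diff = gz1 - gz2
alpha = float((gz2 - gz1) @ gz2) / float(diff @ diff)
alpha = min(1.0, max(0.0, alpha))

# One backward pass through the encoder with the weighted gradient.
g_shared = np.outer(alpha * gz1 + (1 - alpha) * gz2, x)

# Sanity check: identical to differentiating the weighted loss directly.
assert np.allclose(g_shared, np.outer(alpha * V1 + (1 - alpha) * V2, x))
```

The saving is that the number of full backward passes through the (large) shared encoder drops from one per task to one per step, while the per-task work touches only the (small) decoders.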
Experiments
The empirical performance of the proposed method was demonstrated across three experimental setups:
- MultiMNIST: An adaptation of the MNIST dataset for MTL, highlighting the method's ability to evenly distribute model capacity between two conflicting digit classification tasks. The proposed method matched the single-task baseline performance, outperforming static or heuristic-based scaling strategies.
- Multi-Label Classification on CelebA: This setting involved treating each face attribute as a separate task, resulting in a 40-task MTL problem. The proposed methodology not only improved average multi-label classification accuracy but also performed consistently well across individual tasks compared to traditional uniform scaling.
- Scene Understanding with Cityscapes: Addressing three distinct tasks (semantic segmentation, instance segmentation, and depth estimation) in a single framework, the method surpassed single-task performance, indicating beneficial task interactions.
Conclusions and Implications
The research provides a compelling argument for viewing MTL through the lens of multi-objective optimization, offering a pathway to embrace the inherent conflicts of multi-task settings rather than sidestepping them through simplistic aggregations. The approach promises a more robust framework for developing high-capacity models that cater to multiple objectives simultaneously, aligning with the growing trend towards comprehensive AI systems capable of solving a wide array of tasks. Looking forward, this framework will likely stimulate further research into multi-objective methods in MTL and find broader applications across domains that demand simultaneous optimization of competing objectives.
In conclusion, by redefining MTL as a multi-objective optimization problem, the authors lay a foundation for future developments in AI systems that need to efficiently balance trade-offs between multiple learning objectives.