- The paper demonstrates that unitary scalarization achieves comparable performance to specialized multi-task optimization algorithms with significantly lower computational cost.
- Experimental results in both supervised and reinforcement learning settings indicate that complex optimizers largely act as implicit regularizers rather than fundamentally better optimizers.
- The theoretical analysis reveals that increased algorithmic complexity does not necessarily enhance task performance, and that methods such as MGDA and IMTL can under-optimize the training objective relative to the simpler unitary approach.
In Defense of the Unitary Scalarization for Deep Multi-Task Learning
The paper "In Defense of the Unitary Scalarization for Deep Multi-Task Learning" critically examines the recent trend in specialized multi-task optimization algorithms and evaluates the effectiveness of the traditional unitary scalarization method. The authors argue that while many recent multi-task learning (MTL) methods introduce complex algorithms requiring per-task gradients, unitary scalarization combined with standard regularization techniques can perform similarly well in both supervised and reinforcement learning settings. They suggest these specialized optimizers may not always be necessary, and through a series of experiments and technical analyses, they speculate these methods often act as regularizers rather than inherently superior optimizers.
Overview of Multi-Task Optimization Methods
Unitary scalarization, the simplest formulation for MTL, minimizes the sum of per-task losses and therefore never needs individual task gradients. In contrast, more complex methods such as the Multiple-Gradient Descent Algorithm (MGDA), Impartial Multi-Task Learning (IMTL), and PCGrad compute a separate gradient for each task, which substantially increases compute and memory requirements. Notably, unitary scalarization requires only a single backward pass, significantly reducing overhead.
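To make the cost difference concrete, here is a minimal PyTorch-style sketch of a unitary scalarization training step; the function and argument names (`model`, `task_losses`, and so on) are illustrative placeholders, not the paper's code.

```python
import torch


def unitary_scalarization_step(model, optimizer, batch, task_losses):
    """One optimization step on the plain sum of per-task losses.

    `task_losses` is an illustrative list of callables, one per task,
    each mapping (model, batch) -> scalar loss tensor.
    """
    optimizer.zero_grad()
    # Sum the per-task losses into a single scalar objective.
    total_loss = sum(loss_fn(model, batch) for loss_fn in task_losses)
    # A single backward pass computes the gradient of the sum; no
    # per-task gradients are materialized, unlike MGDA/IMTL/PCGrad,
    # which need one backward pass (and one stored gradient) per task.
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```

Standard regularization (dropout inside `model`, weight decay in `optimizer`) plugs in unchanged, which is exactly the simple baseline the paper defends.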
Empirical Comparison and Analysis
The authors rigorously benchmarked unitary scalarization against several notable specialized multi-task optimizers (SMTOs), including MGDA, IMTL, GradDrop, and others, using a unified experimental setup across both supervised learning datasets (Multi-MNIST, CelebA, Cityscapes) and reinforcement learning benchmarks (Meta-World MT10 and MT50). None of the specialized optimizers consistently outperformed unitary scalarization. The results highlighted that unitary scalarization, when paired with standard regularization techniques such as dropout and weight decay, matched or even outperformed the more complex methods.
Across these experiments, SMTOs incurred markedly higher computational cost without delivering consistent performance gains over unitary scalarization. The empirical analysis further pointed to overfitting, rather than optimization difficulty, as the main driver of performance differences, suggesting that SMTOs contribute chiefly through a form of implicit regularization.
Theoretical Implications and Regularization Theory
A significant portion of the paper is devoted to analyzing specialized optimizers through the lens of regularization. The authors propose that the perceived effectiveness of these optimizers stems primarily from their implicit regularization effects. They provide theoretical insights into the broad convergence sets of the MGDA and IMTL optimizers, which include stationary points that do not necessarily correspond to optimal task performance but can mitigate overfitting.
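For reference, MGDA's update direction is usually written as the minimum-norm point in the convex hull of the per-task gradients; the formulation below is the standard one from the multiple-gradient descent literature, reproduced here for intuition rather than copied from the paper:

```latex
d^{\star} \;=\; \sum_{t=1}^{T} \alpha_t^{\star}\, g_t,
\qquad
\alpha^{\star} \;=\; \operatorname*{arg\,min}_{\alpha \in \Delta^{T-1}}
\Bigl\lVert \sum_{t=1}^{T} \alpha_t\, g_t \Bigr\rVert_2^{2},
\qquad g_t = \nabla_{\theta} \mathcal{L}_t(\theta).
```

Since \(d^{\star} = 0\) wherever some convex combination of task gradients vanishes (a Pareto-stationary point), the set of points at which MGDA can stop is strictly larger than the set of minimizers of the summed loss, which is the sense in which it may under-optimize.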
The paper suggests that MGDA and IMTL often under-optimize the summed loss compared to unitary scalarization, yet thereby guide training towards flatter, more generalizable solutions. Furthermore, the stochasticity in PCGrad and GradDrop injects noise that mimics known regularization techniques (a sketch of PCGrad's projection step follows below), further supporting the authors' claims.
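For intuition about where that noise comes from, below is a minimal sketch of PCGrad-style gradient surgery on flattened per-task gradients. It follows the published description of the algorithm (project each task gradient away from any other task gradient it conflicts with, visiting the other tasks in random order), but it is an illustrative reimplementation, not the code used in the paper.

```python
import random

import torch


def pcgrad_combine(task_grads):
    """Combine per-task gradients with PCGrad-style gradient surgery.

    `task_grads` is a list of 1-D tensors, one flattened gradient per task.
    """
    combined = []
    for i, g_i in enumerate(task_grads):
        g = g_i.clone()
        # Visit the other tasks in a random order: this randomness is the
        # stochastic, noise-like component discussed above.
        others = [g_j for j, g_j in enumerate(task_grads) if j != i]
        random.shuffle(others)
        for g_j in others:
            dot = torch.dot(g, g_j)
            if dot < 0:  # gradients conflict (negative cosine similarity)
                # Project g onto the normal plane of g_j, removing the
                # component that points against the other task.
                g = g - (dot / g_j.norm() ** 2) * g_j
        combined.append(g)
    # The final update direction is the sum of the surgically altered gradients.
    return torch.stack(combined).sum(dim=0)
```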
Future Developments and Conclusion
This work challenges the necessity and practicality of utilizing complex multi-task optimizers in real-world applications. By scrutinizing and empirically evaluating these methods against a simple baseline, it encourages the field to reconsider the trajectory of MTL research. The authors call for future research to discover scenarios where these specialized techniques may be truly beneficial and emphasize the importance of transparency and clarity in the evaluation of MTL algorithms. Moreover, they advocate for additional studies focused on understanding the intricacies of regularization within multi-task settings, which could lead to more effective and efficient utilization of MTL frameworks.
In conclusion, the authors highlight the importance of critical assessment in the development of machine learning models, advocating for simpler, more interpretable solutions unless significant performance improvements justify the additional complexity. The paper suggests revisiting fundamental machine learning principles and leveraging them to advance multi-task learning rather than excessively relying on complicated optimization paradigms without clear empirical benefits.