
ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning (2111.10952v2)

Published 22 Nov 2021 in cs.CL and cs.LG

Abstract: Despite the recent success of multi-task learning and transfer learning for NLP, few works have systematically studied the effect of scaling up the number of tasks during pre-training. Towards this goal, this paper introduces ExMix (Extreme Mixture): a massive collection of 107 supervised NLP tasks across diverse domains and task-families. Using ExMix, we study the effect of multi-task pre-training at the largest scale to date, and analyze co-training transfer amongst common families of tasks. Through this analysis, we show that manually curating an ideal set of tasks for multi-task pre-training is not straightforward, and that multi-task scaling can vastly improve models on its own. Finally, we propose ExT5: a model pre-trained using a multi-task objective of self-supervised span denoising and supervised ExMix. Via extensive experiments, we show that ExT5 outperforms strong T5 baselines on SuperGLUE, GEM, Rainbow, Closed-Book QA tasks, and several tasks outside of ExMix. ExT5 also significantly improves sample efficiency while pre-training.

Citations (203)

Summary

  • The paper demonstrates that extreme multi-task pre-training with 107 tasks enhances performance and sample efficiency across various NLP benchmarks.
  • It introduces ExMix, a collection of 107 supervised NLP tasks cast into a text-to-text format for encoder-decoder models, and ExT5, which is pre-trained on a mixture of self-supervised span denoising and supervised ExMix objectives.
  • Experimental results reveal both significant gains in transfer learning and challenges with negative task interactions, highlighting the need for careful task selection.

ExT5: Advancing Multi-Task Learning with Extreme Task Scaling

The paper "ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning" introduces a foundational paper on the implications of extensive multi-task scaling in pre-training NLP models. Grounded in the paradigm of transfer learning, the research seeks to deepen the understanding of how increasing the number of tasks during a model’s pre-training can enhance performance across various NLP tasks. To this end, the authors present ExMix, a substantial compendium of 107 supervised NLP tasks spanning multiple domains and task families, which serves as the basis for proposed extreme multitask scaling endeavors.

Methodological Approach

The methodology centers on the creation of ExMix, a diverse mixture of 107 tasks, all cast into a text-to-text format compatible with encoder-decoder architectures. This uniform formatting enables full parameter sharing and a single training procedure across tasks, simplifying multi-task learning. A significant contribution is the development of ExT5, a model pre-trained with a multi-task objective combining self-supervised span denoising and supervised ExMix tasks.
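To make the mixing idea concrete, the sketch below shows one way a supervised task and a span-denoising stream could be cast to text-to-text pairs and sampled proportionally. The task generators, prompts, and mixing rates here are illustrative assumptions for exposition only; they are not the paper's actual task definitions or mixing ratios.

```python
import random

# Hypothetical task streams: each yields (input_text, target_text) pairs
# already cast into the text-to-text format used by encoder-decoder models.
def nli_examples():
    yield ("nli premise: A man plays guitar. hypothesis: A person makes music.",
           "entailment")

def summarization_examples():
    yield ("summarize: The committee met on Tuesday to discuss the budget ...",
           "Committee discusses budget.")

def span_denoising_examples():
    # Self-supervised span corruption: sentinel tokens mark masked spans.
    yield ("The quick <extra_id_0> jumps over the <extra_id_1> dog.",
           "<extra_id_0> brown fox <extra_id_1> lazy")

TASKS = {
    "nli": nli_examples,
    "summarization": summarization_examples,
    "span_denoising": span_denoising_examples,
}

def mix(task_fns, rates, num_examples, seed=0):
    """Sample examples from the task streams in proportion to the given rates."""
    rng = random.Random(seed)
    names = list(task_fns)
    weights = [rates[name] for name in names]
    iters = {name: task_fns[name]() for name in names}
    batch = []
    for _ in range(num_examples):
        name = rng.choices(names, weights=weights, k=1)[0]
        try:
            batch.append(next(iters[name]))
        except StopIteration:
            iters[name] = task_fns[name]()  # restart an exhausted task stream
            batch.append(next(iters[name]))
    return batch

# Illustrative rates only: weight the self-supervised objective more heavily
# than any individual supervised task.
examples = mix(TASKS, {"nli": 1.0, "summarization": 1.0, "span_denoising": 8.0}, 10)
```

Because every task shares the same input/target interface, adding or removing tasks only changes the sampling weights, not the model or the training loop.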

The authors empirically analyze co-training transfer among task families to map out positive and negative transfer patterns. This analysis underscores how difficult it is to manually curate an ideal task set for multi-task pre-training: the impact of task selection is non-trivial and context-sensitive. Consequently, the authors hypothesize that extreme scaling through the inclusion of many diverse pre-training tasks is the more robust strategy.
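The kind of bookkeeping behind such an analysis can be sketched as a simple transfer matrix: compare each family's score when fine-tuned alone against its score when co-trained with another family. The function and all numbers below are illustrative placeholders, not results from the paper.

```python
# single[f]        -> validation score when family f is trained alone
# cotrained[(a,b)] -> score of family a when co-trained with family b
def transfer_matrix(families, single, cotrained):
    """Relative gain or loss for each target family under each co-training partner."""
    matrix = {}
    for target in families:
        for partner in families:
            if target == partner:
                continue
            delta = cotrained[(target, partner)] - single[target]
            matrix[(target, partner)] = delta / single[target]  # relative transfer
    return matrix

families = ["nli", "summarization", "qa"]
single = {"nli": 80.0, "summarization": 35.0, "qa": 60.0}   # made-up scores
cotrained = {
    ("nli", "summarization"): 79.0, ("nli", "qa"): 81.5,
    ("summarization", "nli"): 34.0, ("summarization", "qa"): 35.5,
    ("qa", "nli"): 61.0, ("qa", "summarization"): 59.0,
}

m = transfer_matrix(families, single, cotrained)
negative = {pair: d for pair, d in m.items() if d < 0}  # pairs showing negative transfer
```

Entries below zero flag negative transfer; the paper's point is that such entries are hard to anticipate in advance, which is what makes manual curation unreliable.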

Experimental Results

Experimentally, ExT5 is benchmarked against strong T5 baselines across a range of NLP benchmarks, including SuperGLUE, GEM, the Rainbow commonsense-reasoning suite, and closed-book QA tasks. The findings indicate that ExT5 consistently surpasses the T5 baselines across multiple metrics. Notably, ExT5 displays superior sample efficiency during pre-training, reaching strong downstream performance after exposure to fewer pre-training tokens.

Further experiments on task scaling show that downstream performance improves as the number of multi-task pre-training tasks grows, particularly under large-batch training. The exploration of task-specific transfer further elucidates intra- and inter-family relationships, surfacing cases of negative transfer that can notably degrade performance.

Implications and Future Directions

The paper's contributions challenge existing paradigms in multi-task and transfer learning by demonstrating that extreme task scaling can serve as a powerful tool for enhancing pre-trained model performance across NLP tasks. This insight paves the way for advancing models with better generalization capabilities, addressing catastrophic forgetting and negative transfer issues inherent in existing approaches.

The foundations laid by ExT5 hold significant implications for future AI developments. A multilingual expansion of ExT5 remains a tangible future research trajectory, promising potential in cross-lingual transfer learning. Moreover, integrating inductive biases into model architectures and exploring gradient manipulation methods are prospective avenues for refining extreme multi-task scaling techniques.

In conclusion, this research outlines the substantial potential of large-scale multitask pre-training to enhance NLP models, urging a re-evaluation of pre-training strategies to incorporate diverse, high-quality supervised learning signals. As AI continues to accelerate, such carefully crafted approaches promise to expand the versatility and efficiency of LLMs, contributing to an evolved understanding and application of transfer learning in NLP.
