
Multitask Prompted Training Enables Zero-Shot Task Generalization (2110.08207v3)

Published 15 Oct 2021 in cs.LG and cs.CL

Abstract: LLMs have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in LLMs' pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language tasks into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts with diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely held-out tasks. We fine-tune a pretrained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16x its size. Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6x its size. All trained models are available at https://github.com/bigscience-workshop/t-zero and all prompts are available at https://github.com/bigscience-workshop/promptsource.

Multitask Prompted Training Enables Zero-Shot Task Generalization: An Overview

The paper is an empirical investigation into the viability of inducing zero-shot task generalization through explicit multitask learning as opposed to relying solely on implicit multitask learning during the pretraining of LLMs. Specifically, this work scrutinizes whether training an LLM across a diverse set of tasks, formatted using natural language prompts, can enhance the model’s zero-shot performance on unseen tasks.

Summary

The authors present a framework for converting diverse NLP tasks into human-readable prompts and demonstrate its efficacy by fine-tuning a T5-based model on many prompted datasets. The fine-tuned model, termed T0, achieves competitive zero-shot performance, often surpassing much larger models.

Introduction

The success of LLMs at zero-shot generalization has primarily been attributed to implicit multitask learning during pretraining, where the model is exposed to a range of tasks embedded in its vast training corpus. This paper examines whether explicitly fine-tuning LLMs on a mixture of NLP tasks formulated as natural language prompts can enhance zero-shot generalization without resorting to massive model sizes.

Methodology

The authors propose creating a universal prompt format supported by a templating language that allows easy conversion of dataset instances into prompted examples. They collect a broad set of prompts from public contributors to build a diverse training set, emphasizing varied wording to ensure robustness to different prompt formulations.
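The snippet below is a minimal, self-contained sketch of this templating idea: a prompt template renders a raw dataset instance into a natural-language input/target pair, and several differently worded templates can be applied to the same instance. The field names and prompt wordings are illustrative assumptions, not the paper's actual PromptSource (Jinja-based) templates.

```python
def apply_template(example, input_template, target_template):
    """Render one dataset instance into a prompted (input, target) pair."""
    return input_template.format(**example), target_template.format(**example)

# A hypothetical NLI-style instance; field names mirror typical RTE-like datasets.
example = {
    "premise": "A cat is sleeping on the sofa.",
    "hypothesis": "An animal is resting indoors.",
    "label_text": "yes",
}

# Two differently worded prompts for the same task, to encourage robustness
# to prompt formulation.
prompts = [
    ("{premise}\nQuestion: Does this imply that \"{hypothesis}\"? Yes or no?", "{label_text}"),
    ("Given that \"{premise}\", is it true that \"{hypothesis}\"?", "{label_text}"),
]

for input_tpl, target_tpl in prompts:
    prompted_input, target = apply_template(example, input_tpl, target_tpl)
    print(prompted_input, "->", target)
```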

The main variant, T0, is built by fine-tuning a pretrained encoder-decoder model (T5+LM, the LM-adapted variant of T5 from Lester et al., 2021) on the multitask mixture of prompted datasets. The authors also examine the variants T0+ and T0++, trained on additional datasets, to explore the impact of increased dataset diversity on model performance.
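As a minimal usage sketch, the released checkpoints can be loaded with Hugging Face Transformers. The example below assumes the smaller T0_3B checkpoint remains available on the Hugging Face hub under the `bigscience` organization (the 11B-parameter T0 and T0++ checkpoints follow the same pattern but need considerably more memory).

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed checkpoint name; verify availability in the project's repository/hub.
model_name = "bigscience/T0_3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Any natural-language prompt works; the model answers in free text.
prompt = ("Is this review positive or negative? "
          "Review: this is the best cast iron skillet you will ever buy")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```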

Evaluation

The authors benchmark the zero-shot performance of T0 on multiple held-out tasks, including natural language inference (NLI), coreference resolution, word sense disambiguation, and sentence completion, and extend the evaluation to novel tasks from the BIG-bench benchmark. T0's performance is compared against several baselines, including GPT-3 models of various sizes.
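The evaluation follows a rank-classification protocol: each candidate answer is scored by the model's log-likelihood given the prompted input, and the highest-scoring choice is taken as the prediction. The sketch below is a simplified approximation of that protocol, not a reproduction of the paper's evaluation code; it assumes the `model` and `tokenizer` objects from the previous snippet and uses an illustrative NLI-style prompt.

```python
import torch
import torch.nn.functional as F

def score_choice(model, tokenizer, prompt, choice):
    """Sum of per-token log-probabilities of `choice` given `prompt`."""
    enc = tokenizer(prompt, return_tensors="pt")
    labels = tokenizer(choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(**enc, labels=labels).logits  # (1, target_len, vocab)
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum().item()

prompt = ("A cat is sleeping on the sofa.\n"
          "Question: Does this imply that \"an animal is resting indoors\"? Yes or no?")
choices = ["yes", "no"]
prediction = max(choices, key=lambda c: score_choice(model, tokenizer, prompt, c))
print(prediction)
```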

Results

  1. Zero-Shot Generalization:
    • T0 consistently outperforms the baseline T5+LM without multitask training across several held-out tasks.
    • Notably, T0 often matches or surpasses GPT-3 models up to 175B parameters in size despite being significantly smaller (11B parameters).
  2. Prompt Robustness:
    • Experiments demonstrate that increasing the number and diversity of prompts leads to improved median performance and decreased variability on unseen tasks.
    • Training on a more extensive collection of tasks generally boosts performance, though it does not consistently reduce performance variability.
  3. Comparison with Alternative Models:
    • T0 and its variants achieve competitive results against FLAN, another multitask prompted model. In some cases, T0++ achieves higher accuracy despite being an order of magnitude smaller in parameter count.
    • T0 shows strong performance on BIG-bench tasks, often surpassing baseline models and indicating effective generalization to novel NLP concepts.

Implications

Theoretical Implications

The findings reinforce the hypothesis that explicit multitask learning can be a potent mechanism for achieving zero-shot generalization. The experiments delineate the benefits of task and prompt diversity during fine-tuning, suggesting that multitask prompted training can yield robust, adaptable models without extremely large parameter counts.

Practical Implications

The practical implications are significant. T0's ability to generalize well to unseen tasks at a fraction of GPT-3's size implies substantial resource savings in both deployment and inference. It democratizes access to high-performing zero-shot models by reducing the computational and financial barriers to training and deploying LLMs, extending their benefits to a broader range of applications and institutions.

Future Work

Future research may involve:

  • Exploring the upper limits of generalization achieved by further increasing prompt and task diversity.
  • Fine-tuning specific aspects of the prompting framework to enhance semantic understanding and task adaptability.
  • Investigating the balance between model size, diversity of training data, and performance outcomes to optimize resource utilization.

This paper lays essential groundwork for advancing zero-shot task generalization through multitask learning, offering a pragmatic and scalable alternative to pretraining at ever-larger scales. Its findings are poised to influence future work toward more efficient and versatile NLP models.

Authors (41)
  1. Victor Sanh (21 papers)
  2. Albert Webson (19 papers)
  3. Colin Raffel (83 papers)
  4. Stephen H. Bach (33 papers)
  5. Lintang Sutawika (14 papers)
  6. Zaid Alyafeai (21 papers)
  7. Antoine Chaffin (13 papers)
  8. Arnaud Stiegler (3 papers)
  9. Teven Le Scao (18 papers)
  10. Arun Raja (4 papers)
  11. Manan Dey (15 papers)
  12. M Saiful Bari (22 papers)
  13. Canwen Xu (32 papers)
  14. Urmish Thakker (26 papers)
  15. Shanya Sharma Sharma (1 paper)
  16. Eliza Szczechla (2 papers)
  17. Taewoon Kim (10 papers)
  18. Gunjan Chhablani (14 papers)
  19. Nihal Nayak (2 papers)
  20. Debajyoti Datta (12 papers)
Citations (1,573)