ZeroGen: Efficient Zero-shot Learning via Dataset Generation
The paper "ZeroGen: Efficient Zero-shot Learning via Dataset Generation" presents a novel approach to zero-shot learning in the context of NLP tasks. This approach leverages the generative capabilities of large pre-trained LLMs (PLMs) to generate datasets from scratch, which can then be used to train smaller, task-specific models with significantly fewer parameters.
Summary and Numerical Results
The ZeroGen framework proposes an innovative solution to zero-shot learning by creating synthetic datasets with PLMs. Specifically, training data for a given task is generated in a fully unsupervised manner by prompting a PLM with carefully crafted, task-specific prompts. These synthetically generated datasets are then used to train tiny task models (TAMs), such as LSTMs, that are orders of magnitude smaller than the PLMs themselves.
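A minimal sketch of the generation step is shown below, using the Hugging Face `transformers` API. The checkpoint name, prompt wording, and sampling hyperparameters are illustrative assumptions for a sentiment task, not the paper's exact configuration.

```python
# Sketch of ZeroGen-style synthetic dataset generation for binary sentiment
# classification. Prompts and hyperparameters are illustrative assumptions.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # the paper scales up to GPT2-XL
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Label-conditioned prompts: each label gets its own generation prompt.
PROMPTS = {
    "positive": 'The movie review in positive sentiment is: "',
    "negative": 'The movie review in negative sentiment is: "',
}

def generate_examples(label: str, n: int = 4, max_new_tokens: int = 40):
    """Generate n pseudo-labeled examples for the given label."""
    inputs = tokenizer(PROMPTS[label], return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,               # stochastic decoding for dataset diversity
        top_k=40,                     # sampling hyperparameter chosen for illustration
        max_new_tokens=max_new_tokens,
        num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    texts = tokenizer.batch_decode(outputs[:, prompt_len:], skip_special_tokens=True)
    # Keep only the text up to the closing quote, if the model produced one.
    return [(t.split('"')[0].strip(), label) for t in texts]

synthetic_dataset = generate_examples("positive") + generate_examples("negative")
```

The resulting (text, label) pairs form the synthetic training set on which a TAM is subsequently trained.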
The authors conducted extensive experiments across NLP tasks, including text classification, question answering, and natural language inference, on datasets such as SST-2, IMDb, SQuAD, QNLI, and RTE. TAMs trained within the ZeroGen framework outperformed the PLMs themselves under prompt-based zero-shot evaluation. Notably, the TAMs achieved these better zero-shot results with only about 0.4% of the parameters of larger PLMs such as GPT2-XL.
A standout reported result is that in certain low-resource settings, TAMs trained on ZeroGen-generated data outperformed those trained on human annotations. Moreover, using larger PLMs for dataset generation yielded a notable improvement in downstream task performance, indicating that the knowledge stored in PLMs can be effectively harnessed within this framework.
Implications and Theoretical Contributions
By relying entirely on synthetic data, ZeroGen provides a model-agnostic approach to data-free knowledge distillation. It removes the prerequisite of human-annotated data and reduces the cost of ML infrastructure, particularly at inference time, since only the small TAM, rather than the large PLM, needs to be served.
Furthermore, ZeroGen offers a new perspective on reference-free (unreferenced) evaluation of text generation: the quality of the machine-generated text directly influences downstream task performance, so that performance serves as an indirect evaluation measure for generation models and decoding protocols. The paper's analysis also shows that sampling-strategy parameters, such as top-k or nucleus (top-p) sampling, influence the diversity and quality of the generated datasets.
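To make the decoding knob concrete, the following is a minimal PyTorch sketch of nucleus (top-p) filtering over next-token logits; the vocabulary size and threshold are illustrative assumptions, not values taken from the paper.

```python
import torch

def nucleus_filter(logits: torch.Tensor, top_p: float = 0.9) -> torch.Tensor:
    """Keep only the smallest set of tokens whose cumulative probability
    exceeds top_p; all other logits are set to -inf. Smaller top_p gives
    safer but less diverse samples, larger top_p more diverse ones."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum_probs = probs.cumsum(dim=-1)
    # Remove a token if the cumulative mass *before* it already exceeds
    # top_p; this always keeps at least the single most probable token.
    remove = (cum_probs - probs) > top_p
    sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
    filtered = torch.full_like(logits, float("-inf"))
    return filtered.scatter(-1, sorted_idx, sorted_logits)

# Example usage: sample one next token from the filtered distribution.
logits = torch.randn(50257)  # GPT-2-sized vocabulary, random logits for demo
probs = torch.softmax(nucleus_filter(logits, top_p=0.9), dim=-1)
next_token_id = torch.multinomial(probs, num_samples=1)
```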
ZeroGen also revisits prompt engineering, exposing the challenges of, and key insights into, designing prompts that adequately encode human knowledge or instructions for a specific task. The paper reports that natural-language-style prompts tend to yield better generation quality than control-code-style prompts, underscoring the importance of linguistic alignment with the PLM's pre-training corpus.
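The contrast between the two prompt styles can be illustrated as follows; the wording of both templates is an assumption meant to mirror the paper's distinction, not its exact prompts.

```python
# Two prompt styles for label-conditioned generation (illustrative wording,
# not the paper's exact templates).
label = "positive"

# Natural-language style: reads like fluent text the PLM saw in pre-training.
natural_prompt = f'The movie review in {label} sentiment is: "'

# Control-code style: terse key-value tags, less like the pre-training corpus.
control_code_prompt = f"<sentiment={label}> <task=movie-review>"
```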
Future Directions
The prospects of ZeroGen as a zero-shot learning paradigm extend to several potential improvements and applications. Though promising, the approach shows variable prompt efficacy across tasks, suggesting further work on multi-task prompt-based pre-training. The paper also points to optimizing decoding strategies to better balance diversity and label correctness in dataset generation. Finally, methods for learning under noisy labels could be integrated into TAM training to further improve performance.
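As one concrete possibility (an assumption for illustration, not a method prescribed by the paper), a noise-tolerant objective such as label-smoothed cross-entropy could be used when training a small LSTM TAM on the synthetic data; the architecture and hyperparameters below are arbitrary examples.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """A tiny LSTM task model (TAM) for binary sentiment classification."""
    def __init__(self, vocab_size: int, embed_dim: int = 100, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return self.head(h_n[-1])

model = LSTMClassifier(vocab_size=30000)
# Label smoothing is one simple noise-robust choice when synthetic labels
# may be wrong; the smoothing value is an arbitrary example.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

token_ids = torch.randint(0, 30000, (8, 32))   # a dummy batch of token ids
labels = torch.randint(0, 2, (8,))             # pseudo-labels from generation
optimizer.zero_grad()
loss = criterion(model(token_ids), labels)
loss.backward()
optimizer.step()
```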
In conclusion, ZeroGen's findings advocate a substantial shift in how models can be trained efficiently and sustainably, spotlighting the capability of PLMs to democratize robust zero-shot performance in NLP through small, deployable task models. This research lays substantial groundwork for improving the synthesis and application of training data across machine learning contexts.