
It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners

Published 15 Sep 2020 in cs.CL, cs.AI, and cs.LG | arXiv:2009.07118v2

Abstract: When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown et al., 2020) achieve remarkable few-shot performance. However, enormous amounts of compute are required for training and applying such big models, resulting in a large carbon footprint and making it difficult for researchers and practitioners to use them. We show that performance similar to GPT-3 can be obtained with language models that are much "greener" in that their parameter count is several orders of magnitude smaller. This is achieved by converting textual inputs into cloze questions that contain a task description, combined with gradient-based optimization; exploiting unlabeled data gives further improvements. We identify key factors required for successful natural language understanding with small language models.

Citations (909)

Summary

  • The paper introduces pattern-exploiting training (PET), which reformulates NLP tasks as cloze questions, enabling efficient few-shot learning.
  • The paper demonstrates that models with only about 0.1% of GPT-3's parameters can match or exceed GPT-3's few-shot performance on the SuperGLUE benchmark.
  • The paper employs knowledge distillation and an autoregressive decoding scheme for multi-token outputs, improving efficiency and reducing environmental impact.

Summary of "It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners"

The paper addresses the significant resource demands of large language models such as GPT-3, which, despite their remarkable performance on natural language understanding (NLU) tasks, are not feasible for most practitioners due to high computational costs and environmental impact. It proposes pattern-exploiting training (PET) as a more efficient alternative, achieving comparable performance with significantly smaller models.

Methodology

Pattern-Exploiting Training (PET)

The core technique explored in the paper is PET, which transforms standard NLP tasks into cloze-style questions, allowing much smaller masked language models to compete in few-shot learning scenarios. PET operates in three steps (a hedged code sketch follows the list):

  1. Task Reformulation: Tasks are reformulated into cloze questions using a set of pattern-verbalizer pairs (PVPs). Each PVP consists of a pattern that maps an input to a cloze question and a verbalizer that maps each output label to a specific token.
  2. Model Training: For each PVP, a model is finetuned on the small labeled set with gradient descent, minimizing a cross-entropy loss over the probabilities the masked language model assigns to the verbalized label tokens at the mask position.
  3. Knowledge Distillation: The PVP-specific models are ensembled into a single classifier by training it on soft labels that the ensemble predicts for unlabeled data.
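
To make the reformulation concrete, the following is a minimal sketch (not the authors' released code) of a PVP for an RTE-style entailment task, scored with a masked language model via the HuggingFace transformers library. The specific pattern, verbalizer, and checkpoint are illustrative assumptions:

```python
# Minimal PVP sketch: the pattern maps an input pair to a cloze question,
# the verbalizer maps labels to single tokens; a label's score is the MLM's
# logit for its verbalized token at the mask position. The pattern and
# verbalizer below are illustrative, not the paper's exact choices.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("albert-xxlarge-v2")
model = AutoModelForMaskedLM.from_pretrained("albert-xxlarge-v2")

def pattern(premise: str, hypothesis: str) -> str:
    # Pattern: embed the task in natural language with one mask token.
    return f'"{hypothesis}"? {tokenizer.mask_token}. "{premise}"'

# Verbalizer: each label must map to a single vocabulary token.
VERBALIZER = {"entailment": "Yes", "not_entailment": "No"}

def label_scores(premise: str, hypothesis: str) -> dict:
    inputs = tokenizer(pattern(premise, hypothesis), return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos.item()]
    scores = {}
    for label, word in VERBALIZER.items():
        # Assumes the verbalization tokenizes to exactly one piece.
        token_id = tokenizer(word, add_special_tokens=False).input_ids[0]
        scores[label] = logits[token_id].item()
    # Finetuning applies cross-entropy to the softmax over exactly these scores.
    return scores

print(label_scores("The cat sat on the mat.", "A cat is on a mat."))
```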

Modified PET for Multi-Token Outputs

To adapt PET to tasks whose verbalizations span more than one token, the paper generalizes prediction to multiple mask positions and decodes them autoregressively: at each step, the mask position whose most likely token receives the highest probability is filled in first, and the sequence is re-encoded before predicting the remaining masks. This improves on naive parallel decoding while keeping inference cost manageable.
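
Below is a minimal sketch of that max-confidence decoding loop. The helper name, prompt, and checkpoint are our illustrative assumptions; the paper's implementation additionally restricts predictions to the candidate verbalizations of the task:

```python
# Max-confidence decoding over multiple masks: all positions start masked;
# each iteration fills the single mask whose best prediction is most
# confident, then re-encodes. Names here are illustrative, not the paper's.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("albert-xxlarge-v2")
model = AutoModelForMaskedLM.from_pretrained("albert-xxlarge-v2")

def fill_masks(text_with_masks: str) -> str:
    input_ids = tokenizer(text_with_masks, return_tensors="pt").input_ids
    mask_id = tokenizer.mask_token_id
    while (input_ids == mask_id).any():
        with torch.no_grad():
            probs = model(input_ids=input_ids).logits.softmax(dim=-1)[0]
        mask_positions = (input_ids[0] == mask_id).nonzero().squeeze(-1)
        # Best token and its probability at every remaining mask position.
        best_probs, best_tokens = probs[mask_positions].max(dim=-1)
        pick = best_probs.argmax()  # decode the most confident mask first
        input_ids[0, mask_positions[pick]] = best_tokens[pick]
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

two_masks = f"Paris is the capital {tokenizer.mask_token} {tokenizer.mask_token}."
print(fill_masks(two_masks))
```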

Experimental Results

Performance Metrics

The paper evaluates PET and its iterative variant iPET on the SuperGLUE benchmark using ALBERT-xxlarge-v2 as the underlying model, demonstrating that with only about 0.1% of GPT-3's parameters, PET-trained models outperform GPT-3 on several tasks. Key results include:

  • SuperGLUE Performance: PET shows equivalent or superior performance compared to GPT-3 on various tasks; for example, it achieves higher accuracy on the BoolQ and CB tasks with significantly fewer parameters.
  • Comparison with Other Models: PET surpasses GPT-3 models of equivalent size in most scenarios, and the choice of PVPs significantly impacts performance.
  • Impact of Unlabeled Data: Unlabeled data bolsters the effectiveness of knowledge distillation in PET, allowing insights from multiple task formulations to be combined (a simplified sketch of this distillation step follows the list).
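
For illustration, here is a simplified sketch of that distillation step. The function names, the uniform averaging, and the temperature value are assumptions in the spirit of standard knowledge distillation, not a transcription of the paper's code:

```python
# Simplified PET-style distillation: models finetuned on different PVPs
# label unlabeled examples with probability distributions ("soft labels"),
# and a single student classifier is trained on their average. The uniform
# average and the temperature here are illustrative assumptions.
import torch
import torch.nn.functional as F

def soft_labels(per_pvp_logits, temperature=2.0):
    # per_pvp_logits: one [num_labels] logit vector per PVP-specific model.
    probs = torch.stack([F.softmax(l / temperature, dim=-1) for l in per_pvp_logits])
    return probs.mean(dim=0)  # the ensemble's soft label for one example

def distillation_loss(student_logits, target):
    # Cross-entropy between the student's distribution and the soft target.
    return -(target * F.log_softmax(student_logits, dim=-1)).sum()

# Example: two PVP models disagree; the soft label preserves that uncertainty.
target = soft_labels([torch.tensor([2.0, 0.5]), torch.tensor([0.2, 1.5])])
loss = distillation_loss(torch.tensor([1.0, 0.0]), target)
print(target, loss)
```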

Implications and Future Work

The practical implications of this research are twofold: reducing the carbon footprint and the financial costs associated with deploying large-scale language models, while maintaining competitive performance. The paper advocates further exploration of multi-task settings and the adaptation of PET to generative tasks with larger, more expressive models.

Suggested future directions include integrating generative models into PET and refining task formulations to optimize performance further. Additionally, exploring the applicability of these methods to broader AI subfields could significantly advance sustainable AI practice. In conclusion, this work presents a viable path toward more accessible and environmentally sound AI technologies without sacrificing performance.

Authors (2)

Timo Schick and Hinrich Schütze