
Language Models are Few-Shot Learners

Published May 28, 2020 in cs.CL


Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
Figure: Human ability to identify model-generated news articles decreases as model size increases; accuracy also varies with the randomness of the model's sampling.


  • The paper presents an in-depth exploration of GPT-3, a 175 billion parameter autoregressive language model, highlighting its impressive few-shot learning capabilities across diverse NLP tasks.

  • GPT-3 demonstrates significant performance improvements in benchmarks such as LAMBADA, TriviaQA, and SuperGLUE, and shows robust capabilities in multilingual translation without fine-tuning.

  • Despite its strengths, GPT-3 has limitations including underperformance in tasks requiring bidirectional context, risks of training data contamination, and biases influenced by its training data, raising important ethical and practical considerations.

An Analysis of GPT-3: Capabilities, Limitations, and Implications

The paper presents an in-depth exploration of GPT-3, a 175-billion-parameter autoregressive language model. GPT-3 represents a considerable scaling effort, roughly 10x larger than any previous non-sparse language model, and exhibits strong few-shot learning capabilities on a diverse set of NLP tasks. This discussion outlines the paper's core contributions, covering the model's performance, its limitations, and the broader implications for both the research community and society.

Overview and Methodology

The model was pre-trained on a vast and diverse corpus of text, a weighted mixture of filtered Common Crawl, WebText2, Books1, Books2, and English-language Wikipedia. Notably, the training methodology avoided task-specific architectures entirely, relying on a single task-agnostic model. This generality means GPT-3's performance hinges primarily on the scale of its pre-training data and its parameter count.
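The weighted mixture can be sketched as follows. This is an illustrative sketch, not the paper's training code: the corpus names are stand-in labels, and the weights are the approximate sampling fractions reported in the paper, where higher-quality sources are sampled more often relative to their raw size.

```python
import random

# Approximate sampling fractions reported in the GPT-3 paper; the
# dataset names here are stand-in labels, not real data loaders.
MIXTURE = {
    "common_crawl_filtered": 0.60,
    "webtext2": 0.22,
    "books1": 0.08,
    "books2": 0.08,
    "wikipedia": 0.03,
}

def sample_source(rng: random.Random) -> str:
    """Pick which corpus the next training document is drawn from.

    random.choices normalizes the weights, so the slight rounding in
    the reported fractions (they sum to 1.01) is harmless.
    """
    names = list(MIXTURE)
    return rng.choices(names, weights=[MIXTURE[n] for n in names], k=1)[0]

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(10_000)]
print(draws.count("common_crawl_filtered") / len(draws))  # close to 0.60
```

Because sampling is per-document rather than proportional to corpus size, some sources (e.g., Wikipedia) are effectively seen multiple times during training while Common Crawl is seen less than once.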

Performance was evaluated across a broad spectrum of benchmarks in zero-shot, one-shot, and few-shot settings, without traditional fine-tuning. The authors systematically compared eight model sizes, clarifying how scaling up parameters enhances in-context learning.
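The three evaluation settings differ only in how many demonstrations are placed in the prompt. A minimal sketch, assuming a simple `source => target` format (the helper name and separator are illustrative; the paper formats prompts as a natural-language task description followed by demonstrations):

```python
def build_prompt(task_description: str,
                 demonstrations: list[tuple[str, str]],
                 query: str) -> str:
    """Format a GPT-3-style evaluation prompt.

    An empty demonstration list gives the zero-shot setting, one pair
    gives one-shot, and K pairs give K-shot. The model receives this
    text as conditioning only; no gradient updates are performed.
    """
    lines = [task_description]
    for source, target in demonstrations:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model completes the text after "=>"
    return "\n".join(lines)

few_shot = build_prompt(
    "Translate English to French:",
    [("cheese", "fromage"), ("house", "maison")],
    "cat",
)
print(few_shot)
```

Scoring then reduces to reading off (or ranking the likelihood of) the model's continuation after the final `=>`.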

Numerical Performance Highlights

GPT-3 showcases notable improvements in NLP benchmarks, particularly under few-shot learning configurations. Key performances include:

  • LAMBADA: GPT-3 achieved a few-shot accuracy of 86.4%, outperforming the previous state of the art by 18.4 percentage points and highlighting its ability to model long-range dependencies in text.
  • TriviaQA: The model achieved 71.2% accuracy in the few-shot setting, exceeding fine-tuned models operating in the same closed-book regime.
  • SuperGLUE: GPT-3 few-shot averaged 71.8%, closely rivaling models fine-tuned on extensive supervised datasets.
  • Translation Tasks: Few-shot performance outstripped prior unsupervised NMT for several language pairs, indicating GPT-3's robust capability for multilingual translation without fine-tuning.

Noteworthy is GPT-3's in-context learning, which enables the model to perform new tasks simply by observing examples within the prompt at test time. This was evidenced by substantial gains from zero-shot to few-shot evaluations across tasks such as PIQA and reading comprehension datasets like CoQA.

Limitations and Challenges

Despite its extensive capabilities, GPT-3 reveals several limitations that warrant further investigation:

  • Bidirectionality and Task-Specific Performance: The model underperforms on tasks that benefit from bidirectional context or from directly comparing two pieces of text, such as certain reading comprehension datasets and natural language inference benchmarks (e.g., ANLI). This suggests the potential utility of integrating bidirectional objectives alongside autoregressive training.
  • Contamination: Training data contamination, i.e., overlap between the web-scale training corpus and test sets such as LAMBADA and PIQA, risks inflating reported results. The authors' efforts to identify and mitigate this issue underscore the difficulty of managing web-scale datasets.
  • Bias and Fairness: Preliminary analyses reveal that GPT-3 inherits biases prevalent in its training data, reflecting societal stereotypes across gender, race, and religion. Addressing these biases is crucial for responsible deployment in sensitive applications.
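The contamination analysis above boils down to checking n-gram overlap between benchmark examples and the training corpus. A simplified stand-in for the paper's deduplication pipeline (the function names are illustrative; the paper reports filtering on 13-gram overlap, while the tiny demo below uses n=3 so it is runnable on toy strings):

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(example: str, train_ngrams: set, n: int) -> bool:
    """Flag a benchmark example whose text shares any n-gram with training data."""
    return bool(word_ngrams(example, n) & train_ngrams)

# Tiny demo with n=3; the real filter operates over the full training corpus.
train = word_ngrams("the quick brown fox jumps over the lazy dog", n=3)
print(is_contaminated("a quick brown fox appeared", train, n=3))          # True
print(is_contaminated("completely unrelated sentence here", train, n=3))  # False
```

In practice the paper reports results both with and without flagged examples, since removing them entirely would shrink some benchmarks substantially.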

Broader Implications

The implications of GPT-3 extend beyond enhancing NLP tasks. The model has significant potential for both beneficial and harmful applications, from improving automated assistance systems to enabling sophisticated, automated misinformation dissemination.

Practical and Ethical Considerations

  1. Misuse Potential: As language models like GPT-3 become more proficient at generating human-like text, the risk of misuse for producing misleading or harmful content increases. Ongoing dialogue and the development of mitigation frameworks are necessary.
  2. Bias Mitigation: Given the entrenchment of biases in large-scale language models, there is an imperative for developing methodologies to detect, understand, and mitigate discriminatory tendencies in generated content.
  3. Energy Efficiency: The substantial computational resources required for training models at this scale necessitate consideration of environmental and economic impacts. Future research could focus on efficient training methods and model distillation strategies to ameliorate these concerns.

Future Directions

Exploring bidirectional training techniques, refining in-context learning algorithms, and broadening the range of tasks and modalities integrated with language models are promising avenues for advancing the capabilities demonstrated by GPT-3. The ongoing challenge will be to balance scaling benefits with interpretability, fairness, and the responsible use of AI technologies.

In conclusion, GPT-3 signifies a substantial stride in the evolution of language models, demonstrating the powerful potential of scaling up model size and training data. However, it simultaneously unveils new challenges and responsibilities in the development and deployment of AI systems within society. The balance between innovation and ethical practice will be pivotal in steering the future trajectory of AI research and its applications.
