
Language Models are Few-Shot Learners (2005.14165v4)

Published 28 May 2020 in cs.CL

Abstract: Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

Citations (34,484)


Summary

  • The paper demonstrates that GPT-3, with 175B parameters, achieves competitive few-shot performance without any task-specific fine-tuning.
  • It utilizes a deep 96-layer transformer architecture and scales performance predictably with increased model size and data.
  • The experimental evaluation confirms robust performance across tasks like translation, comprehension, and arithmetic while highlighting some limitations.

Language Models are Few-Shot Learners

This paper presents the architecture and performance characteristics of GPT-3, a large autoregressive language model with 175 billion parameters. The model demonstrates remarkable few-shot learning capabilities: it performs new tasks from a handful of demonstrations provided only in the input sequence, without any parameter updates. Below, we explore the methods, experimental setup, and implications of GPT-3 as a few-shot learner.

Model Architecture and Training

GPT-3 utilizes a transformer-based autoregressive model architecture similar to its predecessors in the GPT family but scaled significantly. It features 96 transformer layers, with 96 attention heads per layer, each head having a dimensionality of 128. The model was trained on 300 billion tokens gathered from a carefully filtered and deduplicated dataset, comprising Common Crawl, WebText2, Books1, Books2, and Wikipedia.
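As a rough sanity check, the stated depth and head configuration reproduce the 175B parameter count. The sketch below assumes the standard GPT-style transformer block (attention projections plus an MLP with a 4x hidden expansion) and the ~50K BPE vocabulary used by the GPT family; it is an approximation for illustration, not the paper's exact accounting.

```python
# Approximate parameter count for the GPT-3 175B configuration described above.
# Assumes a standard GPT transformer block; layer norms and biases are ignored.

n_layers = 96
n_heads = 96
d_head = 128
d_model = n_heads * d_head                 # 12288, implied by 96 heads x 128 dims
vocab_size = 50257                         # BPE vocabulary used by the GPT family (assumed)

attn_params = 4 * d_model ** 2             # Q, K, V, and output projections
mlp_params = 2 * d_model * (4 * d_model)   # two linear layers with 4x expansion
per_layer = attn_params + mlp_params       # ~1.81B parameters per layer

total = n_layers * per_layer + vocab_size * d_model  # plus token embeddings
print(f"~{total / 1e9:.0f}B parameters")             # ~175B
```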

Notably, GPT-3 does not employ task-specific fine-tuning. Instead, it leverages in-context learning, where the model adapts to new tasks using examples given in the prompt. Training used the Adam optimizer, with batch size and learning rate adjusted with model size, supported by extensive model parallelism across GPU clusters.
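Since tasks and demonstrations are specified purely via text, a few-shot "query" is just a concatenation of examples followed by an unanswered case for the model to complete. The sketch below uses a simple Q/A template; the paper's actual prompt formats vary by task, so this is illustrative only.

```python
# Minimal sketch of few-shot prompt construction for in-context learning.
# The model is expected to continue the text after the final "A:".

def build_few_shot_prompt(instruction, demonstrations, query):
    """demonstrations: list of (input, output) pairs shown to the model."""
    lines = [instruction, ""]
    for x, y in demonstrations:
        lines.append(f"Q: {x}")
        lines.append(f"A: {y}")
        lines.append("")
    lines.append(f"Q: {query}")
    lines.append("A:")          # the model's continuation is taken as the answer
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("house", "maison")],
    "dog",
)
print(prompt)
```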

Experimental Evaluation and Performance

GPT-3 is assessed on a range of benchmarks in zero-shot, one-shot, and few-shot settings to evaluate its ability to perform without fine-tuning. The few-shot results highlight the model's capacity to infer tasks from minimal context and to compete with state-of-the-art fine-tuned systems (Figure 1).

Figure 1: Aggregate performance across all 42 accuracy-denominated benchmarks. While zero-shot performance improves steadily with model size, few-shot performance increases more rapidly, demonstrating that larger models are more proficient at in-context learning.

Key findings from the experiments include:

  • Zero-shot and Few-shot Learning: GPT-3 achieves impressive zero-shot and few-shot performance on NLP tasks such as translation, comprehension, and question-answering, often matching or surpassing fine-tuned baselines.
  • Scaling Laws: Performance improves predictably as model size and compute are scaled up, consistent with established scaling laws for language modeling (a toy power-law fit is sketched after this list).
  • Task Versatility: GPT-3 delivers robust performance on diverse tasks, including arithmetic and word unscrambling, demonstrating a generalized capability to handle unseen challenges.
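As a toy illustration of the scaling-law point, one can fit a power law to loss-versus-compute measurements, which appears as a straight line in log-log space. The compute and loss values below are invented for demonstration and are not figures from the paper; only the functional form reflects the observation.

```python
# Illustrative power-law fit of validation loss versus training compute.
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])  # training compute (FLOPs), hypothetical
loss = np.array([3.2, 2.9, 2.6, 2.35, 2.1])         # validation loss, hypothetical

# Fit log(loss) = log(a) + b * log(compute)  =>  loss ~ a * compute**b
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
print(f"loss ~ {np.exp(log_a):.2f} * C^{b:.3f}")     # negative exponent: loss falls with compute
```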

Implications and Future Research

The results indicate that larger models like GPT-3 can effectively learn new tasks during inference without traditional training or fine-tuning. This paradigm shift towards few-shot learning opens avenues for developing models with broader applicability and reduced need for large labeled datasets for each task.

While GPT-3 demonstrates compelling capabilities, it still faces limitations such as loss of coherence in long text generation and biases inherited from its training data. Addressing these issues, improving efficiency, and exploring multimodal integration and more grounded training objectives remain promising directions. The paper also examines a broader societal concern: human evaluators have difficulty distinguishing GPT-3-generated news articles from human-written ones, and the difficulty grows with model size (Figure 2).

Figure 2: People's ability to identify whether news articles are model-generated (measured by the ratio of correct assignments to non-neutral assignments) decreases as model size increases. Accuracy on outputs of the deliberately bad control model (an unconditioned GPT-3 Small model with higher output randomness) is indicated by the dashed line at the top, and random chance (50%) by the dashed line at the bottom. The line of best fit is a power law with 95% confidence intervals.
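For clarity on the metric in Figure 2, the sketch below computes the "correct over non-neutral" accuracy, i.e. excluding "can't tell" responses from the denominator. The counts are invented for illustration.

```python
# Accuracy as the ratio of correct assignments to non-neutral assignments.
correct = 52     # articles correctly labeled human- or model-written (hypothetical)
incorrect = 46   # articles labeled incorrectly (hypothetical)
neutral = 12     # "can't tell" responses, excluded from the denominator (hypothetical)

accuracy = correct / (correct + incorrect)
print(f"{accuracy:.1%}")   # ~53%, close to the 50% chance line
```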

Conclusion

GPT-3 sets a benchmark for few-shot learning in LLMs, demonstrating that larger models can achieve adaptability and performance close to fine-tuned systems through in-context learning. This capability signifies a pivotal advancement, potentially reducing the labor-intensive task-specific dataset creation burden in NLP. Future directions will focus on addressing remaining limitations and expanding upon this paradigm for widespread real-world applications.
