It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners (2009.07118v2)

Published 15 Sep 2020 in cs.CL, cs.AI, and cs.LG

Abstract: When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown et al., 2020) achieve remarkable few-shot performance. However, enormous amounts of compute are required for training and applying such big models, resulting in a large carbon footprint and making it difficult for researchers and practitioners to use them. We show that performance similar to GPT-3 can be obtained with language models that are much "greener" in that their parameter count is several orders of magnitude smaller. This is achieved by converting textual inputs into cloze questions that contain a task description, combined with gradient-based optimization; exploiting unlabeled data gives further improvements. We identify key factors required for successful natural language understanding with small language models.

Few-Shot Learning with Small Language Models: A Closer Look

In contemporary NLP, the trend of utilizing large pretrained language models (LMs) has resulted in remarkable performance gains across a wide range of tasks. Several studies, including the influential work of Brown et al. (2020) on GPT-3, underscore the efficacy of massive LMs in few-shot learning scenarios. However, the computational resources and environmental costs associated with such large-scale models pose significant challenges. This paper by Schick and Schütze, titled "It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners," investigates an alternative: the use of significantly smaller LMs for few-shot learning through a technique known as pattern-exploiting training (PET) and its iterative variant, iPET.

Key Contributions

  1. Performance with Smaller LMs: The core assertion of the paper is that smaller LMs, when coupled with the PET method, can achieve performance comparable to GPT-3 on the SuperGLUE benchmark. Specifically, the authors demonstrate that ALBERT-xxlarge-v2 with PET not only reaches but, in some cases, surpasses GPT-3's accuracy, despite having roughly three orders of magnitude fewer parameters (about 223M versus 175B).
  2. Pattern-Exploiting Training (PET): The PET framework reformulates NLP tasks as cloze-style questions, enabling effective use of smaller LMs. The authors primarily work with masked language models (MLMs) such as ALBERT and RoBERTa, and also evaluate the unidirectional GPT-2 for comparison. Key to PET's success is its ability to combine multiple pattern-verbalizer pairs (PVPs), which makes it robust to suboptimal task formulations (a minimal PVP sketch appears after this list).
  3. Iterative Improvement (iPET): iPET, an iterative extension of PET, further improves performance by progressively enlarging the training set with pseudo-labeled data. Training proceeds over multiple generations in which each generation's models label additional unlabeled examples for the next, yielding consistent gains over a single round of PET (a schematic of this loop also follows the list).
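
To make the PVP idea concrete, the following is a minimal sketch of a single pattern-verbalizer pair for an entailment-style task. It assumes the Hugging Face transformers library and uses bert-base-uncased as a small, runnable stand-in for the paper's ALBERT-xxlarge-v2; the pattern string and the yes/no verbalizer are illustrative choices, not the paper's exact ones.

```python
# A single PVP: the pattern turns an input pair into a cloze question with one
# mask slot, and the verbalizer maps each label to one vocabulary token. The
# MLM's probability of each verbalizer token at the mask position scores the labels.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # illustrative stand-in for ALBERT-xxlarge-v2
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

# Verbalizer: label -> single token (assumed to exist in the vocabulary).
VERBALIZER = {"entailment": "yes", "not_entailment": "no"}

def pattern(premise: str, hypothesis: str) -> str:
    # Pattern: reformulate the pair as a cloze question containing one mask token.
    return f"{premise}? {tokenizer.mask_token}, {hypothesis}"

def score_labels(premise: str, hypothesis: str) -> dict:
    inputs = tokenizer(pattern(premise, hypothesis), return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos.item()]
    probs = logits.softmax(dim=-1)
    return {label: probs[tokenizer.convert_tokens_to_ids(token)].item()
            for label, token in VERBALIZER.items()}

print(score_labels("A man is playing a guitar", "Someone is making music."))
```

In PET proper, one such model is fine-tuned per PVP on the few labeled examples, and the resulting ensemble is combined, which is what provides robustness against any single badly chosen pattern.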

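The iterative procedure can be summarized as a schematic loop, sketched below. Here train_fn and predict_fn are hypothetical callables supplied by the caller (for example, wrappers around the PVP scoring above); the generation count and growth factor are illustrative, and the label-distribution balancing used in the paper is omitted for brevity.

```python
# Schematic iPET loop: each generation trains an ensemble, then the ensemble
# pseudo-labels unlabeled texts to enlarge the training sets of the next generation.
def ipet(labeled, unlabeled_texts, train_fn, predict_fn,
         num_generations=3, num_models=3, growth=5.0):
    # labeled: list of (text, label) pairs; unlabeled_texts: list of texts.
    # train_fn(examples) -> model; predict_fn(model, text) -> {label: prob}.
    train_sets = [list(labeled) for _ in range(num_models)]
    models = []
    for gen in range(num_generations):
        models = [train_fn(train_sets[i]) for i in range(num_models)]
        target_size = int(len(labeled) * growth ** (gen + 1))
        next_sets = []
        for i in range(num_models):
            # Pseudo-label with the *other* models so no model trains on its own guesses.
            others = [m for j, m in enumerate(models) if j != i]
            scored = []
            for text in unlabeled_texts:
                probs = [predict_fn(m, text) for m in others]
                avg = {k: sum(p[k] for p in probs) / len(probs) for k in probs[0]}
                label, conf = max(avg.items(), key=lambda kv: kv[1])
                scored.append((conf, text, label))
            # Keep the most confident pseudo-labels alongside the original data.
            scored.sort(key=lambda t: t[0], reverse=True)
            keep = max(0, target_size - len(labeled))
            next_sets.append(list(labeled) + [(t, l) for _, t, l in scored[:keep]])
        train_sets = next_sets
    return models
```
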
Strong Numerical Results

The paper presents empirical evidence supporting its claims through extensive experimentation on the SuperGLUE tasks. Notably:

  • SuperGLUE Performance: PET with ALBERT achieves an average performance close to or exceeding GPT-3 on several tasks, such as CB (92.4 F1), RTE (74.0% accuracy), and MultiRC (77.3 F1).
  • Comparison with Other Models: The paper includes comparative analyses showing that PET with ALBERT outperforms the smaller GPT-3 variants (e.g., GPT-3 Med and GPT-3 Large) by 18 points on average.

Practical and Theoretical Implications

  1. Scalability and Accessibility: The findings have substantial practical implications. Smaller LMs with PET can democratize advanced NLP capabilities by making strong few-shot performance accessible to researchers and practitioners with limited computational resources. The environmental benefits of this approach also align with the goals of Green AI by reducing the carbon footprint of training and inference.
  2. Flexibility in Task Reformulation: PET's ability to work with diverse PVPs underscores the importance of flexible task representations in NLP workflows. The method's resilience to less optimal formulations is particularly beneficial in few-shot settings where extensive optimization is infeasible.
  3. Unlabeled Data Utilization: The paper emphasizes the strategic use of unlabeled data to augment few-shot learning performance, distilling the predictions of PVP-trained models into a final classifier (sketched after this list). This is crucial in real-world applications where labeled data is scarce but unlabeled data is readily available.

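PET exploits unlabeled data through a distillation step: the ensemble of PVP-fine-tuned models assigns soft labels to unlabeled examples, and a single classifier with a standard sequence-classification head is then trained on them. The sketch below shows only that training step in PyTorch, treating the classifier, its optimizer, and the ensemble's soft labels as given; any temperature scaling of the soft labels is omitted.

```python
# One distillation step on pseudo-labeled data: train the final classifier to
# match the soft label distribution produced by the PVP ensemble.
import torch
import torch.nn.functional as F

def distill_step(classifier, optimizer, batch_inputs, soft_labels):
    # batch_inputs: whatever the classifier expects (e.g., token id tensors).
    # soft_labels:  [batch, num_labels] probabilities from the PVP ensemble.
    logits = classifier(batch_inputs)                      # [batch, num_labels]
    log_probs = F.log_softmax(logits, dim=-1)
    loss = -(soft_labels * log_probs).sum(dim=-1).mean()   # soft cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
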
Future Directions

The investigation opens several avenues for future research:

  • Generative Extensions: Applying PET to generative LMs and exploring its efficacy in tasks beyond classification could be fruitful.
  • Multi-task Learning: Extending PET to multi-task training paradigms, where a single model is fine-tuned across multiple datasets, could yield better cross-task generalization.
  • Alternative Architectures: Examining the integration of PET with more recently proposed Transformer variants optimized for longer context processing, such as Reformer or Longformer, could address the context length limitations in few-shot learning scenarios.

Conclusion

This research contributes meaningfully to our understanding of how small language models can be effectively leveraged in few-shot learning scenarios. By coupling ALBERT and similar models with training techniques like PET and iPET, the authors demonstrate substantial advances in NLP performance without relying on excessively large models. This work not only enhances the practicality and environmental sustainability of NLP models but also lays a foundation for future work on more efficient and accessible AI technologies.

Authors (2)
  1. Timo Schick (31 papers)
  2. Hinrich Schütze (250 papers)
Citations (909)