True Few-Shot Learning with Language Models (2105.11447v1)

Published 24 May 2021 in cs.CL, cs.LG, and stat.ML

Abstract: Pretrained language models (LMs) perform well on many tasks even when learning from a few examples, but prior work uses many held-out examples to tune various aspects of learning, such as hyperparameters, training objectives, and natural language templates ("prompts"). Here, we evaluate the few-shot ability of LMs when such held-out examples are unavailable, a setting we call true few-shot learning. We test two model selection criteria, cross-validation and minimum description length, for choosing LM prompts and hyperparameters in the true few-shot setting. On average, both marginally outperform random selection and greatly underperform selection based on held-out examples. Moreover, selection criteria often prefer models that perform significantly worse than randomly-selected ones. We find similar results even when taking into account our uncertainty in a model's true performance during selection, as well as when varying the amount of computation and number of examples used for selection. Overall, our findings suggest that prior work significantly overestimated the true few-shot ability of LMs given the difficulty of few-shot model selection.

True Few-Shot Learning with Language Models

The paper addresses a critical but often overlooked aspect of few-shot learning with pretrained language models: the challenge of model selection when held-out examples are unavailable. Authored by Ethan Perez, Douwe Kiela, and Kyunghyun Cho, the paper introduces the term “true few-shot learning” for a setting in which the few-shot ability of language models (LMs) is evaluated without recourse to external validation data.

Framework and Methodology

In traditional few-shot learning, language models are expected to perform well on tasks given only a handful of labeled examples. However, previous studies have often relied on large validation sets to tune hyperparameters and prompts, a practice that, the authors argue, inflates the perceived efficacy of LMs. The paper distinguishes between several few-shot settings: multi-distribution few-shot learning, tuned few-shot learning, and true few-shot learning. The authors focus on true few-shot learning, where model selection relies solely on the examples provided, reflecting a genuinely low-data scenario.
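
The contrast between the settings can be written schematically. The notation below (a candidate configuration A drawn from a search space, a loss L, and training and validation sets) is our own shorthand rather than a verbatim reproduction of the paper's definitions.

```latex
% Tuned few-shot learning: prompts and hyperparameters are chosen against a
% large held-out validation set.
\hat{A}_{\mathrm{tuned}}
  = \operatorname*{arg\,min}_{A \in \mathcal{A}}
    L\bigl(A(D_{\mathrm{train}}),\, D_{\mathrm{val}}\bigr),
  \qquad |D_{\mathrm{val}}| \gg |D_{\mathrm{train}}|

% True few-shot learning: selection must rely on the training examples alone,
% via an estimate \widehat{L} computed from D_train itself (e.g. CV or MDL).
\hat{A}_{\mathrm{true}}
  = \operatorname*{arg\,min}_{A \in \mathcal{A}}
    \widehat{L}\bigl(A,\, D_{\mathrm{train}}\bigr)
```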

The paper evaluates two standard model selection criteria, cross-validation (CV) and minimum description length (MDL), in the true few-shot setting. Both criteria improve only marginally over random selection and significantly underperform strategies that use held-out validation examples. The authors analyze these criteria across a range of configurations, varying both the amount of computation and the number of examples used for selection; a simplified sketch of both criteria appears below.
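
The following Python sketch shows how CV and MDL prompt selection can operate on the few labeled examples alone. It is a minimal illustration under our own assumptions: the `Scorer` callable, the function names, and the fold handling are hypothetical stand-ins for querying an LM's conditional log-likelihood of a label given a prompt and in-context examples; this is not the authors' released code.

```python
from typing import Callable, Sequence, Tuple

# One few-shot example: (input text, target text).
Example = Tuple[str, str]

# Hypothetical scorer: returns the LM's log-probability of the example's target,
# conditioned on the prompt template and the in-context examples. A real
# implementation (e.g. reading GPT-3 logprobs) would sit behind this interface.
Scorer = Callable[[str, Sequence[Example], Example], float]


def cv_score(prompt: str, examples: Sequence[Example],
             scorer: Scorer, num_folds: int) -> float:
    """K-fold cross-validation: condition on all folds but one and score the
    held-out fold. With num_folds == len(examples) this is leave-one-out CV."""
    n = len(examples)
    fold_size = max(1, n // num_folds)
    total = 0.0
    for start in range(0, n, fold_size):
        held_out = examples[start:start + fold_size]
        context = list(examples[:start]) + list(examples[start + fold_size:])
        total += sum(scorer(prompt, context, ex) for ex in held_out)
    return total  # higher (less negative log-likelihood) is better


def mdl_score(prompt: str, examples: Sequence[Example], scorer: Scorer) -> float:
    """Prequential (online) coding: score each example given only the earlier
    ones. The negated total is the description length of the labels."""
    return sum(scorer(prompt, examples[:i], ex) for i, ex in enumerate(examples))


def select_prompt(prompts: Sequence[str], examples: Sequence[Example],
                  scorer: Scorer, criterion: str = "cv",
                  num_folds: int = 5) -> str:
    """Choose a prompt using only the few-shot examples (no held-out set)."""
    if criterion == "cv":
        return max(prompts, key=lambda p: cv_score(p, examples, scorer, num_folds))
    return max(prompts, key=lambda p: mdl_score(p, examples, scorer))
```

Because the online (prequential) code length depends on the order in which examples are presented, the paper averages MDL over several random orderings; that detail is omitted here for brevity.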

Key Findings

A crucial insight from the paper is that current few-shot results substantially overestimate the performance of language models because model selection leverages additional validation data. In the true few-shot setting, CV and MDL often falter, frequently choosing prompts that perform worse than prompts selected at random. This outcome highlights a fundamental bottleneck in true few-shot model selection and exposes a gap between reported LM capabilities and practical, real-world applications, especially in low-resource environments.

Empirical results suggest that prompt selection does not become more reliable as model size grows, and in some cases becomes less so, a finding with significant ramifications for model scaling. The authors present extensive experiments, including evaluations on LAMA and several SuperGLUE tasks, reinforcing their claim that model selection is a significant challenge across a variety of tasks and data settings.

Implications and Future Directions

The implications of this research are manifold. On a theoretical level, it challenges the community to reconsider the benchmarks and metrics used to assess few-shot learning performance. Practically, it suggests a need for more robust model selection methods that are effective in low-data scenarios. Additionally, the findings prompt questions about the scalability of LMs and the feasibility of deploying them in true low-resource settings without relying on auxiliary validation data.

Future work could focus on developing novel selection criteria or algorithms that bypass the need for large validation sets while still achieving competitive performance. Moreover, researchers might explore alternative approaches, such as leveraging meta-learning, transfer learning, or unsupervised learning techniques, to enhance few-shot capabilities in true low-data conditions.

In conclusion, this paper serves as a crucial reminder of the complexities hidden within model selection for few-shot learning. It invites reflection not only on the effectiveness of current methodologies but also on the broader goals of the machine learning community in creating systems that can operate within the genuine constraints of real-world data scarcity.

Authors (3)
  1. Ethan Perez (55 papers)
  2. Douwe Kiela (85 papers)
  3. Kyunghyun Cho (292 papers)
Citations (401)