Improving Few-Shot Performance of LLMs through Contextual Calibration
The paper "Calibrate Before Use: Improving Few-Shot Performance of LLMs" by Tony Z. Zhao et al. offers a meticulous analysis of the instability in few-shot learning with LLMs (LMs) such as GPT-3. The key argument presented is the inherent volatility of few-shot learning performance due to biases in LMs. The researchers propose a novel method—contextual calibration—to address these issues, providing evidence of its efficacy through empirical results.
Overview and Key Findings
Few-shot learning with LMs, particularly without finetuning, relies heavily on the design of natural language prompts that embed a small number of training examples within the prompt itself. This approach, however, exhibits significant variability in performance, influenced by the choice of training examples, their order, and the format of the prompt (a hypothetical prompt of this kind is sketched after the list below). The researchers identified three primary biases contributing to this instability:
- Majority Label Bias: Predominance of certain answers due to their frequency in prompts.
- Recency Bias: Propensity towards predicting answers appearing near the end of the prompt.
- Common Token Bias: Inclination towards frequent tokens from pre-training data.
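To make the first two biases concrete, the snippet below constructs a hypothetical 4-shot prompt in the Input/Sentiment format the paper uses for sentiment classification; the demonstration sentences, their labels, and the test sentence are invented for illustration.

```python
# Hypothetical 4-shot prompt in an Input/Sentiment format (SST-2 style).
# Three of the four demonstrations are Positive (majority label bias), and the
# Positive examples all appear last (recency bias), so an uncalibrated model is
# already skewed toward answering "Positive" before the test input is considered.
demonstrations = [
    ("subpar acting.", "Negative"),
    ("a beautiful film.", "Positive"),
    ("an instant classic.", "Positive"),
    ("the plot really shines.", "Positive"),
]
test_input = "the acting was bland."

prompt = "".join(f"Input: {x}\nSentiment: {y}\n\n" for x, y in demonstrations)
prompt += f"Input: {test_input}\nSentiment:"
print(prompt)
```

Reordering or rebalancing these demonstrations can change the prediction even though the task is unchanged, which is exactly the prompt sensitivity the authors measure.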
Methodological Innovation: Contextual Calibration
To mitigate these identified biases, the authors introduced "contextual calibration," a technique whereby they estimate the model's bias towards each answer using a content-free input (e.g., "N/A"). The calibration process involves:
- Estimating Model Bias: Utilizing a content-free input to determine the model’s baseline predictions.
- Calibrating Output Probabilities: Fitting an affine transformation of the output probabilities so that the content-free input receives a uniform prediction across the possible answers, then applying the same transformation to predictions on real inputs (a minimal sketch follows below).
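The following is a minimal sketch of the procedure, assuming a hypothetical helper `label_probs_fn` that appends an input to the few-shot prompt and returns the model's renormalized probabilities over the answer labels; the choices W = diag(p_cf)^-1 and b = 0 follow the paper's description of the affine correction.

```python
import numpy as np

def contextual_calibration(label_probs_fn, content_free_inputs=("N/A", "", "[MASK]")):
    """Estimate the model's bias from content-free inputs and return a function
    that applies the calibrating affine transformation to new predictions."""
    # 1. Estimate the bias: average the label probabilities the model assigns
    #    when a content-free string stands in for the test input.
    p_cf = np.mean([label_probs_fn(x) for x in content_free_inputs], axis=0)
    p_cf = p_cf / p_cf.sum()

    # 2. Build the affine correction q = W p + b with W = diag(p_cf)^-1 and b = 0,
    #    chosen so that the content-free input maps to a uniform prediction.
    W = np.diag(1.0 / p_cf)

    def calibrate(p):
        q = W @ np.asarray(p)  # apply the affine transformation (b is the zero vector)
        return q / q.sum()     # renormalize into calibrated probabilities

    return calibrate

# Usage (with a hypothetical model wrapper):
#   calibrate = contextual_calibration(label_probs_fn)
#   q = calibrate(label_probs_fn("the acting was bland."))
#   prediction = labels[q.argmax()]
```

Because the transformation is estimated from the same prompt used at test time, it adapts to whatever combination of example choice, ordering, and format that prompt happens to use, without requiring any additional labeled data.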
Empirically, this calibration process resulted in a substantial improvement in average accuracy (up to 30.0% absolute) and reduced variance across different prompt configurations for both GPT-3 and GPT-2 models.
Experimental Results and Analysis
The experiments spanned diverse tasks, encompassing text classification, fact retrieval, and information extraction across multiple datasets. Notable results included:
- AGNews: GPT-3 2.7B accuracy improved from 33.0% to 59.6% for 1-shot learning.
- SST-2 Sentiment Analysis: The accuracy increased from 67.3% to 79.1% for 1-shot learning and from 93.3% to 94.7% for GPT-3 175B.
- LAMA Fact Retrieval: GPT-3 2.7B showed an accuracy boost from 14.0% to 22.7% in the zero-shot setting.
These findings underscore the robustness of contextual calibration across tasks and its potential to significantly enhance LM performance in few-shot scenarios.
Implications and Future Research Directions
The implications of this research are multifaceted. Practically, contextual calibration reduces the need for extensive prompt engineering, making it easier for practitioners to achieve high performance with fewer resources. Theoretically, this work suggests that LM biases can be systematically corrected to enhance model reliability and performance.
Future research could explore the interplay between contextual calibration and finetuning, potentially merging the strengths of both techniques. Additionally, extending the calibration approach to more diverse tasks and exploring its impact on tasks involving open-ended generation remains a promising avenue.
Concluding Remarks
The proposed contextual calibration method addresses a critical challenge in deploying LMs for few-shot learning tasks. By rigorously analyzing the biases inherent in LMs and presenting a practical, lightweight remedy, this research advances our understanding of model behavior and paves the way for more reliable AI applications. The substantial numerical improvements reported in the paper demonstrate the efficacy of the approach and its potential to change how LLMs are used in practice.