What Makes Good In-Context Examples for GPT-3? (2101.06804v1)

Published 17 Jan 2021 in cs.CL

Abstract: GPT-3 has attracted lots of attention due to its superior performance across a wide range of NLP tasks, especially with its powerful and versatile in-context few-shot learning ability. Despite its success, we found that the empirical results of GPT-3 depend heavily on the choice of in-context examples. In this work, we investigate whether there are more effective strategies for judiciously selecting in-context examples (relative to random sampling) that better leverage GPT-3's few-shot capabilities. Inspired by the recent success of leveraging a retrieval module to augment large-scale neural network models, we propose to retrieve examples that are semantically-similar to a test sample to formulate its corresponding prompt. Intuitively, the in-context examples selected with such a strategy may serve as more informative inputs to unleash GPT-3's extensive knowledge. We evaluate the proposed approach on several natural language understanding and generation benchmarks, where the retrieval-based prompt selection approach consistently outperforms the random baseline. Moreover, it is observed that the sentence encoders fine-tuned on task-related datasets yield even more helpful retrieval results. Notably, significant gains are observed on tasks such as table-to-text generation (41.9% on the ToTTo dataset) and open-domain question answering (45.5% on the NQ dataset). We hope our investigation could help understand the behaviors of GPT-3 and large-scale pre-trained LMs in general and enhance their few-shot capabilities.

What Makes Good In-Context Examples for GPT-3?

The paper "What Makes Good In-Context Examples for GPT-3?" addresses an important aspect of leveraging LLMs, specifically focusing on the impact of in-context examples on the performance of GPT-3. The research scrutinizes how in-context examples, which are crucial for few-shot learning, can be selectively chosen to optimize GPT-3's performance across multiple natural language understanding and generation tasks.

Key Contributions and Findings

  1. Problem Identification: The paper identifies a critical issue concerning the selection of in-context examples in the few-shot learning paradigm of GPT-3. Prior observations indicated that GPT-3’s performance can vary significantly depending on the chosen examples, thereby implying room for improving the in-context example selection process.
  2. Proposed Method - KATE: The authors introduce KATE (k-Nearest Neighbor-Augmented in-conText Example selection), a retrieval-based method that selects in-context examples semantically similar to each test sample, using pre-trained sentence encoders to measure similarity (a minimal sketch of this retrieval step follows the list below).
  3. Empirical Validation: KATE is empirically validated on multiple benchmarks, including:
    • Sentiment analysis (retrieving SST-2 examples for evaluation on IMDB)
    • Table-to-text generation (using the ToTTo dataset)
    • Open-domain question answering (NQ, WQ, and TriviaQA datasets)
  4. Superior Performance: The KATE methodology consistently outperformed random sampling baselines across all evaluated tasks. Notable improvements include:
    • On the SST-2 to IMDB sentiment analysis task, KATE with an encoder fine-tuned on SST-2 achieved an accuracy of 93.43%, substantially higher than the random baseline.
    • For table-to-text generation on the ToTTo dataset, KATE improved BLEU scores to 40.3, compared to 28.4 from random selection.
    • In question answering, KATE's performance gains were significant, reaching an EM score of 41.6 on NQ, compared to 28.6 for random selection.
  5. Impact of Sentence Encoders: The research demonstrated that the effectiveness of KATE improves when sentence encoders are fine-tuned on task-related datasets. For instance, encoders fine-tuned on NLI and STS-B datasets resulted in better retrieval quality and improved GPT-3’s performance.
  6. Ablation Studies: Detailed ablation studies were conducted to analyze the effect of various factors:
    • The number of in-context examples: Performance improved with a higher number of examples.
    • Size of the retrieval training set: Larger retrieval sets led to better performance.
    • Order of in-context examples: The order had a minor impact, indicating robustness to this factor.
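
As referenced in item 2 above, the following is a minimal sketch of KATE's retrieval step, assuming the `sentence-transformers` library. The model name and the helper functions `select_examples` and `build_prompt` are illustrative stand-ins, not the paper's implementation; the paper uses RoBERTa-based encoders, optionally fine-tuned on task-related data such as NLI and STS-B.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Stand-in encoder; swap in a task-fine-tuned encoder for better retrieval,
# as the paper's results on NLI/STS-B fine-tuning suggest.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def select_examples(test_input, train_inputs, train_outputs, k=5):
    """Return the k training (input, output) pairs most similar to test_input."""
    query = encoder.encode([test_input], normalize_embeddings=True)
    pool = encoder.encode(train_inputs, normalize_embeddings=True)
    # With unit-normalized embeddings, cosine similarity is a dot product.
    sims = pool @ query[0]
    top_k = np.argsort(-sims)[:k]  # indices of nearest neighbors, most similar first
    return [(train_inputs[i], train_outputs[i]) for i in top_k]

def build_prompt(test_input, examples):
    """Concatenate retrieved pairs ahead of the test input, few-shot style."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {test_input}\nOutput:")
    return "\n\n".join(blocks)
```

A caller would retrieve examples for each test sample, assemble the prompt with `build_prompt`, and send it to GPT-3 for completion. Since the paper's ablation found example order to have only a minor effect, this sketch simply concatenates the retrieved pairs most-similar-first.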

Implications and Future Directions

Practical Implications

  • Enhanced Few-Shot Learning: By improving the selection of in-context examples, practical applications relying on GPT-3, such as customer support chatbots, recommendation systems, and information retrieval, can achieve more consistent and higher performance without additional fine-tuning.
  • Efficiency: Improving GPT-3's performance through better example selection adds no fine-tuning overhead, which is particularly beneficial for deployments in resource-constrained environments.

Theoretical Implications

  • Understanding Model Behavior: This work contributes to a nuanced understanding of how LLMs like GPT-3 internalize and utilize contextual information. This can pave the way for more refined models that inherently manage context better.
  • Retrieval-augmented Models: The findings bolster the efficacy of retrieval-augmented frameworks, suggesting that integrating such approaches can be beneficial for future generations of LLMs.

Speculative Future Developments

  • Dynamic In-Context Learning: Future models might dynamically select or generate in-context examples as part of their inference process, thus adapting better to the specific characteristics of each test sample.
  • Meta-Learning: Combining KATE with meta-learning paradigms could further enhance adaptability and performance consistency.
  • Interactive Model Training: Interactive systems where a model can query a database for the best context in real-time could further optimize performance, particularly in open-domain tasks.

Conclusion

The research presents a substantial advancement in optimizing GPT-3's few-shot learning capability by refining in-context example selection. By demonstrating the practicality and benefits of the KATE methodology across several benchmarks, the paper provides a robust framework for improving the utility of large-scale pre-trained LLMs without the need for additional fine-tuning, thus opening up new avenues for both practical applications and theoretical explorations.

Authors (6)
  1. Jiachang Liu (12 papers)
  2. Dinghan Shen (34 papers)
  3. Yizhe Zhang (127 papers)
  4. Bill Dolan (45 papers)
  5. Lawrence Carin (203 papers)
  6. Weizhu Chen (128 papers)
Citations (1,185)