Data Curation Alone Can Stabilize In-context Learning (2212.10378v2)

Published 20 Dec 2022 in cs.CL

Abstract: In-context learning (ICL) enables LLMs to perform new tasks by prompting them with a sequence of training examples. However, it is known that ICL is very sensitive to the choice of training examples: randomly sampling examples from a training set leads to high variance in performance. In this paper, we show that carefully curating a subset of training data greatly stabilizes ICL performance without any other changes to the ICL algorithm (e.g., prompt retrieval or calibration). We introduce two methods to choose training subsets -- both score training examples individually, then select the highest-scoring ones. CondAcc scores a training example by its average dev-set ICL accuracy when combined with random training examples, while Datamodels learns linear regressors that estimate how the presence of each training example influences LLM outputs. Across five tasks and two LLMs, sampling from stable subsets selected by CondAcc and Datamodels improves average accuracy over sampling from the entire training set by 7.7% and 6.3%, respectively. Surprisingly, the stable subset examples are not especially diverse in content or low in perplexity, in contrast with other work suggesting that diversity and perplexity are important when prompting LLMs.

Data Curation Alone Can Stabilize In-context Learning

The paper "Data Curation Alone Can Stabilize In-context Learning" by Ting-Yun Chang and Robin Jia, presents methodologies to enhance the stability of In-context Learning (ICL) in LLMs by employing data curation techniques. ICL is employed for few-shot learning by conditioning LLMs on a sequence of labeled training examples, enabling task performance without direct parameter updates.

Sensitivity in ICL and Proposed Solutions

ICL is highly sensitive to the choice of in-context examples: randomly sampling them from the training set leads to large variance in accuracy. Existing remedies rely on prompt retrieval or output calibration, yet this work shows that careful data curation alone can stabilize ICL. Two data selection methods are introduced: CondAcc and Datamodels.

  1. CondAcc: assigns each training example a score equal to the average dev-set ICL accuracy of prompts that pair it with randomly sampled training examples. The scoring is related in spirit to Data Shapley values: an example's contribution is measured by how much it helps, on average, in the context of other examples (see the sketches after this list).
  2. Datamodels: learns linear regressors that estimate how the presence of each training example, at each position in the prompt, influences the LLM's output. Examples whose learned weights are consistently positive across positions are favored.
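
To make the two scoring schemes concrete, here is a minimal CondAcc-style sketch. It assumes a hypothetical `icl_accuracy(demos, dev_set)` helper that runs the LLM with the given in-context demonstrations and returns dev-set accuracy; the shot count and sampling budget are illustrative, not the authors' settings:

```python
import random
from collections import defaultdict

def condacc_scores(train_set, dev_set, icl_accuracy, k=4, num_trials=1000, seed=0):
    """Score each training example (by index) as the average dev-set ICL
    accuracy of random k-shot prompts that include it."""
    rng = random.Random(seed)
    totals, counts = defaultdict(float), defaultdict(int)
    for _ in range(num_trials):
        idxs = rng.sample(range(len(train_set)), k)         # random k-shot prompt
        acc = icl_accuracy([train_set[i] for i in idxs], dev_set)
        for i in idxs:                                       # credit every example used
            totals[i] += acc
            counts[i] += 1
    return {i: totals[i] / counts[i] for i in counts}
```

A similarly simplified Datamodels-style sketch reuses the same prompt/accuracy logs: fit a linear regressor from binary presence indicators to the observed score and rank examples by their learned weights. Ridge regression from scikit-learn is used here as a stand-in for the paper's exact estimator, and position information is ignored for brevity:

```python
import numpy as np
from sklearn.linear_model import Ridge

def datamodel_scores(prompt_logs, num_train, alpha=1.0):
    """prompt_logs: list of (prompt_indices, observed_score) pairs.
    Fits observed_score ~ w . presence_vector and returns per-example weights."""
    X = np.zeros((len(prompt_logs), num_train))
    y = np.zeros(len(prompt_logs))
    for row, (idxs, score) in enumerate(prompt_logs):
        X[row, list(idxs)] = 1.0        # 1 if the example appeared in this prompt
        y[row] = score
    model = Ridge(alpha=alpha).fit(X, y)
    return model.coef_                  # higher weight = more helpful example
```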

Sampling prompts from the resulting high-scoring subsets outperforms both conventional random sampling from the full training set and prompt-retrieval baselines.
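
Under the same assumptions, selection and prompting then reduce to keeping the top-scoring examples and sampling only from that stable subset (the subset size and shot count below are illustrative, not the paper's exact settings):

```python
import random

def stable_subset(train_set, scores, subset_size=20):
    """scores: dict mapping training-example index -> score.
    Returns the highest-scoring examples as the stable subset."""
    top = sorted(scores, key=scores.get, reverse=True)[:subset_size]
    return [train_set[i] for i in top]

def sample_prompt(subset, k=4, seed=None):
    """Build a k-shot prompt by sampling in-context examples from the stable subset only."""
    return random.Random(seed).sample(subset, k)
```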

Empirical Results

The paper evaluates five tasks on two LLMs, GPT-J-6B and OPT-13B. Sampling from the stable subsets selected by CondAcc and Datamodels improves average accuracy by 7.7% and 6.3%, respectively, over sampling from the entire training set. Surprisingly, the stable-subset examples are neither especially diverse in content nor low in perplexity, in contrast with prior work suggesting that diversity and low perplexity matter when prompting LLMs.

Implications and Future Work

The implications are twofold. Practically, curating a fixed stable subset reduces reliance on dynamic prompt retrieval and simplifies deployment in production environments. Theoretically, the work opens a line of inquiry into how curated datasets can consistently yield robust results with LLMs.

In conclusion, the research underscores the influential role of data curation in stabilizing ICL and establishes a foundation for extending such techniques to a broader range of applications, including generative tasks. Subsequent work might explore how curated data interacts with more complex language generation settings, shedding further light on optimizing LLMs for both performance and interpretability.

Authors (2)
  1. Ting-Yun Chang (10 papers)
  2. Robin Jia (59 papers)
Citations (49)