Data Curation Alone Can Stabilize In-context Learning
The paper "Data Curation Alone Can Stabilize In-context Learning" by Ting-Yun Chang and Robin Jia presents methodologies to enhance the stability of In-context Learning (ICL) in LLMs through data curation. ICL performs few-shot learning by conditioning an LLM on a sequence of labeled training examples, enabling the model to carry out a task without any parameter updates.
Sensitivity in ICL and Proposed Solutions
ICL is highly sensitive to the selection of training examples: randomly chosen examples can lead to widely variable performance. Existing strategies to address this involve prompt retrieval and calibration, yet this work shows that careful data curation alone can stabilize ICL. Two data-selection methods are introduced: CondAcc and Datamodels.
- CondAcc: This method assigns each training example a score equal to its average dev-set ICL accuracy when combined with random other training samples. The scoring is inspired by Data Shapley values: an example's contribution is measured by how it performs across varying in-context company.
- Datamodels: Linear regressors are trained to predict the influence each training example has on the LLM's outputs. An example's presence is assessed at different positions (indices) within the prompt, and examples are favored if they contribute positively across positions and dev examples.
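The CondAcc scoring above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: `eval_prompt_acc` is a hypothetical callback that evaluates one k-shot prompt on the dev set and returns its ICL accuracy, and the prompt count and seed are arbitrary.

```python
import random
from collections import defaultdict

def condacc_scores(train_ids, eval_prompt_acc, k=4, n_prompts=50, seed=0):
    """Sketch of CondAcc: score each training example by the average
    dev-set accuracy of the random k-shot prompts that contain it.
    eval_prompt_acc is a hypothetical callback (prompt -> dev accuracy)."""
    rng = random.Random(seed)
    totals, counts = defaultdict(float), defaultdict(int)
    for _ in range(n_prompts):
        prompt = rng.sample(train_ids, k)  # random k-shot prompt
        acc = eval_prompt_acc(prompt)      # dev-set ICL accuracy of this prompt
        for ex in prompt:                  # credit every example in the prompt
            totals[ex] += acc
            counts[ex] += 1
    return {ex: totals[ex] / counts[ex] for ex in totals}
```

Examples with the highest scores would then form the curated subset used to build prompts.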
Both methods select higher-performing examples than conventional random sampling or prompt-retrieval strategies.
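A minimal sketch of the Datamodels idea, under simplifying assumptions: here a single least-squares regressor maps a binary presence vector (which training examples appear in a prompt) to the observed dev accuracy, and the learned per-example weights serve as influence estimates. The paper's actual method predicts LLM outputs and distinguishes the position of each example in the prompt; the function name is illustrative.

```python
import numpy as np

def datamodel_weights(presence, accs):
    """Sketch of a datamodel: fit a linear regressor from prompt
    composition to dev accuracy.
    presence: (n_prompts, n_train) 0/1 matrix marking which training
    examples appear in each sampled prompt; accs: observed accuracies.
    Returns one influence weight per training example."""
    X = np.hstack([presence, np.ones((presence.shape[0], 1))])  # add bias column
    w, *_ = np.linalg.lstsq(X, accs, rcond=None)                # least-squares fit
    return w[:-1]  # drop the bias term; keep per-example weights
```

Training examples whose weights are consistently positive would be kept for the curated subset.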
Empirical Results
The paper evaluates five tasks on two LLMs, GPT-J-6B and OPT-13B. The stable subsets selected via CondAcc and Datamodels yield a significant boost in average accuracy, improving by 7.7% and 6.3% respectively, compared to sampling from the entire training set. The findings also contradict prior assertions that diversity and low perplexity are essential for effective prompts.
Implications and Future Work
The implications of this research are twofold. Practically, removing the reliance on dynamic prompt retrieval simplifies deployment in production environments. Theoretically, the paper opens further exploration into how curated datasets can consistently yield robust results with LLMs.
In conclusion, the research underscores the influential role of data curation in stabilizing ICL and establishes a foundation for extending such techniques to a broader range of applications, including generative tasks. Subsequent investigations might explore the intersection of curated data and more complex language-generation scenarios, shedding further light on optimizing LLMs for both performance and interpretability.