Understanding In-Context Learning via Supportive Pretraining Data (2306.15091v1)
Abstract: In-context learning (ICL) improves LLMs' performance on a variety of NLP tasks by simply demonstrating a handful of examples at inference time. It is not well understood why ICL ability emerges, as the model has never been specifically trained on such demonstrations. Unlike prior work that explores implicit mechanisms behind ICL, we study ICL by investigating the pretraining data. Specifically, we first adapt an iterative, gradient-based approach to find a small subset of pretraining data that supports ICL. We observe that continued pretraining on this small subset significantly improves the model's ICL ability, by up to 18%. We then compare the supportive subset contrastively with random subsets of pretraining data and discover: (1) The supportive pretraining data for ICL do not have higher domain relevance to the downstream tasks. (2) The supportive pretraining data have a higher mass of rarely occurring, long-tail tokens. (3) The supportive pretraining data are challenging examples where the information gain from long-range context is below average, indicating that learning to incorporate difficult long-range context encourages ICL. Our work takes a first step towards understanding ICL by analyzing instance-level pretraining data. Our insights have the potential to enhance the ICL ability of LLMs by actively guiding the construction of pretraining data in the future.
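The selection procedure described in the abstract can be pictured with a small sketch. The following is a minimal, hypothetical illustration (not the authors' released code) of scoring candidate pretraining examples by the cosine similarity between their loss gradients and the gradient of an ICL-style objective, which is the core idea behind the iterative, gradient-based selection above. The toy model, random data, and single scoring round are placeholder assumptions; the paper works with OPT models and their original pretraining corpus.

```python
# Minimal sketch of gradient-similarity scoring of pretraining examples against an
# ICL objective. Everything below is a toy stand-in, not the paper's actual setup.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy language-model stand-in: embedding layer + linear head over a small vocab.
vocab, dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
loss_fn = nn.CrossEntropyLoss()

def grad_vector(inputs, targets):
    """Flattened gradient of the LM loss on one example w.r.t. all parameters."""
    model.zero_grad()
    logits = model(inputs)          # (seq_len, vocab)
    loss = loss_fn(logits, targets)
    loss.backward()
    return torch.cat([p.grad.reshape(-1) for p in model.parameters()])

# Gradient of the ICL "task" loss (here: one random demonstration-style sequence).
icl_inp, icl_tgt = torch.randint(0, vocab, (16,)), torch.randint(0, vocab, (16,))
g_task = grad_vector(icl_inp, icl_tgt)

# Score each candidate pretraining example by the cosine similarity of its gradient
# to the task gradient; higher scores = more "supportive" of ICL under this proxy.
scores = []
for _ in range(50):  # 50 random candidate pretraining sequences
    inp, tgt = torch.randint(0, vocab, (16,)), torch.randint(0, vocab, (16,))
    g = grad_vector(inp, tgt)
    scores.append(torch.cosine_similarity(g_task, g, dim=0).item())

topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:5]
print("indices of the most supportive candidates:", topk)
# In the paper's full procedure, selection like this is repeated iteratively,
# with continued pretraining on the selected subset between rounds.
```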
- Xiaochuang Han
- Daniel Simig
- Todor Mihaylov
- Yulia Tsvetkov
- Asli Celikyilmaz
- Tianlu Wang