Investigating Data Contamination for Pre-training Language Models (2401.06059v1)
Abstract: LLMs pre-trained on web-scale corpora demonstrate impressive capabilities on diverse downstream tasks. However, there is increasing concern about whether such capabilities arise from evaluation datasets being included in the pre-training corpus, a phenomenon known as *data contamination*, in a manner that artificially inflates performance. Little is understood about how this potential contamination influences LLMs' performance on downstream tasks. In this paper, we explore the impact of data contamination at the pre-training stage by pre-training a series of GPT-2 models *from scratch*. We highlight the effect of both text contamination (i.e., the input text of evaluation samples) and ground-truth contamination (i.e., the prompts posed on the inputs together with the desired outputs) from evaluation data. We also investigate the effects of repeated contamination on various downstream tasks. Additionally, we examine the prevailing n-gram-based definitions of contamination used in current LLM reports, pinpointing their limitations and inadequacies. Our findings offer new insights into data contamination's effects on LLM capabilities and underscore the need for independent, comprehensive contamination assessments in LLM studies.
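The n-gram-based contamination definitions that the paper critiques are straightforward to sketch. Below is a minimal illustration, not the paper's own procedure: the function names are hypothetical, and the default n=13 follows the GPT-3 report's 13-gram overlap filter, while other LLM reports vary both n and the match criterion.

```python
import re

def tokenize(text):
    # Lowercased word tokens; real reports typically use
    # model-specific tokenizers or normalized character spans.
    return re.findall(r"\w+", text.lower())

def ngrams(tokens, n):
    # Set of all contiguous n-grams in a token sequence.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(eval_text, train_text, n=13):
    # Flag the evaluation sample if any n-gram also occurs in the
    # training document. n=13 mirrors the GPT-3 report's filter;
    # other reports use different n or fractional-overlap thresholds.
    shared = ngrams(tokenize(eval_text), n) & ngrams(tokenize(train_text), n)
    return len(shared) > 0
```

Definitions of this form capture only verbatim surface overlap, which is one of the inadequacies the paper highlights: they can miss paraphrased or rephrased copies of evaluation samples while flagging benign incidental matches.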
Authors: Minhao Jiang, Ken Ziyu Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang, Jiawei Han, Sanmi Koyejo