Investigating Data Contamination for Pre-training Language Models (2401.06059v1)

Published 11 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs pre-trained on web-scale corpora demonstrate impressive capabilities on diverse downstream tasks. However, there is increasing concern whether such capabilities might arise from evaluation datasets being included in the pre-training corpus -- a phenomenon known as data contamination -- in a manner that artificially increases performance. There has been little understanding of how this potential contamination might influence LMs' performance on downstream tasks. In this paper, we explore the impact of data contamination at the pre-training stage by pre-training a series of GPT-2 models from scratch. We highlight the effect of both text contamination (i.e., the input text of the evaluation samples) and ground-truth contamination (i.e., the prompts asked on the input and the desired outputs) from evaluation data. We also investigate the effects of repeating contamination for various downstream tasks. Additionally, we examine the prevailing n-gram-based definitions of contamination within current LLM reports, pinpointing their limitations and inadequacy. Our findings offer new insights into data contamination's effects on LLM capabilities and underscore the need for independent, comprehensive contamination assessments in LLM studies.

Introduction

LLMs pre-trained on web-scale corpora show remarkable performance across a wide range of tasks. While sophisticated algorithms and expansive training sets account for much of this success, data contamination, the inclusion of evaluation data in the training set, has raised growing concern. Assessing its impact is central to understanding the effectiveness and integrity of these models.

Contamination Implications

Recent observations point to the possibility that training data contains slices of the very datasets used to evaluate LLMs. The presence of such "contaminated" data can skew results and mislead us about a model's true capabilities. The paper carefully distinguishes between text contamination, where the input texts of evaluation samples appear in the training set, and ground-truth contamination, where the evaluation prompts and their desired outputs are included as well. Understanding these distinctions and their effects is critical for evaluating the true performance of LLMs.
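To make the distinction concrete, here is a minimal sketch of how the two contamination types could be constructed from a single evaluation sample. The example sentence, prompt wording, and formatting are illustrative assumptions, not the authors' exact templates.

```python
# Minimal sketch of the two contamination types discussed above.
# The evaluation sample and the prompt/answer formatting are illustrative,
# not the paper's exact setup.

eval_sample = {
    "text": "A warm, funny, engaging film.",  # input text of the evaluation sample
    "prompt": "What is the sentiment of the following review?",
    "label": "positive",                      # desired output
}

def text_contamination(sample: dict) -> str:
    """Only the raw input text of the evaluation sample leaks into the pre-training data."""
    return sample["text"]

def ground_truth_contamination(sample: dict) -> str:
    """The prompt, the input text, and the desired output all leak together."""
    return f"{sample['prompt']}\n{sample['text']}\nAnswer: {sample['label']}"

# Strings like these would be injected into the pre-training corpus
# (optionally repeated several times) before training a model from scratch.
print(text_contamination(eval_sample))
print(ground_truth_contamination(eval_sample))
```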

Experimental Approach

The authors pre-train a series of GPT-2 models from scratch with careful control over the level of data contamination, varying both the form of contamination and how often contaminated samples are repeated in the corpus, in order to assess its impact comprehensively. They also scrutinize the n-gram-based contamination definitions common in existing LLM reports, revealing that such definitions can be inadequate for contamination detection and model assessment.
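For reference, the n-gram-based definitions under scrutiny roughly flag an evaluation sample as contaminated when it shares a sufficiently long n-gram with the training corpus. The sketch below is a simplified illustration under assumed choices (whitespace tokenization, n = 8), not any specific report's exact procedure.

```python
# A minimal sketch of the kind of n-gram overlap check that LLM reports
# commonly use to flag contamination. The choice n = 8 and the whitespace
# tokenization are illustrative assumptions.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams of a text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(eval_text: str, training_docs: list[str], n: int = 8) -> bool:
    """Flag an evaluation sample if any of its n-grams appears in any training document."""
    eval_ngrams = ngrams(eval_text, n)
    return any(eval_ngrams & ngrams(doc, n) for doc in training_docs)

# Example: an exact copy in the corpus is flagged, while a loose paraphrase
# with no shared 8-gram is not.
corpus = ["A warm, funny, engaging film. It is a delight from start to finish."]
print(is_contaminated("A warm, funny, engaging film. It is a delight from start to finish.", corpus))  # True
print(is_contaminated("The movie is delightful, amusing, and heartfelt throughout its entire runtime.", corpus))  # False
```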

Findings and Recommendations

The findings of this paper are revealing. Data contamination, especially when it includes the ground truths, can boost model performance significantly more than text contamination alone, complicating the picture of how data purity shapes evaluation results. Notably, when contaminated samples are repeated, performance does not improve monotonically: it rises, peaks, and then declines as contamination is repeated further, suggesting a nuanced relationship between performance and contamination frequency.

In conclusion, this research calls attention to the need for refined contamination definitions and more robust assessment methodologies. The paper's insights warn of the risks of data contamination and encourage more stringent controls and greater transparency in LLM training and evaluation.

Acknowledgements

Finally, the authors acknowledge the supporters of this research, including DARPA, the National Science Foundation, Google Inc., and the Alfred P. Sloan Foundation. Without such support, this investigation into data contamination and its effects on LLMs would not have been possible.

Authors (7)
  1. Minhao Jiang
  2. Ken Ziyu Liu
  3. Ming Zhong
  4. Rylan Schaeffer
  5. Siru Ouyang
  6. Jiawei Han
  7. Sanmi Koyejo