Investigating Data Contamination for Pre-training Language Models (2401.06059v1)

Published 11 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs pre-trained on web-scale corpora demonstrate impressive capabilities on diverse downstream tasks. However, there is increasing concern whether such capabilities might arise from evaluation datasets being included in the pre-training corpus -- a phenomenon known as data contamination -- in a manner that artificially increases performance. There has been little understanding of how this potential contamination might influence LMs' performance on downstream tasks. In this paper, we explore the impact of data contamination at the pre-training stage by pre-training a series of GPT-2 models from scratch. We highlight the effect of both text contamination (i.e., the input text of the evaluation samples) and ground-truth contamination (i.e., the prompts asked on the input and the desired outputs) from evaluation data. We also investigate the effects of repeating contamination for various downstream tasks. Additionally, we examine the prevailing n-gram-based definitions of contamination within current LLM reports, pinpointing their limitations and inadequacy. Our findings offer new insights into data contamination's effects on LLM capabilities and underscore the need for independent, comprehensive contamination assessments in LLM studies.

Introduction

LLMs pre-trained on web-scale corpora show remarkable performance across a wide range of tasks. While sophisticated algorithms and expansive training sets account for much of this success, data contamination, the inclusion of evaluation data in the training set, has raised growing concern. Assessing its impact is central to understanding the effectiveness and integrity of these models.

Contamination Implications

Recent observations point to the possibility that training data contains slices of the very datasets used to evaluate LLMs. The presence of such "contaminated" data can skew results and mislead us about a model's true capabilities. The paper carefully distinguishes between text contamination, where the input texts of evaluation samples appear in the training set, and ground-truth contamination, where the evaluation prompts and their desired outputs are included as well. Understanding these distinctions and their effects is critical for evaluating the true performance of LLMs.
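To make the distinction concrete, here is a minimal sketch of how the two contamination types could be constructed from a single evaluation sample. The example sentence, prompt wording, and formatting are illustrative assumptions, not the authors' exact templates.

```python
# Minimal sketch of the two contamination types discussed above.
# The evaluation sample and the prompt/answer formatting are illustrative,
# not the paper's exact setup.

eval_sample = {
    "text": "A warm, funny, engaging film.",  # input text of the evaluation sample
    "prompt": "What is the sentiment of the following review?",
    "label": "positive",                      # desired output
}

def text_contamination(sample: dict) -> str:
    """Only the raw input text of the evaluation sample leaks into the pre-training data."""
    return sample["text"]

def ground_truth_contamination(sample: dict) -> str:
    """The prompt, the input text, and the desired output all leak together."""
    return f"{sample['prompt']}\n{sample['text']}\nAnswer: {sample['label']}"

# Strings like these would be injected into the pre-training corpus
# (optionally repeated several times) before training a model from scratch.
print(text_contamination(eval_sample))
print(ground_truth_contamination(eval_sample))
```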

Experimental Approach

The authors pre-train a series of GPT-2 models from scratch with careful control over the level of data contamination, varying both the form of contamination and how often contaminated samples are repeated in the corpus, in order to assess its impact comprehensively. They also scrutinize the n-gram-based contamination definitions common in existing LLM reports, revealing that such definitions can be inadequate for contamination detection and model assessment.
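For reference, the n-gram-based definitions under scrutiny roughly flag an evaluation sample as contaminated when it shares a sufficiently long n-gram with the training corpus. The sketch below is a simplified illustration under assumed choices (whitespace tokenization, n = 8), not any specific report's exact procedure.

```python
# A minimal sketch of the kind of n-gram overlap check that LLM reports
# commonly use to flag contamination. The choice n = 8 and the whitespace
# tokenization are illustrative assumptions.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams of a text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(eval_text: str, training_docs: list[str], n: int = 8) -> bool:
    """Flag an evaluation sample if any of its n-grams appears in any training document."""
    eval_ngrams = ngrams(eval_text, n)
    return any(eval_ngrams & ngrams(doc, n) for doc in training_docs)

# Example: an exact copy in the corpus is flagged, while a loose paraphrase
# with no shared 8-gram is not.
corpus = ["A warm, funny, engaging film. It is a delight from start to finish."]
print(is_contaminated("A warm, funny, engaging film. It is a delight from start to finish.", corpus))  # True
print(is_contaminated("The movie is delightful, amusing, and heartfelt throughout its entire runtime.", corpus))  # False
```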

Findings and Recommendations

The findings of this paper are revealing. Data contamination, especially when it includes the ground truths, can boost model performance significantly more than text contamination alone, complicating the picture of how data purity shapes evaluation results. Notably, when contaminated samples are repeated, performance does not improve monotonically: it rises, peaks, and then declines as contamination is repeated further, suggesting a nuanced relationship between performance and contamination frequency.

In conclusion, this research calls attention to the need for refined contamination definitions and more robust assessment methodologies. The paper's insights warn of the risks of data contamination and encourage more stringent controls and greater transparency in LLM training and evaluation.

Acknowledgements

Finally, the authors acknowledge the supporters of this research, including DARPA, the National Science Foundation, Google Inc., and the Alfred P. Sloan Foundation. Without such support, this investigation into data contamination and its effects on LLMs would not have been possible.

Authors (7)
  1. Minhao Jiang
  2. Ken Ziyu Liu
  3. Ming Zhong
  4. Rylan Schaeffer
  5. Siru Ouyang
  6. Jiawei Han
  7. Sanmi Koyejo