Data Contamination in LLMs: Detection and Implications
The paper, "Time Travel in LLMs: Tracing Data Contamination in LLMs," introduces a method for identifying data contamination in LLMs such as GPT-4 and GPT-3.5. Data contamination refers to instances when test data from downstream tasks unintentionally ends up in the training data of LLMs, potentially skewing their effectiveness and performance evaluation. This paper proposes a robust, cost-effective method to detect such contamination, emphasizing the need for accurate evaluation techniques free from inflated benchmarks due to contaminated datasets.
Approach and Methodology
The authors focus on detecting contamination at the instance level and then generalize these signals to the partition level. The approach comprises the following steps:
- Guided Instruction: The method starts with a guided instruction that prompts the LLM with metadata such as the dataset name, the partition (e.g., train or test), the initial segment of a reference instance, and its label if available. The LLM is asked to complete the remainder of the reference instance; if that instance appeared in its training data, the model is more likely to reproduce it closely.
- General Instruction: As a baseline, a general instruction that omits all dataset-specific metadata asks the model to complete the same initial segment. Comparing the outputs of the two instructions isolates the effect of the guided cues on how closely the completion matches the reference.
- Assessment and Evaluation: The paper introduces two evaluation algorithms. The first measures the overlap between each generated completion and the reference continuation using BLEURT and ROUGE-L, then tests whether the guided instruction yields a statistically significant improvement over the general one. The second uses GPT-4 with a few-shot in-context learning prompt, built from human-assessed examples, to flag completions that are exact or near-exact matches of reference instances. A sketch of the prompt construction and overlap scoring follows this list.
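To make the comparison concrete, here is a minimal sketch of the two instructions and the overlap scoring, assuming the `rouge_score` package and a generic `query_llm` callable for the model under test. The prompt wording is a paraphrase of the paper's description rather than its exact template, and only ROUGE-L is shown (the paper also uses BLEURT).

```python
# A minimal sketch of guided vs. general instruction prompting and overlap
# scoring. The prompt wording and the query_llm callable are hypothetical;
# the paper additionally scores completions with BLEURT.
from rouge_score import rouge_scorer

def guided_instruction(dataset: str, split: str, label: str, first_piece: str) -> str:
    # Metadata-rich prompt: names the dataset and partition and asks the model
    # to reproduce the rest of the instance as it appeared in the dataset.
    return (
        f"You are given the first piece of an instance from the {split} split "
        f"of the {dataset} dataset. Finish the second piece of the instance "
        f"exactly as it appears in the dataset.\n"
        f"Label: {label}\n"
        f"First piece: {first_piece}\n"
        f"Second piece:"
    )

def general_instruction(first_piece: str) -> str:
    # Baseline prompt: the same completion task, with no dataset metadata.
    return (
        "Finish the second piece based on the first piece, such that the two "
        "pieces form a single coherent instance.\n"
        f"First piece: {first_piece}\n"
        f"Second piece:"
    )

def rouge_l(reference: str, completion: str) -> float:
    # ROUGE-L F-measure between the reference continuation and the completion.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, completion)["rougeL"].fmeasure

def score_instance(query_llm, dataset, split, label, first_piece, second_piece):
    """Return (guided_overlap, general_overlap) for one reference instance.

    query_llm is any callable that sends a prompt to the target LLM and
    returns its text completion.
    """
    guided_out = query_llm(guided_instruction(dataset, split, label, first_piece))
    general_out = query_llm(general_instruction(first_piece))
    return rouge_l(second_piece, guided_out), rouge_l(second_piece, general_out)
```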
Detection of Partition-level Contamination
The proposed method extrapolates partition-level contamination from instance-level signals: a partition is flagged as contaminated if the guided instruction yields a statistically significant improvement in overlap scores over the general instruction, or if the GPT-4 judge identifies exact or near-exact matches among the sampled instances (a sketch of this decision rule appears below). The robustness of the approach is demonstrated through experiments on datasets spanning classification, summarization, and NLI tasks.
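The decision rule can be expressed compactly. The sketch below assumes per-instance overlap scores and per-instance verdicts from the GPT-4 judge are already available; the sign-flip permutation test and the match-count thresholds are illustrative choices standing in for the paper's exact statistical test and settings.

```python
# A sketch of the partition-level decision rule: flag a partition as
# contaminated if guided instructions yield significantly higher overlap than
# general instructions, or if the GPT-4 judge reports exact / near-exact
# matches. The permutation test and thresholds are illustrative stand-ins.
import random
from statistics import mean

def sign_flip_pvalue(guided, general, n_resamples=10_000, seed=0):
    """One-sided p-value for mean(guided - general) > 0 under the null of no
    difference, estimated by randomly flipping the sign of each paired diff."""
    rng = random.Random(seed)
    diffs = [g - b for g, b in zip(guided, general)]
    observed = mean(diffs)
    hits = sum(
        mean(d * rng.choice((-1, 1)) for d in diffs) >= observed
        for _ in range(n_resamples)
    )
    return hits / n_resamples

def partition_contaminated(guided_scores, general_scores, judge_labels,
                           alpha=0.05, min_exact=1, min_near_exact=2):
    """Combine the two instance-level signals into a partition-level verdict.

    guided_scores / general_scores: per-instance overlap scores (e.g. ROUGE-L)
    judge_labels: per-instance verdicts from the few-shot GPT-4 judge,
                  e.g. "exact", "near-exact", or "inexact".
    """
    significant = sign_flip_pvalue(guided_scores, general_scores) < alpha
    exact = sum(label == "exact" for label in judge_labels)
    near_exact = sum(label == "near-exact" for label in judge_labels)
    return significant or exact >= min_exact or near_exact >= min_near_exact
```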
Through controlled contamination experiments with GPT-3.5, the authors validate their method's efficacy, highlighting that LLM-generated exact matches of dataset instances strongly indicate contamination.
Experimental Findings
The authors evaluate the method on seven datasets, using both their train and test/validation splits, with snapshots of GPT-3.5 and GPT-4. The results show that:
- The guided instruction paired with GPT-4's few-shot in-context learning judge outperformed the other evaluation methods, agreeing with human judgments in 100% of cases for GPT-4 and 92.86% for GPT-3.5 (a sketch of such a judge prompt follows this list).
- GPT-4 showed evidence of contamination with datasets such as AG News and WNLI, raising concerns about the reliability of widely used evaluation benchmarks.
- The baseline method, "ChatGPT-Cheat?", proved limited: because safety filters block the verbatim reproduction of copyrighted content, it can only label partitions as suspicious rather than confirm contamination.
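To illustrate the few-shot in-context learning evaluation mentioned in the first bullet above, the sketch below assembles a judge prompt from a handful of human-labeled demonstrations; the prompt wording and demonstration format are assumptions based on the paper's description, not its exact prompt. The human verdicts that seed the demonstrations are also what the reported agreement figures are measured against.

```python
# A sketch of the few-shot in-context learning judge: GPT-4 is shown a few
# human-labeled (reference, completion, verdict) demonstrations and asked to
# label a new pair. The wording and format here are assumptions.
def build_judge_prompt(demonstrations, reference, completion):
    """demonstrations: list of (reference, completion, verdict) triples, where
    verdict is 'exact match', 'near-exact match', or 'inexact match'."""
    lines = [
        "Decide whether the generated completion is an exact match, a "
        "near-exact match, or an inexact match of the reference text.",
        "",
    ]
    for ref, comp, verdict in demonstrations:
        lines += [f"Reference: {ref}", f"Completion: {comp}",
                  f"Verdict: {verdict}", ""]
    lines += [f"Reference: {reference}", f"Completion: {completion}", "Verdict:"]
    return "\n".join(lines)
```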
Implications and Future Directions
The presented method offers a way to safeguard the integrity of LLM evaluations by detecting data contamination without direct access to the pre-training data. The authors advocate for greater transparency about LLM training corpora and stress the importance of uncontaminated, unbiased evaluations for advancing NLP model development.
While the method reliably identifies contaminated partitions, future work could focus on pinpointing the sources of contamination and handling its varying manifestations. The paper's findings provide a foundation for more reliable contamination detection techniques and, with them, more trustworthy LLM evaluations.