Data Contamination in LLMs: Detection and Implications
The paper, "Time Travel in LLMs: Tracing Data Contamination in LLMs," introduces a method for identifying data contamination in LLMs such as GPT-4 and GPT-3.5. Data contamination refers to instances when test data from downstream tasks unintentionally ends up in the training data of LLMs, potentially skewing their effectiveness and performance evaluation. This paper proposes a robust, cost-effective method to detect such contamination, emphasizing the need for accurate evaluation techniques free from inflated benchmarks due to contaminated datasets.
Approach and Methodology
The authors focus on detecting contamination at the instance level and then generalize these signals to the partition level. The approach comprises the following steps:
- Guided Instruction: The method starts with a guided instruction that prompts the LLM with metadata such as the dataset name, the partition (e.g., train or test), the initial segment of a reference instance, and its label if available. The LLM is asked to complete the remainder of the reference instance; if that instance appeared in its training data, the model is more likely to reproduce it closely.
- General Instruction: As a baseline, a general instruction that omits all dataset-specific metadata asks the model to complete the same initial segment. Comparing the outputs of the two instructions isolates the effect of the guided cues on how closely the completion matches the reference.
- Assessment and Evaluation: The paper introduces two evaluation algorithms. The first measures the overlap between each generated completion and the reference continuation using BLEURT and ROUGE-L, then tests whether the guided instruction yields a statistically significant improvement over the general one. The second uses GPT-4 with a few-shot in-context learning prompt, built from human-assessed examples, to flag completions that are exact or near-exact matches of reference instances. A sketch of the prompt construction and overlap scoring follows this list.
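To make the comparison concrete, here is a minimal sketch of the two instructions and the overlap scoring, assuming the `rouge_score` package and a generic `query_llm` callable for the model under test. The prompt wording is a paraphrase of the paper's description rather than its exact template, and only ROUGE-L is shown (the paper also uses BLEURT).

```python
# A minimal sketch of guided vs. general instruction prompting and overlap
# scoring. The prompt wording and the query_llm callable are hypothetical;
# the paper additionally scores completions with BLEURT.
from rouge_score import rouge_scorer

def guided_instruction(dataset: str, split: str, label: str, first_piece: str) -> str:
    # Metadata-rich prompt: names the dataset and partition and asks the model
    # to reproduce the rest of the instance as it appeared in the dataset.
    return (
        f"You are given the first piece of an instance from the {split} split "
        f"of the {dataset} dataset. Finish the second piece of the instance "
        f"exactly as it appears in the dataset.\n"
        f"Label: {label}\n"
        f"First piece: {first_piece}\n"
        f"Second piece:"
    )

def general_instruction(first_piece: str) -> str:
    # Baseline prompt: the same completion task, with no dataset metadata.
    return (
        "Finish the second piece based on the first piece, such that the two "
        "pieces form a single coherent instance.\n"
        f"First piece: {first_piece}\n"
        f"Second piece:"
    )

def rouge_l(reference: str, completion: str) -> float:
    # ROUGE-L F-measure between the reference continuation and the completion.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, completion)["rougeL"].fmeasure

def score_instance(query_llm, dataset, split, label, first_piece, second_piece):
    """Return (guided_overlap, general_overlap) for one reference instance.

    query_llm is any callable that sends a prompt to the target LLM and
    returns its text completion.
    """
    guided_out = query_llm(guided_instruction(dataset, split, label, first_piece))
    general_out = query_llm(general_instruction(first_piece))
    return rouge_l(second_piece, guided_out), rouge_l(second_piece, general_out)
```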
Detection of Partition-level Contamination
The proposed method extrapolates partition-level contamination from instance-level signals: a partition is flagged as contaminated if the guided instruction yields a statistically significant improvement in overlap scores over the general instruction, or if the GPT-4 judge identifies exact or near-exact matches among the sampled instances (a sketch of this decision rule appears below). The robustness of the approach is demonstrated through experiments on datasets spanning classification, summarization, and NLI tasks.
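The decision rule can be expressed compactly. The sketch below assumes per-instance overlap scores and per-instance verdicts from the GPT-4 judge are already available; the sign-flip permutation test and the match-count thresholds are illustrative choices standing in for the paper's exact statistical test and settings.

```python
# A sketch of the partition-level decision rule: flag a partition as
# contaminated if guided instructions yield significantly higher overlap than
# general instructions, or if the GPT-4 judge reports exact / near-exact
# matches. The permutation test and thresholds are illustrative stand-ins.
import random
from statistics import mean

def sign_flip_pvalue(guided, general, n_resamples=10_000, seed=0):
    """One-sided p-value for mean(guided - general) > 0 under the null of no
    difference, estimated by randomly flipping the sign of each paired diff."""
    rng = random.Random(seed)
    diffs = [g - b for g, b in zip(guided, general)]
    observed = mean(diffs)
    hits = sum(
        mean(d * rng.choice((-1, 1)) for d in diffs) >= observed
        for _ in range(n_resamples)
    )
    return hits / n_resamples

def partition_contaminated(guided_scores, general_scores, judge_labels,
                           alpha=0.05, min_exact=1, min_near_exact=2):
    """Combine the two instance-level signals into a partition-level verdict.

    guided_scores / general_scores: per-instance overlap scores (e.g. ROUGE-L)
    judge_labels: per-instance verdicts from the few-shot GPT-4 judge,
                  e.g. "exact", "near-exact", or "inexact".
    """
    significant = sign_flip_pvalue(guided_scores, general_scores) < alpha
    exact = sum(label == "exact" for label in judge_labels)
    near_exact = sum(label == "near-exact" for label in judge_labels)
    return significant or exact >= min_exact or near_exact >= min_near_exact
```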
Through controlled contamination experiments with GPT-3.5, the authors validate their method's efficacy, highlighting that LLM-generated exact matches of dataset instances strongly indicate contamination.
Experimental Findings
The authors evaluate the method on seven datasets, using both their train and test/validation splits, with snapshots of GPT-3.5 and GPT-4. The results show that:
- The guided instruction paired with GPT-4's few-shot in-context learning judge outperformed the other evaluation methods, agreeing with human judgments in 100% of cases for GPT-4 and 92.86% for GPT-3.5 (a sketch of such a judge prompt follows this list).
- GPT-4 showed evidence of contamination with datasets such as AG News and WNLI, raising concerns about the reliability of widely used evaluation benchmarks.
- The baseline method, "ChatGPT-Cheat?", proved limited: because safety filters block the verbatim reproduction of copyrighted content, it can only label partitions as suspicious rather than confirm contamination.
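To illustrate the few-shot in-context learning evaluation mentioned in the first bullet above, the sketch below assembles a judge prompt from a handful of human-labeled demonstrations; the prompt wording and demonstration format are assumptions based on the paper's description, not its exact prompt. The human verdicts that seed the demonstrations are also what the reported agreement figures are measured against.

```python
# A sketch of the few-shot in-context learning judge: GPT-4 is shown a few
# human-labeled (reference, completion, verdict) demonstrations and asked to
# label a new pair. The wording and format here are assumptions.
def build_judge_prompt(demonstrations, reference, completion):
    """demonstrations: list of (reference, completion, verdict) triples, where
    verdict is 'exact match', 'near-exact match', or 'inexact match'."""
    lines = [
        "Decide whether the generated completion is an exact match, a "
        "near-exact match, or an inexact match of the reference text.",
        "",
    ]
    for ref, comp, verdict in demonstrations:
        lines += [f"Reference: {ref}", f"Completion: {comp}",
                  f"Verdict: {verdict}", ""]
    lines += [f"Reference: {reference}", f"Completion: {completion}", "Verdict:"]
    return "\n".join(lines)
```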
Implications and Future Directions
The presented method offers a way to safeguard the integrity of LLM evaluations by detecting data contamination without direct access to the pre-training data. The authors advocate for greater transparency about LLM training corpora and stress the importance of uncontaminated, unbiased evaluations for advancing NLP model development.
While the method reliably identifies contaminated partitions, future work could focus on pinpointing the sources of contamination and handling its varying manifestations. The paper's findings provide a foundation for more reliable contamination detection techniques and, with them, more trustworthy LLM evaluations.