Proving Test Set Contamination in Black Box Language Models (2310.17623v2)

Published 26 Oct 2023 in cs.CL and cs.LG

Abstract: LLMs are trained on vast amounts of internet data, prompting concerns and speculation that they have memorized public benchmarks. Going from speculation to proof of contamination is challenging, as the pretraining data used by proprietary models are often not publicly accessible. We show that it is possible to provide provable guarantees of test set contamination in LLMs without access to pretraining data or model weights. Our approach leverages the fact that when there is no data contamination, all orderings of an exchangeable benchmark should be equally likely. In contrast, the tendency for LLMs to memorize example order means that a contaminated LLM will find certain canonical orderings to be much more likely than others. Our test flags potential contamination whenever the likelihood of a canonically ordered benchmark dataset is significantly higher than the likelihood after shuffling the examples. We demonstrate that our procedure is sensitive enough to reliably prove test set contamination in challenging situations, including models as small as 1.4 billion parameters, on small test sets of only 1000 examples, and datasets that appear only a few times in the pretraining corpus. Using our test, we audit five popular publicly accessible LLMs for test set contamination and find little evidence for pervasive contamination.

A Formal Analysis of Test Set Contamination in Black Box LLMs

The paper "Proving Test Set Contamination in Black Box LLMs" addresses a fundamental issue related to the trustworthiness of performance metrics of LLMs. Specifically, it confronts the challenge of verifying whether a model's pre-training data has inadvertently included evaluation benchmarks, thereby skewing performance metrics through memorization rather than genuine generalization. This issue cannot be easily dismissed or identified due to the proprietary nature of models and the opacity of their training data.

Methodological Contributions

The authors introduce a statistically rigorous approach to detecting test set contamination. Their method requires access neither to the model's pre-training data nor to its weights. Instead, it leverages the statistical property of exchangeability: for an exchangeable benchmark, every ordering of its examples is equally likely, so an LLM that has never seen the benchmark during training should show no preference for any particular ordering. The crux of the method is to compare the likelihood the model assigns to the benchmark's canonical ordering against the likelihoods of random permutations; a markedly higher likelihood for the canonical ordering is evidence of memorization.
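
To make the ordering comparison concrete, the sketch below shows one way such a permutation test could be implemented. It assumes a hypothetical `log_likelihood(examples)` helper that returns the model's total log-probability of the benchmark examples concatenated in the given order; the helper, the permutation count, and the add-one p-value correction are illustrative choices, not the authors' released code.

```python
# Hypothetical sketch of the permutation-style contamination test described above.
# `log_likelihood(examples)` is an assumed helper that scores the benchmark
# examples concatenated in the given order under the model being audited.
import random


def permutation_p_value(examples, log_likelihood, num_permutations=100, seed=0):
    """Estimate how often a shuffled ordering is at least as likely as the canonical one."""
    rng = random.Random(seed)
    canonical_ll = log_likelihood(examples)

    at_least_as_likely = 0
    for _ in range(num_permutations):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        if log_likelihood(shuffled) >= canonical_ll:
            at_least_as_likely += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    return (at_least_as_likely + 1) / (num_permutations + 1)
```

A small p-value indicates that the canonical ordering is unusually likely relative to shuffled orderings, which is exactly the signature of memorized example order.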

A key innovation is the sharded likelihood comparison test. By partitioning the benchmark into multiple shards and comparing, within each shard, the log-likelihood of the canonical ordering against that of permuted orderings, the authors improve both the statistical power and the computational efficiency of their method. This sharding technique sidesteps the statistical and computational limits of a conventional permutation test, and the aggregated statistic comes with asymptotic false positive guarantees that support the validity of any contamination it flags.
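
A minimal sketch of a sharded comparison in this spirit is given below, reusing the assumed `log_likelihood` helper from the previous example. The shard count, the number of permutations per shard, and the use of a one-sided t-test to aggregate per-shard differences are illustrative assumptions rather than the paper's exact statistic.

```python
# Hedged sketch of a sharded likelihood comparison test.
import random
from statistics import mean

from scipy import stats


def sharded_comparison_p_value(examples, log_likelihood, num_shards=10,
                               perms_per_shard=5, seed=0):
    rng = random.Random(seed)
    shard_size = len(examples) // num_shards
    differences = []

    for s in range(num_shards):
        shard = examples[s * shard_size:(s + 1) * shard_size]
        canonical_ll = log_likelihood(shard)

        shuffled_lls = []
        for _ in range(perms_per_shard):
            shuffled = shard[:]
            rng.shuffle(shuffled)
            shuffled_lls.append(log_likelihood(shuffled))

        # Positive difference = the canonical ordering is preferred on this shard.
        differences.append(canonical_ll - mean(shuffled_lls))

    # One-sided test: with no contamination, the mean per-shard difference should be ~0.
    t_stat, p_two_sided = stats.ttest_1samp(differences, popmean=0.0)
    return p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
```

Because each shard contributes an independent comparison, the aggregated statistic can reach small p-values with far fewer model evaluations than a full permutation test over the entire benchmark.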

Empirical Findings

The authors present convincing empirical evaluations across a range of experimental settings. Specifically, they inject known test sets into the pre-training corpora of 1.4-billion-parameter models. The experiments show that the proposed method reliably detects even low rates of contamination, achieving strong statistical significance in particular when a dataset is duplicated ten times or more within the training corpus. Furthermore, the sharded likelihood comparison test outperforms a traditional permutation test in computationally demanding settings.

Testing on publicly accessible models such as LLaMA 2, Mistral-7B, and GPT-2 yields little evidence of widespread contamination, consistent with previous findings by model developers. This empirical application underlines the method's potential as a tool for independent audits of benchmarking integrity in LLMs.

Implications and Future Directions

This work has both practical and theoretical implications. Practically, it provides the research community with a robust tool for independently verifying whether a model's reported benchmark performance is compromised by training-set contamination. The authors release their models and datasets as benchmarks for future contamination-detection work, encouraging further development in this vital area. Theoretically, the method refines our understanding of information leakage from large-scale pre-training corpora and suggests new directions for privacy-preserving machine learning research.

Future research avenues could explore improving the power of these methods to detect single-instance contamination, aligning theoretical developments with practical applications in model auditing. Moreover, expanding the framework to handle non-exchangeable datasets, or those partially represented in training data without direct duplication, would enhance its applicability. Given the growing complexity and capability of LLMs, ensuring the veracity of their performance remains an essential task for advancing reliable AI deployment.

Conclusion

The paper presents a rigorous framework for detecting test set contamination in black box LLMs, enriching the methodologies available to the field. The statistical tests it develops are both computationally efficient and powerful, offering a principled way to uncover hidden dataset contamination. The work underscores the need for transparency in LLM training and advocates consistent external auditing to uphold the integrity of AI research benchmarking.

Authors (5)
  1. Yonatan Oren (5 papers)
  2. Nicole Meister (9 papers)
  3. Niladri Chatterji (7 papers)
  4. Faisal Ladhak (31 papers)
  5. Tatsunori B. Hashimoto (23 papers)
Citations (98)