Detecting and preventing data contamination in LLM evaluation

Develop reliable, scalable techniques to detect and prevent data contamination—defined as any overlap between training data and test data that causes benchmark results to overestimate generalization performance—in the pretraining and evaluation pipelines of large language models.

Background

The paper defines data contamination (also called test-set contamination) as any overlap between a model’s training data and the benchmark test data that inflates reported generalization. The authors note that the scale and limited curation of modern pretraining corpora amplify these concerns for LLM evaluations.

While the paper’s main focus is on a distinct issue—training on the test task—it acknowledges that robustly detecting and preventing contamination remains unresolved despite multiple recent efforts. The authors cite contemporary work highlighting the difficulty of tracing and mitigating contamination and emphasize the need for principled solutions that can function at the scale of current LLM pipelines.
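To make the overlap-based definition above concrete, below is a minimal sketch (not from the paper) of the kind of surface-level n-gram overlap check often used as a first-pass contamination filter over pretraining corpora. The function names and the 13-gram window are illustrative assumptions; the check's blindness to paraphrased or translated copies of test items is one reason the problem remains open.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Lowercased word n-grams; a 13-gram window is a common choice in dedup pipelines."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated(test_example: str, training_docs: Iterable[str], n: int = 13) -> bool:
    """Flag a test example if any of its n-grams appears in any training document.

    This is only a surface-overlap heuristic: it misses paraphrased, reformatted,
    or translated copies of test data, and scanning web-scale corpora this way
    requires indexing rather than the linear scan shown here.
    """
    test_grams = ngrams(test_example, n)
    if not test_grams:
        return False
    return any(test_grams & ngrams(doc, n) for doc in training_docs)


# Illustrative example: a training document quoting a benchmark item verbatim.
train = [
    "blog post copying a benchmark item: the quick brown fox jumps over the "
    "lazy dog while reciting all fifty state capitals in alphabetical order",
]
test = (
    "the quick brown fox jumps over the lazy dog while reciting all fifty "
    "state capitals in alphabetical order"
)
print(contaminated(test, train))  # True: verbatim overlap detected
```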

References

However, detecting and preventing data contamination is currently an open problem (Gunasekar et al., 2023; Yang et al., 2023; Golchin and Surdeanu, 2023).

Training on the Test Task Confounds Evaluation and Emergence (2407.07890 - Dominguez-Olmedo et al., 10 Jul 2024) in Section 7 (Related work), Data contamination paragraph