Data Contamination Report from the 2024 CONDA Shared Task
The paper "Data Contamination Report from the 2024 CONDA Shared Task" presents an extensive analysis on the issue of data contamination within the NLP ecosystem. Data contamination is defined as the inadvertent inclusion of evaluation data within pre-training corpora used for training large-scale models. This paper sheds light on the systemic presence of data contamination, which can compromise the validity of model evaluation results.
Significance and Methodology
Data contamination can introduce biases and artificially inflate model performance on specific tasks, thus misleading evaluations of model generalization capabilities. The 2024 CONDA Shared Task was designed to address this problem by fostering a collaborative effort to document instances of data contamination across existing datasets and models.
A structured, centralized public database was established to collect evidence of contamination and is open to community contributions via GitHub. The database currently contains 566 contamination entries covering 91 datasets, contributed by 23 researchers. Both data-based and model-based approaches were employed to identify contamination events:
- Data-based approaches: These analyze pre-training corpora directly, using techniques such as n-gram or full-string overlap to detect evaluation data (see the first sketch after this list).
- Model-based approaches: These inspect model behavior through methods such as Membership Inference Attacks (MIAs), typically by analyzing output probabilities or by prompting the model directly (see the second sketch below).
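To make the data-based approach concrete, here is a minimal sketch of n-gram overlap detection between an evaluation set and a pre-training corpus. The function names, whitespace tokenization, and the 13-token window are illustrative assumptions, not the exact protocol of any report in the database:

```python
def ngrams(tokens, n=13):
    """Yield all n-grams of a token sequence as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

def build_ngram_index(corpus_docs, n=13):
    """Index every n-gram in the pre-training corpus. At web scale
    this would be a Bloom filter or suffix-array lookup rather than
    an in-memory set."""
    index = set()
    for doc in corpus_docs:
        index.update(ngrams(doc.lower().split(), n))
    return index

def contamination_rate(eval_examples, corpus_index, n=13):
    """Fraction of evaluation examples that share at least one
    n-gram with the indexed pre-training corpus."""
    flagged = sum(
        1
        for ex in eval_examples
        if any(g in corpus_index for g in ngrams(ex.lower().split(), n))
    )
    return flagged / max(len(eval_examples), 1)
```

Full-string overlap is the special case in which the entire evaluation instance, rather than any single n-gram, must appear in the corpus.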
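And a minimal sketch of the model-based side: a loss-comparison membership-inference signal using Hugging Face Transformers. The model name "gpt2" stands in for any causal LM, and the comparison against reference texts is an illustrative setup rather than a specific method from the report:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sequence_nll(text):
    """Average per-token negative log-likelihood of `text` under the
    model; unusually low values on evaluation instances are one
    membership-inference signal."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()

def membership_gap(eval_texts, reference_texts):
    """Compare mean NLL on evaluation instances against mean NLL on
    reference texts the model should not have memorized (e.g. fresh
    paraphrases); a large positive gap suggests the evaluation data
    was seen during training."""
    eval_nll = sum(map(sequence_nll, eval_texts)) / len(eval_texts)
    ref_nll = sum(map(sequence_nll, reference_texts)) / len(reference_texts)
    return ref_nll - eval_nll
```

Prompting-based variants instead ask the model to complete an instance verbatim from its prefix and measure how closely the continuation matches the true suffix.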
Compilation of Evidence
The paper systematically categorized 42 contaminated sources (corpora and models), 91 datasets, and 566 contamination entries:
- Contaminated Corpora: Reports were accumulated for corpora largely based on CommonCrawl snapshots or compiled from multiple sources. Among commonly used corpora, C4, RedPajama v2, OSCAR, and the Pile accounted for a significant share of contamination reports.
- Contaminated Models: Models like GPT-3, GPT-4, and FLAN were frequently reported as contaminated. Contamination instances were also documented for open models like Mistral and Llama 2.
High-profile datasets such as GLUE, AI2 ARC, MMLU, and GSM8K emerged as frequently contaminated evaluation benchmarks. Contamination events were identified across various NLP tasks including text-scoring and multiple-choice question answering.
Trends and Statistics
An analysis of dataset publication years shows that the majority of contamination reports concern datasets published between 2018 and 2021, and that newer models tend to be contaminated with more recent datasets. For instance, GPT-4 (released in 2023) was often reported as contaminated with datasets published between 2018 and 2022, whereas GPT-3 (released in 2020) predominantly showed contamination involving datasets from around 2016.
By task, text-scoring, question answering (QA), and multiple-choice QA were among the most affected. Moreover, datasets with high download counts on platforms such as Hugging Face are more likely to be reported as contaminated, reflecting their extensive use in model training and evaluation.
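These breakdowns are simple aggregations over the database entries. A minimal sketch, assuming an illustrative record layout rather than the database's actual schema:

```python
from collections import Counter

# Illustrative records; the real database stores one entry per
# (contaminated source, evaluation dataset) report, but this exact
# field layout is an assumption made for the example.
reports = [
    {"source": "GPT-4", "dataset": "MMLU", "dataset_year": 2020, "task": "multiple-choice QA"},
    {"source": "GPT-4", "dataset": "GSM8K", "dataset_year": 2021, "task": "QA"},
    {"source": "GPT-3", "dataset": "GLUE", "dataset_year": 2018, "task": "text-scoring"},
]

by_year = Counter(r["dataset_year"] for r in reports)
by_task = Counter(r["task"] for r in reports)

for year, count in sorted(by_year.items()):
    print(year, count)        # reports per dataset publication year
print(by_task.most_common())  # most frequently affected tasks first
```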
Implications and Future Directions
The findings underscore the critical need for vigilant practices to prevent data contamination, especially as the scale of models and datasets continues to grow. The shared responsibility of identifying and mitigating data contamination lies with researchers, developers, and the broader NLP community. This report provides an essential resource and structured methodology for maintaining the integrity of model evaluations.
Going forward, data-based and model-based detection techniques will need continued refinement as new datasets and models emerge. Enhanced transparency and sustained community contributions will be pivotal to keeping NLP research robust and unbiased.
The data contamination database remains open for further submissions, so that contamination can continue to be identified and reported in a timely manner. Such initiatives are vital for upholding the reliability and generalizability of NLP models.
Conclusion
This paper comprehensively documents instances and trends of data contamination in NLP, providing a valuable resource to the research community. By cataloging both contaminated and non-contaminated instances across a wide range of corpora and models, it offers crucial insights and methodologies to tackle data contamination challenges. This database serves as a cornerstone for the community’s ongoing efforts in addressing this pertinent issue.