Data Contamination Report from the 2024 CONDA Shared Task (2407.21530v2)

Published 31 Jul 2024 in cs.CL and cs.LG

Abstract: The 1st Workshop on Data Contamination (CONDA 2024) focuses on all relevant aspects of data contamination in natural language processing, where data contamination is understood as situations where evaluation data is included in pre-training corpora used to train large scale models, compromising evaluation results. The workshop fostered a shared task to collect evidence on data contamination in current available datasets and models. The goal of the shared task and associated database is to assist the community in understanding the extent of the problem and to assist researchers in avoiding reporting evaluation results on known contaminated resources. The shared task provides a structured, centralized public database for the collection of contamination evidence, open to contributions from the community via GitHub pull requests. This first compilation paper is based on 566 reported entries over 91 contaminated sources from a total of 23 contributors. The details of the individual contamination events are available in the platform. The platform continues to be online, open to contributions from the community.

Authors (28)
  1. Oscar Sainz (14 papers)
  2. Iker García-Ferrero (14 papers)
  3. Alon Jacovi (26 papers)
  4. Jon Ander Campos (20 papers)
  5. Yanai Elazar (44 papers)
  6. Eneko Agirre (53 papers)
  7. Yoav Goldberg (142 papers)
  8. Wei-Lin Chen (12 papers)
  9. Jenny Chim (12 papers)
  10. Leshem Choshen (78 papers)
  11. Luca D'Amico-Wong (7 papers)
  12. Melissa Dell (17 papers)
  13. Run-Ze Fan (9 papers)
  14. Shahriar Golchin (9 papers)
  15. Yucheng Li (31 papers)
  16. Pengfei Liu (191 papers)
  17. Bhavish Pahwa (3 papers)
  18. Ameya Prabhu (37 papers)
  19. Suryansh Sharma (3 papers)
  20. Emily Silcock (7 papers)

Summary

Data Contamination Report from the 2024 CONDA Shared Task

The paper "Data Contamination Report from the 2024 CONDA Shared Task" presents an extensive analysis on the issue of data contamination within the NLP ecosystem. Data contamination is defined as the inadvertent inclusion of evaluation data within pre-training corpora used for training large-scale models. This paper sheds light on the systemic presence of data contamination, which can compromise the validity of model evaluation results.

Significance and Methodology

Data contamination can introduce biases and artificially inflate model performance on specific tasks, thus misleading evaluations of model generalization capabilities. The 2024 CONDA Shared Task was designed to address this problem by fostering a collaborative effort to document instances of data contamination across existing datasets and models.

A structured, centralized public database was established to collect contamination evidence and is open to community contributions via GitHub pull requests. The database currently contains 566 contamination entries covering 91 contaminated sources, submitted by 23 contributors. Contributors relied on two families of approaches to identify contamination events:

  • Data-based approaches: These analyze pre-training corpora directly, using techniques such as n-gram or full-string overlap to find evaluation data that appears verbatim in the training text (see the first sketch after this list).
  • Model-based approaches: These inspect the behavior of trained models rather than the corpora, through methods such as Membership Inference Attacks (MIA), analysis of output probabilities, or direct prompting of the model (see the second sketch below).
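To make the data-based route concrete, the following minimal Python sketch flags an evaluation instance whose word n-grams also occur verbatim in pre-training documents. It is an illustrative simplification rather than the procedure of any specific report: the n-gram length and the decision threshold are arbitrary assumptions and differ across the contamination reports in the database.

```python
import re
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercase the text, keep word tokens only, and return its set of word n-grams."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(eval_example: str, corpus_docs: Iterable[str], n: int = 13) -> float:
    """Fraction of the evaluation example's n-grams that are found verbatim in the corpus."""
    example_grams = ngrams(eval_example, n)
    if not example_grams:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(example_grams & corpus_grams) / len(example_grams)

if __name__ == "__main__":
    # Toy corpus: one crawled page that happens to quote the benchmark item word for word.
    corpus = ["... a forum post that copies the benchmark question and its answer word for word ..."]
    example = "the benchmark question and its answer"
    ratio = overlap_ratio(example, corpus, n=3)  # real reports typically use longer n-grams (e.g., 13)
    if ratio > 0.5:  # threshold chosen purely for illustration
        print(f"possible contamination (n-gram overlap = {ratio:.0%})")
```

For the model-based route, one common intuition behind membership-inference-style checks is that a model assigns a noticeably higher likelihood to benchmark text it memorized during training than to a meaning-preserving paraphrase of the same content. The sketch below assumes a Hugging Face causal language model ("gpt2" is only a placeholder) and compares average per-token log-likelihoods; the reports collected in the database rely on a range of related techniques (guided prompting, completion of partial instances, output-probability analysis), so this is a schematic illustration rather than any contributor's exact method.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM on the Hugging Face Hub works the same way

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def avg_log_likelihood(text: str) -> float:
    """Average per-token log-likelihood the model assigns to `text`."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return -out.loss.item()  # the loss is the mean negative log-likelihood

original = "The quick brown fox jumps over the lazy dog."  # stand-in for a verbatim benchmark instance
paraphrase = "A fast brown fox leaps over a sleepy dog."   # meaning-preserving rewrite

# A large, consistent gap across many instances of a benchmark suggests the model
# has seen the exact benchmark text during training, i.e., possible contamination.
gap = avg_log_likelihood(original) - avg_log_likelihood(paraphrase)
print(f"log-likelihood gap: {gap:.3f}")
```

Neither check is conclusive on its own; the details of how each individual contamination event was established are recorded with the corresponding entry in the platform.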

Compilation of Evidence

The paper systematically categorized 42 contaminated sources, 91 datasets, and 566 contamination entries:

  • Contaminated Corpora: Reports were accumulated for corpora that are largely based on Common Crawl snapshots or compiled from multiple sources. Among commonly used corpora, C4, RedPajama v2, OSCAR, and the Pile showed a substantial number of contamination instances.
  • Contaminated Models: Models like GPT-3, GPT-4, and FLAN were frequently reported as contaminated. Contamination instances were also documented for open models like Mistral and Llama 2.

High-profile datasets such as GLUE, AI2 ARC, MMLU, and GSM8K emerged as frequently contaminated evaluation benchmarks. Contamination events were identified across various NLP tasks including text-scoring and multiple-choice question answering.

Trends and Statistics

Analyzing the dataset publication years, the majority of contamination reports pertained to datasets published between 2018 and 2021. The data reveals that newer models tend to be contaminated with more recent datasets. For instance, GPT-4 (released in 2023) often contained contamination from datasets published between 2018 and 2022, whereas GPT-3 (launched in 2020) predominantly showed contamination from datasets around 2016.

From the perspective of task contamination, text-scoring, QA, and multiple-choice QA tasks were among the most affected. Moreover, datasets with high download rates from platforms like Hugging Face are more likely to exhibit contamination due to their extensive usage in model training and evaluation.

Implications and Future Directions

The findings underscore the critical need for vigilant practices to prevent data contamination, especially as the scale of models and datasets continues to grow. The shared responsibility of identifying and mitigating data contamination lies with researchers, developers, and the broader NLP community. This report provides an essential resource and structured methodology for maintaining the integrity of model evaluations.

Going forward, data-based and model-based detection techniques will need further refinement, especially in light of new and evolving datasets and models. Enhanced transparency and continued community contributions will be pivotal in sustaining robust and unbiased NLP research.

The data contamination database remains open for further submissions, ensuring that this crucial work continues to aid researchers in the timely identification and reporting of contamination instances. Such initiatives are vital for upholding the reliability and generalizability of NLP models.

Conclusion

This paper comprehensively documents instances and trends of data contamination in NLP, providing a valuable resource to the research community. By cataloging both contaminated and non-contaminated instances across a wide range of corpora and models, it offers crucial insights and methodologies to tackle data contamination challenges. This database serves as a cornerstone for the community’s ongoing efforts in addressing this pertinent issue.