
An Open Source Data Contamination Report for Large Language Models (2310.17589v3)

Published 26 Oct 2023 in cs.CL and cs.AI

Abstract: Data contamination in model evaluation has become increasingly prevalent with the growing popularity of LLMs. It allows models to "cheat" via memorisation instead of displaying true capabilities. Therefore, contamination analysis has become a crucial part of reliable model evaluation to validate results. However, existing contamination analysis is usually conducted internally by LLM developers and often lacks transparency and completeness. This paper presents an extensive data contamination report for over 15 popular LLMs across six popular multiple-choice QA benchmarks. We also introduce an open-source pipeline that enables the community to perform contamination analysis on customised data and models. Our experiments reveal varying contamination levels ranging from 1% to 45% across benchmarks, with the contamination degree increasing rapidly over time. Performance analysis of LLMs indicates that data contamination does not necessarily lead to increased model metrics: while significant accuracy boosts of up to 14% and 7% are observed on contaminated C-Eval and Hellaswag benchmarks, only a minimal increase is noted on contaminated MMLU. We also find larger models seem able to gain more advantages than smaller models on contaminated test sets.

An Analysis of Data Contamination in LLMs

The paper "An Open-Source Data Contamination Report for LLMs" offers a comprehensive examination of data contamination in the context of LLMs. The paper acknowledges the increasing prevalence of data contamination, where test examples inadvertently appear in training datasets, potentially compromising the validity of model evaluations by allowing models to memorize rather than generalize.

Methodology and Contributions

The authors present an open-source pipeline to address the transparency deficits in existing data contamination studies, which are typically conducted internally by LLM developers. The pipeline enables the community to systematically analyze contamination across custom datasets and models. The paper covers more than 15 popular LLMs and six multiple-choice question-answering (MCQA) benchmarks: Winogrande, AI2 ARC, CommonsenseQA, HellaSwag, MMLU, and C-Eval. These evaluations reveal contamination levels ranging from 1% to 45.8% across benchmarks.
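The released pipeline itself is not reproduced here; as a minimal sketch of the general idea behind overlap-based contamination checks, the snippet below flags a benchmark item when a large share of its word n-grams reappears in a candidate web document. The n-gram length, the 0.5 threshold, and the toy question are illustrative assumptions, not the paper's actual matching criteria or settings.

```python
import re
from typing import Iterable


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercase, strip punctuation, and return the set of word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, document: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also appear in the document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(document, n)) / len(item_grams)


def is_contaminated(benchmark_item: str, documents: Iterable[str],
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Flag the item if any candidate document reproduces enough of its n-grams."""
    return any(overlap_ratio(benchmark_item, doc, n) >= threshold for doc in documents)


# Toy example: an MCQA item is flagged because a crawled page repeats most of its text.
question = ("Which of the following is a noble gas? "
            "A. Nitrogen B. Oxygen C. Argon D. Hydrogen  Answer: C")
page = ("Quiz answers: Which of the following is a noble gas? "
        "A. Nitrogen B. Oxygen C. Argon D. Hydrogen. The answer is C.")
print(is_contaminated(question, [page]))  # True
```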

Key Findings and Evaluations

A notable insight from the research is the rapid increase in contamination levels over time, as observed by comparing Common Crawl archives from December 2020 to October 2023. Additionally, the paper highlights that larger models tend to benefit more from contaminated datasets compared to smaller models, due to their enhanced memorization capabilities. However, this advantage does not uniformly translate to better performance across all benchmarks; while some datasets exhibit accuracy improvements, others show minimal changes or even reductions in performance.
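The paper performs this temporal comparison against Common Crawl archives. As a rough illustration of how growth between snapshots can be probed, the sketch below counts index captures for a dataset-hosting URL pattern in a late-2020 and a late-2023 crawl using the public Common Crawl CDX index; the crawl IDs, the URL pattern, and the simple line-counting shortcut are illustrative assumptions, not the authors' procedure.

```python
import urllib.error
import urllib.parse
import urllib.request


def count_captures(crawl_id: str, url_pattern: str) -> int:
    """Count pages matching url_pattern returned by one Common Crawl CDX index query."""
    query = urllib.parse.urlencode({"url": url_pattern, "output": "json"})
    endpoint = f"https://index.commoncrawl.org/{crawl_id}-index?{query}"
    try:
        with urllib.request.urlopen(endpoint, timeout=30) as resp:
            # The index returns one JSON record per line; counting lines counts the
            # captures returned for this query (a single page of results in a sketch).
            return sum(1 for line in resp if line.strip())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no captures for this pattern in this crawl
            return 0
        raise


# Illustrative comparison of a dataset-hosting URL pattern across two snapshots.
for crawl in ("CC-MAIN-2020-50", "CC-MAIN-2023-40"):
    print(crawl, count_captures(crawl, "rowanzellers.com/hellaswag/*"))
```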

The results also underscore that contamination is not homogeneously distributed across internet domains, suggesting that strategic domain filtering could mitigate contamination risks. The research provides detailed performance assessments on clean versus contaminated subsets, showing significant accuracy differences particularly for larger models on contaminated sets.
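A simple way to read such assessments is to split each benchmark into clean and contaminated subsets and compare a model's accuracy on the two. The helper below is a generic sketch of that comparison, not the paper's evaluation code; the record fields and example values are hypothetical.

```python
from typing import TypedDict


class EvalRecord(TypedDict):
    correct: bool        # did the model answer this item correctly?
    contaminated: bool   # was this item flagged by the contamination check?


def accuracy_gap(records: list[EvalRecord]) -> tuple[float, float, float]:
    """Return (clean accuracy, contaminated accuracy, gap) for one model on one benchmark."""
    clean = [r["correct"] for r in records if not r["contaminated"]]
    dirty = [r["correct"] for r in records if r["contaminated"]]
    acc_clean = sum(clean) / len(clean) if clean else float("nan")
    acc_dirty = sum(dirty) / len(dirty) if dirty else float("nan")
    return acc_clean, acc_dirty, acc_dirty - acc_clean


# Hypothetical records: a positive gap suggests the model benefits from contaminated items.
records: list[EvalRecord] = [
    {"correct": True, "contaminated": True},
    {"correct": True, "contaminated": True},
    {"correct": False, "contaminated": False},
    {"correct": True, "contaminated": False},
]
print(accuracy_gap(records))  # (0.5, 1.0, 0.5)
```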

Implications and Future Directions

The implications of this work are manifold. Practically, the proposed methodologies give researchers and practitioners tools to audit benchmark results and better understand the impact of contamination on them. Theoretically, the findings prompt a deeper examination of the reliance on potentially contaminated datasets for training and evaluating LLMs.

Future work could expand on this research by employing less restrictive contamination detection methodologies to capture a broader range of contamination scenarios. Additionally, exploring methods to reduce the reliance on web-sourced data, or enhance the robustness of models against memorization, could yield significant benefits.

Conclusion

This paper significantly contributes to the discourse on AI model evaluation integrity by shedding light on the implications of data contamination. It calls for more transparent and community-driven approaches to contamination analysis to ensure the reliable assessment of LLM capabilities and to guide the ongoing development of robust, generalizable AI systems.

Authors (3)
  1. Yucheng Li (31 papers)
  2. Frank Guerin (30 papers)
  3. Chenghua Lin (127 papers)