
An Open Source Data Contamination Report for Large Language Models (2310.17589v3)

Published 26 Oct 2023 in cs.CL and cs.AI

Abstract: Data contamination in model evaluation has become increasingly prevalent with the growing popularity of LLMs. It allows models to "cheat" via memorisation instead of displaying true capabilities. Therefore, contamination analysis has become a crucial part of reliable model evaluation to validate results. However, existing contamination analysis is usually conducted internally by LLM developers and often lacks transparency and completeness. This paper presents an extensive data contamination report for over 15 popular LLMs across six popular multiple-choice QA benchmarks. We also introduce an open-source pipeline that enables the community to perform contamination analysis on customised data and models. Our experiments reveal varying contamination levels ranging from 1% to 45% across benchmarks, with the contamination degree increasing rapidly over time. Performance analysis of LLMs indicates that data contamination does not necessarily lead to increased model metrics: while significant accuracy boosts of up to 14% and 7% are observed on contaminated C-Eval and Hellaswag benchmarks, only a minimal increase is noted on contaminated MMLU. We also find larger models seem able to gain more advantages than smaller models on contaminated test sets.

An Analysis of Data Contamination in LLMs

The paper "An Open-Source Data Contamination Report for LLMs" offers a comprehensive examination of data contamination in the context of LLMs. The paper acknowledges the increasing prevalence of data contamination, where test examples inadvertently appear in training datasets, potentially compromising the validity of model evaluations by allowing models to memorize rather than generalize.

Methodology and Contributions

The authors present an open-source pipeline to address the transparency deficits in existing data contamination studies, which are typically conducted internally by LLM developers. The pipeline enables the community to systematically analyze contamination across custom datasets and models. The paper covers more than 15 popular LLMs and six multiple-choice question-answering (MCQA) benchmarks: Winogrande, AI2 ARC, CommonsenseQA, HellaSwag, MMLU, and C-Eval. These evaluations reveal contamination levels ranging from 1% to 45.8% across benchmarks.
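The released pipeline itself is not reproduced here; as a minimal sketch of the general idea behind overlap-based contamination checks, the snippet below flags a benchmark item when a large share of its word n-grams reappears in a candidate web document. The n-gram length, the 0.5 threshold, and the toy question are illustrative assumptions, not the paper's actual matching criteria or settings.

```python
import re
from typing import Iterable


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercase, strip punctuation, and return the set of word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, document: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also appear in the document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(document, n)) / len(item_grams)


def is_contaminated(benchmark_item: str, documents: Iterable[str],
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Flag the item if any candidate document reproduces enough of its n-grams."""
    return any(overlap_ratio(benchmark_item, doc, n) >= threshold for doc in documents)


# Toy example: an MCQA item is flagged because a crawled page repeats most of its text.
question = ("Which of the following is a noble gas? "
            "A. Nitrogen B. Oxygen C. Argon D. Hydrogen  Answer: C")
page = ("Quiz answers: Which of the following is a noble gas? "
        "A. Nitrogen B. Oxygen C. Argon D. Hydrogen. The answer is C.")
print(is_contaminated(question, [page]))  # True
```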

Key Findings and Evaluations

A notable insight from the research is the rapid increase in contamination levels over time, as observed by comparing Common Crawl archives from December 2020 to October 2023. Additionally, the paper highlights that larger models tend to benefit more from contaminated datasets compared to smaller models, due to their enhanced memorization capabilities. However, this advantage does not uniformly translate to better performance across all benchmarks; while some datasets exhibit accuracy improvements, others show minimal changes or even reductions in performance.
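The paper performs this temporal comparison against Common Crawl archives. As a rough illustration of how growth between snapshots can be probed, the sketch below counts index captures for a dataset-hosting URL pattern in a late-2020 and a late-2023 crawl using the public Common Crawl CDX index; the crawl IDs, the URL pattern, and the simple line-counting shortcut are illustrative assumptions, not the authors' procedure.

```python
import urllib.error
import urllib.parse
import urllib.request


def count_captures(crawl_id: str, url_pattern: str) -> int:
    """Count pages matching url_pattern returned by one Common Crawl CDX index query."""
    query = urllib.parse.urlencode({"url": url_pattern, "output": "json"})
    endpoint = f"https://index.commoncrawl.org/{crawl_id}-index?{query}"
    try:
        with urllib.request.urlopen(endpoint, timeout=30) as resp:
            # The index returns one JSON record per line; counting lines counts the
            # captures returned for this query (a single page of results in a sketch).
            return sum(1 for line in resp if line.strip())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no captures for this pattern in this crawl
            return 0
        raise


# Illustrative comparison of a dataset-hosting URL pattern across two snapshots.
for crawl in ("CC-MAIN-2020-50", "CC-MAIN-2023-40"):
    print(crawl, count_captures(crawl, "rowanzellers.com/hellaswag/*"))
```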

The results also underscore that contamination is not homogeneously distributed across internet domains, suggesting that strategic domain filtering could mitigate contamination risks. The research provides detailed performance assessments on clean versus contaminated subsets, showing significant accuracy differences particularly for larger models on contaminated sets.
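A simple way to read such assessments is to split each benchmark into clean and contaminated subsets and compare a model's accuracy on the two. The helper below is a generic sketch of that comparison, not the paper's evaluation code; the record fields and example values are hypothetical.

```python
from typing import TypedDict


class EvalRecord(TypedDict):
    correct: bool        # did the model answer this item correctly?
    contaminated: bool   # was this item flagged by the contamination check?


def accuracy_gap(records: list[EvalRecord]) -> tuple[float, float, float]:
    """Return (clean accuracy, contaminated accuracy, gap) for one model on one benchmark."""
    clean = [r["correct"] for r in records if not r["contaminated"]]
    dirty = [r["correct"] for r in records if r["contaminated"]]
    acc_clean = sum(clean) / len(clean) if clean else float("nan")
    acc_dirty = sum(dirty) / len(dirty) if dirty else float("nan")
    return acc_clean, acc_dirty, acc_dirty - acc_clean


# Hypothetical records: a positive gap suggests the model benefits from contaminated items.
records: list[EvalRecord] = [
    {"correct": True, "contaminated": True},
    {"correct": True, "contaminated": True},
    {"correct": False, "contaminated": False},
    {"correct": True, "contaminated": False},
]
print(accuracy_gap(records))  # (0.5, 1.0, 0.5)
```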

Implications and Future Directions

The implications of this work are manifold. Practically, the proposed methodologies give researchers and practitioners tools to audit benchmark results and better understand the impact of contamination on them. Theoretically, the findings prompt a deeper examination of the reliance on potentially contaminated datasets for training and evaluating LLMs.

Future work could expand on this research by employing less restrictive contamination detection methodologies to capture a broader range of contamination scenarios. Additionally, exploring methods to reduce the reliance on web-sourced data, or enhance the robustness of models against memorization, could yield significant benefits.

Conclusion

This paper significantly contributes to the discourse on AI model evaluation integrity by shedding light on the implications of data contamination. It calls for more transparent and community-driven approaches to contamination analysis to ensure the reliable assessment of LLM capabilities and to guide the ongoing development of robust, generalizable AI systems.

Authors (3)
  1. Yucheng Li (31 papers)
  2. Frank Guerin (30 papers)
  3. Chenghua Lin (127 papers)