Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation (2406.14644v1)

Published 20 Jun 2024 in cs.CL

Abstract: Data contamination has garnered increased attention in the era of LLMs due to the reliance on extensive internet-derived training corpora. The issue of training corpus overlap with evaluation benchmarks--referred to as contamination--has been the focus of significant recent research. This body of work aims to identify contamination, understand its impacts, and explore mitigation strategies from diverse perspectives. However, comprehensive studies that provide a clear pathway from foundational concepts to advanced insights are lacking in this nascent field. Therefore, we present a comprehensive survey in the field of data contamination, laying out the key issues, methodologies, and findings to date, and highlighting areas in need of further research and development. In particular, we begin by examining the effects of data contamination across various stages and forms. We then provide a detailed analysis of current contamination detection methods, categorizing them to highlight their focus, assumptions, strengths, and limitations. We also discuss mitigation strategies, offering a clear guide for future research. This survey serves as a succinct overview of the most recent advancements in data contamination research, providing a straightforward guide for the benefit of future research endeavors.

Authors (7)
  1. Chunyuan Deng (9 papers)
  2. Yilun Zhao (59 papers)
  3. Yuzhao Heng (4 papers)
  4. Yitong Li (95 papers)
  5. Jiannan Cao (9 papers)
  6. Xiangru Tang (62 papers)
  7. Arman Cohan (121 papers)
Citations (4)