Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models (2402.15938v3)

Published 24 Feb 2024 in cs.CL, cs.AI, cs.CR, cs.LG, and cs.SE

Abstract: Recent claims about the impressive capabilities of LLMs are usually supported by evaluation on open-access benchmarks. Given the vast size and wide-ranging sources of LLMs' training data, that data could explicitly or implicitly include test data, making LLMs susceptible to data contamination. However, due to the opacity of training data, the black-box access of models, and the rapid growth of synthetic training data, detecting and mitigating data contamination for LLMs faces significant challenges. In this paper, we propose CDD, which stands for Contamination Detection via output Distribution for LLMs. CDD requires only the sampled texts to detect data contamination, by identifying the peakedness of the LLM's output distribution. To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution, based on correcting the LLM's output distribution. To facilitate this study, we introduce two benchmarks, DetCon and ComiEval, for the data contamination detection and contamination mitigation evaluation tasks, respectively. Extensive experimental results show that CDD achieves average relative improvements of 21.8%-30.2% over other contamination detection approaches in terms of Accuracy, F1 Score, and AUC, and can effectively detect implicit contamination. TED substantially mitigates performance improvements of up to 66.9% attributable to data contamination across various contamination setups. In real-world applications, we reveal that ChatGPT exhibits a high potential to suffer from data contamination on the HumanEval benchmark.
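The core intuition behind CDD lends itself to a compact illustration: if an LLM has memorized a benchmark example, repeated sampling tends to produce near-identical outputs, so the output distribution is sharply peaked around the greedy decoding. Below is a minimal Python sketch of that idea, assuming edit distance as the similarity measure; the function names, the relative-distance threshold `alpha`, and the decision threshold `tau` are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a CDD-style contamination check.
# The precise scoring rule in the paper may differ; thresholds
# `alpha` and `tau` here are assumed for illustration.
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (rolling rows)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n]


def peakedness_score(greedy_text: str, samples: List[str],
                     alpha: float = 0.05) -> float:
    """Fraction of sampled outputs within a small relative edit distance
    of the greedy output; values near 1.0 indicate a sharply peaked
    output distribution, a memorization signal."""
    if not samples:
        return 0.0
    close = 0
    for s in samples:
        max_len = max(len(greedy_text), len(s), 1)
        if edit_distance(greedy_text, s) / max_len <= alpha:
            close += 1
    return close / len(samples)


def is_contaminated(greedy_text: str, samples: List[str],
                    tau: float = 0.2) -> bool:
    """Flag an example as likely contaminated when sampled outputs
    cluster tightly around the greedy decoding."""
    return peakedness_score(greedy_text, samples) >= tau
```

In practice one would call `is_contaminated(greedy, samples)` per benchmark example, with `samples` drawn at a nonzero temperature, and tune the thresholds against labeled contamination data such as the paper's DetCon benchmark.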

Authors (7)
  1. Yihong Dong
  2. Xue Jiang
  3. Huanyu Liu
  4. Zhi Jin
  5. Ge Li
  6. Bin Gu
  7. Mengfei Yang