Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions

Published 24 Oct 2024 in cs.CL | (2410.18966v3)

Abstract: LLMs have demonstrated strong performance across various benchmarks, showing potential as general-purpose task solvers. However, because LLMs are typically trained on vast amounts of data, a significant concern in their evaluation is data contamination, where overlap between training data and evaluation datasets inflates performance assessments. Multiple approaches have been developed to identify data contamination, but they rely on specific assumptions that may not hold universally across different settings. To bridge this gap, we systematically review 50 papers on data contamination detection, categorize the underlying assumptions, and assess whether they have been rigorously validated. We identify and analyze eight categories of assumptions and test three of them as case studies. Our case studies focus on detecting direct, instance-level data contamination, which is also referred to as Membership Inference Attacks (MIA). Our analysis reveals that MIA approaches based on these three assumptions can perform similarly to random guessing on datasets used in LLM pretraining, suggesting that current LLMs might learn data distributions rather than memorize individual instances. Meanwhile, MIA can easily fail when there are data distribution shifts between the seen and unseen instances.
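
To make "similar performance to random guessing" concrete, the following is a minimal sketch of how a score-based membership detector is commonly evaluated with AUC. It is not the paper's code or data: the scores are synthetic Gaussian draws standing in for a hypothetical per-instance signal (for example, a negative loss), and the names `auc`, `simulate`, and `shift` are illustrative assumptions. When seen and unseen scores come from the same distribution, AUC sits near 0.5. When the two pools differ in distribution, AUC moves away from 0.5 even though nothing in the simulation models membership itself.

```python
import random


def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member.

    Ties count as half a win. A value near 0.5 means the score cannot
    separate the two groups, i.e. it behaves like random guessing.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


def simulate(shift, n=500, seed=0):
    """Draw synthetic scores (an assumption, not real model outputs).

    `shift` is the gap between the mean score of the seen pool and the
    unseen pool. It stands in for any systematic difference between the
    pools, whether caused by membership or by a distribution shift.
    """
    rng = random.Random(seed)
    members = [rng.gauss(0.0, 1.0) for _ in range(n)]
    nonmembers = [rng.gauss(-shift, 1.0) for _ in range(n)]
    return auc(members, nonmembers)


same_dist = simulate(shift=0.0)  # identical score distributions
shifted = simulate(shift=2.0)    # pools differ in distribution

print(f"AUC, same distribution: {same_dist:.3f}")
print(f"AUC, shifted pools:     {shifted:.3f}")
```

The second case is the reason a high AUC alone does not confirm that a detector identifies membership: any difference between the seen and unseen pools produces the same separation.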
