
Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks (2305.10160v2)

Published 17 May 2023 in cs.CL and cs.AI

Abstract: Data contamination has become prevalent and challenging with the rise of models pretrained on large automatically-crawled corpora. For closed models, the training data becomes a trade secret, and even for open models, it is not trivial to detect contamination. Strategies such as leaderboards with hidden answers, or using test data which is guaranteed to be unseen, are expensive and become fragile with time. Assuming that all relevant actors value clean test data and will cooperate to mitigate data contamination, what can be done? We propose three strategies that can make a difference: (1) Test data made public should be encrypted with a public key and licensed to disallow derivative distribution; (2) demand training exclusion controls from closed API holders, and protect your test data by refusing to evaluate without them; (3) avoid data which appears with its solution on the internet, and release the web-page context of internet-derived data along with the data. These strategies are practical and can be effective in preventing data contamination.
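The first proposed strategy — releasing test data only in encrypted form so that automatic crawlers never ingest it as plain text — can be sketched as follows. This is an illustrative toy scheme (a SHA-256-derived keystream XORed with the data), not the authors' tooling; an actual release would use standard cryptographic software such as GPG, with the key distributed alongside the license terms. The dataset contents and key below are hypothetical.

```python
import base64
import hashlib
import json

def keystream(key: bytes, n: int) -> bytes:
    """Derive n pseudo-random bytes by hashing key || counter blocks."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def encrypt(plaintext: bytes, key: bytes) -> str:
    """XOR the plaintext with the keystream and base64-encode the result,
    so the released file contains no crawlable plain text."""
    ks = keystream(key, len(plaintext))
    return base64.b64encode(bytes(a ^ b for a, b in zip(plaintext, ks))).decode()

def decrypt(ciphertext: str, key: bytes) -> bytes:
    """Reverse encrypt(): base64-decode, then XOR with the same keystream."""
    data = base64.b64decode(ciphertext)
    ks = keystream(key, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))

# Hypothetical test set and release key (the key would be published
# separately, e.g. in the paper or repository README, under the license).
examples = [{"question": "2+2?", "answer": "4"}]
key = b"benchmark-release-key"

blob = encrypt(json.dumps(examples).encode(), key)   # what gets uploaded
restored = json.loads(decrypt(blob, key).decode())   # what evaluators load
assert restored == examples
```

The point of the scheme is not secrecy from determined humans — the key is public — but that the plain-text answers never appear verbatim on a crawlable web page, so they cannot silently end up in a pretraining corpus.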
