Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation (2309.10677v2)

Published 19 Sep 2023 in cs.CL and cs.AI

Abstract: Data contamination in model evaluation is becoming increasingly prevalent as the massive training corpora of LLMs often unintentionally include benchmark samples. Contamination analysis has therefore become an essential part of reliable model evaluation. However, existing methods of contamination analysis require access to the entire training data, which is often confidential for recent models. This prevents the community from rigorously auditing these models and accurately assessing their capabilities. In this paper, we propose a novel method to quantify contamination without access to the full training set, measuring the extent of contamination with perplexity. Our analysis provides evidence of significant memorisation in recent foundation models on popular reading comprehension and summarisation benchmarks, while multiple-choice benchmarks appear less contaminated.
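
The measurement the abstract describes lends itself to a compact illustration: score the model's perplexity on a benchmark instance and on comparable text the model cannot have seen, and read a markedly lower benchmark perplexity as a sign of memorisation. The sketch below is a minimal illustration of this idea, not the paper's exact protocol; it assumes a HuggingFace causal LM, and the model name ("gpt2") and both input strings are hypothetical stand-ins.

    # Minimal sketch of perplexity-based contamination probing.
    # Placeholders: the model name and the two example strings stand in
    # for the paper's actual models and benchmark data.
    import math

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder; any HuggingFace causal LM works
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    @torch.no_grad()
    def perplexity(text: str) -> float:
        """Token-level perplexity of `text` under the model."""
        ids = tokenizer(text, return_tensors="pt").input_ids
        # Passing the inputs as labels makes the model compute the mean
        # next-token cross-entropy (it shifts the labels internally).
        loss = model(ids, labels=ids).loss
        return math.exp(loss.item())

    # Stand-ins: a benchmark test instance vs. comparable text the model
    # cannot have seen (e.g. written after its training cutoff).
    ppl_benchmark = perplexity("Question: is the sky blue? Answer: yes.")
    ppl_fresh = perplexity("A freshly written passage of similar style.")

    # Markedly lower perplexity on benchmark text than on comparable fresh
    # text suggests the benchmark leaked into the training corpus.
    print(f"benchmark ppl: {ppl_benchmark:.2f}, fresh ppl: {ppl_fresh:.2f}")

In practice one would aggregate such per-sample scores across an entire benchmark rather than compare individual strings; the snippet only shows the basic measurement.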

Authors (1)
  1. Yucheng Li (31 papers)
Citations (23)
