Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models (2311.06233v6)

Published 10 Nov 2023 in cs.CL, cs.AI, and cs.LG

Abstract: We propose the Data Contamination Quiz (DCQ), a simple and effective approach to detect data contamination in LLMs and estimate its extent. Specifically, we frame data contamination detection as a series of multiple-choice questions and devise a quiz format in which three perturbed versions of each subsampled instance from a specific dataset partition (e.g., the GSM8k test set) are created. These perturbations involve only word-level changes. The perturbed versions, along with the original dataset instance, form the options in the DCQ, with an extra option accommodating the possibility that none of the provided options is correct. Given that the only distinguishing signal among the options is the exact wording relative to the original dataset instance, an LLM tasked with identifying the original instance gravitates toward the verbatim one if it was exposed to it during pre-training -- a trait intrinsic to LLMs. After accounting for positional biases in LLMs, quiz performance reveals the contamination level of the examined model with respect to the dataset partition the quiz covers. Applied to various datasets with GPT-4 and GPT-3.5, and without any access to pre-training data or model parameters, our findings suggest that DCQ achieves state-of-the-art results, uncovers greater contamination/memorization levels than existing methods, and more effectively bypasses safety filters, especially those designed to avoid generating copyrighted content.
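For concreteness, the sketch below shows one way a single DCQ item might be assembled and a contamination estimate computed from the model's picks. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the function names, prompt wording, and the GSM8k-style example instance are ours, and the paper's exact prompt format and positional-bias correction may differ (here the correct option's position is simply shuffled across items).

```python
import random


def build_dcq_item(original: str, perturbations: list[str], seed: int = 0):
    """Assemble one DCQ item: the original instance, three word-level
    perturbations, and a 'none of the provided options' choice."""
    rng = random.Random(seed)
    options = [original] + list(perturbations)  # four text options
    rng.shuffle(options)  # vary the original's position across items
    labels = ["A", "B", "C", "D"]
    body = "\n".join(f"{lab}. {opt}" for lab, opt in zip(labels, options))
    prompt = (
        "Exactly one of the following options is a verbatim instance "
        "from the dataset. Which one is it?\n"
        f"{body}\nE. None of the provided options."
    )
    gold = labels[options.index(original)]
    return prompt, gold


def estimate_contamination(picks: list[str], golds: list[str]) -> float:
    """Fraction of items where the model selected the verbatim original;
    the paper additionally corrects for LLM positional bias."""
    return sum(p == g for p, g in zip(picks, golds)) / len(golds)


if __name__ == "__main__":
    # Illustrative GSM8k-style instance; word-level perturbations swap or
    # reorder individual words while preserving the overall sentence.
    original = ("Natalia sold clips to 48 of her friends in April, and then "
                "she sold half as many clips in May.")
    perturbed = [
        "Natalia sold clips to 48 of her classmates in April, and then "
        "she sold half as many clips in May.",
        "Natalia sold pins to 48 of her friends in April, and then "
        "she sold half as many pins in May.",
        "Natalia sold clips to 48 of her friends in April, and later "
        "she sold half as many clips in May.",
    ]
    prompt, gold = build_dcq_item(original, perturbed, seed=1)
    print(prompt)
    print("Gold:", gold)
```

In this sketch, a model that has memorized the partition will pick the verbatim option well above the 20% chance rate across five options, which is the signal DCQ converts into a contamination estimate.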

Authors (2)
  1. Shahriar Golchin
  2. Mihai Surdeanu