ConStat: Performance-Based Contamination Detection in Large Language Models (2405.16281v1)
Abstract: Public benchmarks play an essential role in the evaluation of LLMs. However, data contamination can lead to inflated performance, rendering them unreliable for model comparison. It is therefore crucial to detect contamination and estimate its impact on measured performance. Unfortunately, existing detection methods can be easily evaded and fail to quantify contamination. To overcome these limitations, we propose a novel definition of contamination as artificially inflated and non-generalizing benchmark performance instead of the inclusion of benchmark samples in the training data. This perspective enables us to detect any model with inflated performance, i.e., performance that does not generalize to rephrased samples, synthetic samples from the same distribution, or different benchmarks for the same task. Based on this insight, we develop ConStat, a statistical method that reliably detects and quantifies contamination by comparing performance between a primary and reference benchmark relative to a set of reference models. We demonstrate the effectiveness of ConStat in an extensive evaluation of diverse model architectures, benchmarks, and contamination scenarios and find high levels of contamination in multiple popular models including Mistral, Llama, Yi, and the top-3 Open LLM Leaderboard models.
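The core idea described in the abstract is to compare a model's performance on the benchmark of interest against the performance one would expect given its results on a reference benchmark, where the expectation is calibrated on a set of reference models. The sketch below illustrates that comparison under simplifying assumptions of my own (a linear fit across reference models plus a bootstrap over them); the function name `contamination_test`, its arguments, and the linear/bootstrap choices are illustrative and are not the paper's actual estimator.

```python
import numpy as np

def contamination_test(ref_scores, primary_scores, model_ref, model_primary,
                       n_boot=10_000, seed=0):
    """Hypothetical sketch of a performance-based contamination test.

    ref_scores, primary_scores: accuracies of the reference models on the
        reference benchmark and on the primary benchmark, respectively.
    model_ref, model_primary: the evaluated model's accuracies on the same
        two benchmarks.
    Returns an estimated performance inflation (delta) and a bootstrap
    p-value for the hypothesis that the primary-benchmark performance is
    higher than its reference-benchmark performance would predict.
    """
    rng = np.random.default_rng(seed)
    ref_scores = np.asarray(ref_scores, dtype=float)
    primary_scores = np.asarray(primary_scores, dtype=float)
    n = len(ref_scores)

    deltas = np.empty(n_boot)
    for b in range(n_boot):
        # Resample the reference models to capture the uncertainty of the fit.
        idx = rng.integers(0, n, size=n)
        x, y = ref_scores[idx], primary_scores[idx]
        # Predict primary-benchmark performance from reference-benchmark
        # performance with a simple linear fit (a stand-in for whatever
        # estimator the paper actually uses).
        slope, intercept = np.polyfit(x, y, deg=1)
        predicted = slope * model_ref + intercept
        deltas[b] = model_primary - predicted

    delta_hat = float(np.mean(deltas))
    # Fraction of bootstrap fits showing no inflation; small values suggest
    # non-generalizing, i.e. potentially contaminated, performance.
    p_value = float(np.mean(deltas <= 0.0))
    return delta_hat, p_value
```

As a usage example under the same assumptions: if the reference models' scores on the primary benchmark are well predicted by their scores on a rephrased or synthetic variant, then a model whose primary-benchmark score sits far above that prediction yields a large estimated inflation and a small p-value.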
- Jasper Dekoninck
- Martin Vechev
- Mark Niklas Müller