Training on the Benchmark Is Not All You Need (2409.01790v1)
Abstract: The success of LLMs relies heavily on the vast amount of data learned during pre-training. Because the pre-training process and its data are opaque, the results of many benchmark evaluations become unreliable: if a model has been trained on a benchmark's test set, its scores no longer reflect genuine capability, which seriously harms the health of the field. To test the capabilities of LLMs automatically and efficiently, many mainstream benchmarks adopt a multiple-choice format. Since swapping the contents of the options does not change the meaning of the question itself, we propose a simple and effective data leakage detection method based on this property. Specifically, we shuffle the contents of the options to generate derived datasets and then detect leakage from the model's log-probability distribution over these derived datasets: if the set of log probabilities contains a maximal outlier, the data point is flagged as leaked. Our method works under black-box conditions, without access to model weights or training data, and effectively identifies leakage of benchmark test sets into pre-training data, both in ordinary scenarios and in more complex ones where options may have been shuffled intentionally or unintentionally. Experiments based on two LLMs and designed benchmarks demonstrate the effectiveness of our method. In addition, we evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets, rank the leaked LLMs for each benchmark, and find that the Qwen family of LLMs shows the highest degree of data leakage.
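The detection idea described above can be sketched in a few lines: permute the contents of a multiple-choice item's options, score every permutation with the model's log probability, and flag the item if one ordering stands out as a maximal outlier. Below is a minimal illustrative sketch using a Hugging Face causal LM; the model name, the prompt format, and the z-score outlier test are assumptions made for illustration, not the paper's exact setup.

```python
import itertools
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; substitute any open-source causal LM under test.
MODEL_NAME = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def sequence_logprob(text: str) -> float:
    """Total log probability the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean per-token negative log-likelihood over the shifted
    # targets, so multiply back by the number of predicted tokens.
    return -out.loss.item() * (ids.shape[1] - 1)


def render(question: str, options: list[str]) -> str:
    """Format one multiple-choice item with the options in a given order."""
    lines = [question]
    for i, opt in enumerate(options):
        lines.append(f"{chr(ord('A') + i)}. {opt}")
    return "\n".join(lines)


def looks_leaked(question: str, options: list[str], z_threshold: float = 2.0) -> bool:
    """Score every permutation of the option contents and flag the item if the
    best-scoring ordering is a clear outlier (a simple z-score test, used here
    as an illustrative stand-in for the paper's outlier criterion)."""
    logps = [
        sequence_logprob(render(question, list(perm)))
        for perm in itertools.permutations(options)
    ]
    mean = sum(logps) / len(logps)
    std = math.sqrt(sum((x - mean) ** 2 for x in logps) / len(logps))
    return std > 0 and (max(logps) - mean) / std > z_threshold
```

For a standard four-option item this scores 24 permutations, so the check remains cheap enough to run over an entire benchmark.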
Authors: Shiwen Ni, Xiangtao Kong, Chengming Li, Xiping Hu, Ruifeng Xu, Jia Zhu, Min Yang