Concerned with Data Contamination? Assessing Countermeasures in Code Language Models (2403.16898v2)
Abstract: Various techniques have been proposed to leverage the capabilities of code LLMs (CLMs) for SE tasks. While these techniques typically evaluate their effectiveness using publicly available datasets, such evaluations are subject to data contamination threats, where the evaluation datasets have already been used to train the CLMs under study. This can significantly undermine the reliability of the evaluation. Different countermeasures have been suggested to mitigate this threat, including using more recent data, curating new data, and refactoring existing data; yet it remains unclear whether these countermeasures actually mitigate data contamination threats to model evaluation. To fill this gap, we systematically quantify the impact of these countermeasures on CLMs' performance. To facilitate the study, we collected over 2 million Python functions with timestamps ranging from January 1st, 2018 to December 31st, 2023. Data created before a model's cut-off date is considered "contaminated data", while data to which the countermeasures have been applied is regarded as "cleansed data". We study the impact of each countermeasure by comparing CLMs' performance on contaminated data and on the cleansed data derived from it. Our experiments yield several interesting observations. For instance, CLMs do not necessarily perform worse on data created after their cut-off date; on the contrary, they sometimes perform better. In addition, refactoring did not always decrease performance; it could lead to improvements instead. Furthermore, existing metrics such as perplexity cannot distinguish contaminated from cleansed data. We hope these results and observations deepen the understanding of CLMs' capabilities and inform the community about the data contamination threat.
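As a minimal sketch of the study's basic setup, the snippet below partitions a timestamped corpus of Python functions at a model's training cut-off and compares per-token perplexity across the two partitions, the metric the abstract reports as unable to separate contaminated from cleansed data. The model name, cut-off date, and record schema here are illustrative assumptions, not the paper's exact configuration.

```python
import math
from datetime import date

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumptions for illustration: the paper evaluates several CLMs,
# each with its own training cut-off date.
CUTOFF = date(2023, 1, 1)
MODEL_NAME = "bigcode/starcoderbase"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def perplexity(code: str) -> float:
    """Per-token perplexity exp(mean negative log-likelihood) under the CLM."""
    ids = tokenizer(code, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model shift targets internally
        # and return the mean cross-entropy loss over the sequence.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())


def split_by_cutoff(functions):
    """Partition records (assumed dicts with 'code' and a `date` 'timestamp')
    into pre-cut-off ("contaminated") and post-cut-off ("cleansed") sets."""
    contaminated = [f for f in functions if f["timestamp"] < CUTOFF]
    cleansed = [f for f in functions if f["timestamp"] >= CUTOFF]
    return contaminated, cleansed


def mean_perplexity(functions):
    return sum(perplexity(f["code"]) for f in functions) / len(functions)
```

Under this framing, observing that mean perplexity on the pre-cut-off partition is indistinguishable from that on the post-cut-off partition would mirror the abstract's finding that perplexity alone is a weak contamination signal.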