Determine data contamination on HumanEval and MBPP for large code generation models
Determine whether large language models for code generation are affected by training-data contamination with respect to the HumanEval and MBPP benchmarks; specifically, whether either benchmark's problems or solutions appear in the models' pretraining or fine-tuning corpora, and, if so, quantify the extent of contamination so that evaluation on these benchmarks remains fair and reliable.
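As an illustration of what quantifying contamination might involve when a training corpus is available for inspection, the sketch below measures surface-level n-gram overlap between a benchmark item and corpus documents. This is a minimal, assumed approach, not a method from the cited paper: the function names, whitespace tokenization, and the 13-gram window are illustrative choices, and for closed models with undisclosed training data such a check is not directly applicable.

```python
from typing import Iterable


def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Return the set of n-grams for a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_score(benchmark_text: str,
                        corpus_docs: Iterable[str],
                        n: int = 13) -> float:
    """Fraction of a benchmark item's n-grams that also occur in the corpus.

    A score near 1.0 suggests the problem (or a reference solution) was
    seen nearly verbatim during training; a score near 0.0 suggests little
    surface overlap. Hypothetical helper for illustration only.
    """
    bench_ngrams = ngrams(benchmark_text.split(), n)
    if not bench_ngrams:
        return 0.0
    corpus_ngrams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc.split(), n)
    hits = sum(1 for g in bench_ngrams if g in corpus_ngrams)
    return hits / len(bench_ngrams)
```

A per-benchmark contamination estimate could then be obtained by averaging this score over all HumanEval or MBPP problems and their canonical solutions.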
References
However, due to limited public knowledge about model training data, it is unclear whether models suffer from data contamination on HumanEval and MBPP~\citep{jain2024livecodebench}.
— CodeRAG-Bench: Can Retrieval Augment Code Generation?
(2406.14497 - Wang et al., 20 Jun 2024) in Section 2.1 (Programming Problems: Basic programming problems)