
Determine data contamination on HumanEval and MBPP for large code generation models

Determine whether large language models for code generation suffer from training data contamination with respect to the HumanEval and MBPP benchmarks, specifically whether either benchmark’s problems or reference solutions appear in the models’ pretraining or fine-tuning corpora, and, if so, quantify the extent of contamination so that evaluations remain fair and reliable.


Background

The paper evaluates retrieval-augmented code generation across multiple datasets, including the widely used HumanEval and MBPP benchmarks for basic Python programming. Because there is limited transparency about the training data of contemporary LLMs, it is difficult to know whether these evaluation sets have leaked into model pretraining or fine-tuning corpora.

To mitigate suspected contamination risks, the authors include LiveCodeBench, a dataset constructed after the training cutoff of the considered models. However, the underlying question of whether current models are contaminated on HumanEval and MBPP remains unresolved, which affects the validity of comparative evaluations and reported progress in code generation.
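The paper does not propose a contamination audit of its own. As a purely illustrative sketch, the snippet below shows one common way such leakage could be quantified: measuring token n-gram overlap between benchmark items and a candidate training corpus. The function names, the 13-gram window, and the toy data are assumptions made here for illustration, not details from CodeRAG-Bench.

```python
# Illustrative contamination check: flag benchmark items that share at least one
# long n-gram with documents from a (hypothetical) training corpus.
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Return the set of whitespace-token n-grams in a string."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items: Iterable[str],
                       corpus_docs: Iterable[str],
                       n: int = 13) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)

    items = list(benchmark_items)
    flagged = sum(1 for item in items if ngrams(item, n) & corpus_grams)
    return flagged / max(len(items), 1)


if __name__ == "__main__":
    # Toy strings only; a real audit would stream HumanEval/MBPP prompts and
    # the model's actual pretraining corpus, which is rarely available.
    benchmark = ["def add(a, b):\n    return a + b  # HumanEval-style stub"]
    corpus = ["some unrelated web text about cooking recipes"]
    print(f"Flagged fraction: {contamination_rate(benchmark, corpus):.2%}")
```

Surface-level overlap of this kind is only a lower bound on contamination, since paraphrased or reformatted copies of benchmark problems would not be caught; it mainly illustrates what "quantifying the extent of contamination" could mean in practice.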

References

However, due to limited public knowledge about model training data, it is unclear whether models suffer from data contamination on HumanEval and MBPP~\citep{jain2024livecodebench}.

CodeRAG-Bench: Can Retrieval Augment Code Generation? (2406.14497 - Wang et al., 20 Jun 2024) in Section 2.1 (Programming Problems: Basic programming problems)