
Potential data contamination in the LeetCode Contest benchmark evaluation

Ascertain whether the released 180-problem LeetCode Contest benchmark dataset (collected July 2023–January 2024) used to evaluate DeepSeek-Coder contains any overlap with the DeepSeek-Coder pretraining corpus that would constitute data contamination, thereby verifying the integrity of the reported evaluation results.


Background

The paper introduces a LeetCode Contest benchmark consisting of 180 problems collected from July 2023 to January 2024, with 100 test cases per problem, intended to avoid overlap with the models’ pretraining data. The benchmark is used to assess real-world programming capabilities of DeepSeek-Coder and baselines.

Despite these precautions, the authors explicitly note uncertainty about possible data contamination. They observed higher scores for certain months and encourage the community to consider this potential issue when using the released dataset. This explicitly flagged uncertainty motivates checking whether the LeetCode benchmark overlaps with DeepSeek-Coder's pretraining corpus.
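One common heuristic for probing such overlap is a word-level n-gram membership check between benchmark problems and pretraining documents. The sketch below is illustrative only, not the paper's methodology; the function names, 10-gram window, and 0.5 threshold are assumptions chosen for demonstration:

```python
def ngrams(text, n=10):
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_overlap(benchmark_problems, pretrain_docs, n=10, threshold=0.5):
    """Flag benchmark problems whose fraction of n-grams also found in
    the pretraining documents meets or exceeds `threshold`.

    Returns a list of (problem_index, overlap_ratio) pairs.
    """
    # Pool all n-grams seen anywhere in the pretraining corpus.
    doc_grams = set()
    for doc in pretrain_docs:
        doc_grams |= ngrams(doc, n)

    flagged = []
    for idx, problem in enumerate(benchmark_problems):
        grams = ngrams(problem, n)
        if not grams:  # problem shorter than n tokens; skip
            continue
        ratio = len(grams & doc_grams) / len(grams)
        if ratio >= threshold:
            flagged.append((idx, ratio))
    return flagged
```

In practice, contamination audits of this kind are run at corpus scale with hashed n-grams or Bloom filters; this in-memory version only conveys the idea of the check.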

References

It is important to acknowledge that despite our diligent efforts to gather the most recent code questions for model testing, the possibility of data contamination cannot be entirely ruled out. We observed that the GPT-4-Turbo and DeepSeek-Coder models achieved higher scores in the LeetCode Contest held in July and August. We encourage the research community to consider the potential issue of data contamination when evaluating models in future studies using our released LeetCode data.

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence (2401.14196 - Guo et al., 25 Jan 2024) in Experimental Results — LeetCode Contest Benchmark subsection