Ascertain whether GPT-4’s pretraining data includes coq-wigderson

Ascertain whether the OpenAI GPT-4 pretraining dataset includes the coq-wigderson repository ("Towards the Formal Verification of Wigderson’s Algorithm"), which was used to evaluate the Cobblestone proof-synthesis approach and whose first commit to GitHub dates to March 2022, after GPT-4’s publicly stated pretraining cutoff of September 2021.

Background

To mitigate the risk of test data leakage when evaluating LLM-based tools, the authors supplement the CoqGym benchmark with coq-wigderson, which entered GitHub in March 2022, after GPT-4’s documented pretraining cutoff (September 2021). This choice aims to reduce the likelihood that GPT-4 saw the evaluation data during pretraining, which would inflate results through memorization.
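
The timeline claim itself can be checked directly by querying the GitHub API for the repository's oldest commit and comparing its date against the stated cutoff. The sketch below is illustrative and not part of the Cobblestone evaluation; the owner path for coq-wigderson and the exact cutoff timestamp are placeholder assumptions.

```python
# Sketch (illustrative, not part of the Cobblestone evaluation): query the GitHub
# API for the oldest commit of coq-wigderson and compare it to GPT-4's stated
# pretraining cutoff. OWNER_REPO is a hypothetical placeholder path.
import re
import requests

OWNER_REPO = "OWNER/coq-wigderson"   # assumption: replace with the actual repository path
CUTOFF = "2021-09-30T23:59:59Z"      # end of September 2021, GPT-4's publicly stated cutoff

def oldest_commit_date(owner_repo: str) -> str:
    """Return the ISO-8601 committer date of the repository's oldest commit."""
    url = f"https://api.github.com/repos/{owner_repo}/commits"
    resp = requests.get(url, params={"per_page": 1})
    resp.raise_for_status()
    # With per_page=1, the Link header's rel="last" page holds the oldest commit.
    match = re.search(r'[?&]page=(\d+)>; rel="last"', resp.headers.get("Link", ""))
    if match:
        resp = requests.get(url, params={"per_page": 1, "page": int(match.group(1))})
        resp.raise_for_status()
    return resp.json()[-1]["commit"]["committer"]["date"]

if __name__ == "__main__":
    first = oldest_commit_date(OWNER_REPO)
    verdict = "postdates" if first > CUTOFF else "predates"
    print(f"Oldest commit {first} {verdict} the cutoff {CUTOFF}")
```

Note that the oldest commit date can differ from the repository's created_at field if history was written locally and pushed later, so both are worth checking when verifying when the code first became publicly scrapable.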

However, because GPT-4’s pretraining corpus is not publicly disclosed, the authors acknowledge residual uncertainty about whether coq-wigderson was nevertheless included in it. Determining whether coq-wigderson is present in GPT-4’s pretraining data is necessary to conclusively rule out leakage effects in the evaluation.
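
Short of access to the corpus itself, only indirect evidence is available, for example a memorization probe: prompt GPT-4 with the opening of a coq-wigderson source file at temperature 0 and measure how much of the known continuation it reproduces verbatim. The sketch below is such a heuristic, not the authors' method; the file name and split point are illustrative assumptions.

```python
# Sketch of a memorization probe (a heuristic, not the authors' method): ask GPT-4
# to continue the opening of a coq-wigderson source file at temperature 0 and
# measure verbatim overlap with the known continuation. The file name and split
# point are illustrative assumptions.
import difflib
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def verbatim_overlap(prefix: str, reference: str, model: str = "gpt-4") -> float:
    """Return a 0..1 similarity between the model's continuation and the real one."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": "Continue this Coq source file exactly as it appears."},
            {"role": "user", "content": prefix},
        ],
    )
    completion = resp.choices[0].message.content or ""
    return difflib.SequenceMatcher(None, completion, reference).ratio()

if __name__ == "__main__":
    source = open("wigderson.v").read()   # hypothetical file from coq-wigderson
    half = len(source) // 2
    score = verbatim_overlap(source[:half], source[half:])
    print(f"verbatim overlap with the original continuation: {score:.2f}")
```

High overlap would suggest the file was seen during pretraining; low overlap is only weak evidence of absence, since a model can be trained on data it cannot reproduce verbatim.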

References

"Still, we cannot know for certain that coq-wigderson is not in the GPT-4 pretraining data."

Cobblestone: Iterative Automation for Formal Verification (Kasibatla et al., 2024, arXiv:2410.19940), Section 4.6, Threats to Validity (near end of paper).