Verify GPQA Diamond training exposure of gpt-4o-mini and gpt-4o

Determine whether OpenAI’s gpt-4o-mini and gpt-4o were exposed during training to GPQA Diamond multiple-choice questions in their default, unshuffled form. The answer bears on whether the models exhibit positional-answer bias and on how to interpret the reported performance gap between shuffled and unshuffled evaluations.

Background

The paper observes a marked performance differential between shuffled and unshuffled GPQA Diamond evaluations, suggesting possible positional-answer biases. Interpreting this difference depends on understanding whether the evaluated models may have encountered GPQA Diamond items in their original unshuffled configuration during training.

Because OpenAI’s models are proprietary, the authors cannot verify what data the models were trained on. Resolving this uncertainty would clarify the source of the position-dependent effects and strengthen the validity of the comparative results reported for AGoT.
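
The shuffled-versus-unshuffled comparison can be made concrete with the short Python sketch below. It is illustrative only and is not the paper's evaluation code: the item layout and the names `shuffle_item`, `accuracy`, `positional_gap`, and `predict` are hypothetical, and `predict` stands in for whatever model call returns the index of the chosen option.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical item layout: (question text, answer options, index of the correct option).
Item = Tuple[str, List[str], int]
Predictor = Callable[[str, List[str]], int]


def shuffle_item(item: Item, rng: random.Random) -> Item:
    """Return a copy of the item with its options permuted and the answer index remapped."""
    question, options, correct = item
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # The correct option now sits wherever its old index landed in the permutation.
    return question, shuffled, order.index(correct)


def accuracy(items: List[Item], predict: Predictor) -> float:
    """Fraction of items for which predict() returns the correct option index."""
    if not items:
        return 0.0
    hits = sum(predict(q, opts) == correct for q, opts, correct in items)
    return hits / len(items)


def positional_gap(items: List[Item], predict: Predictor, seed: int = 0) -> float:
    """Unshuffled accuracy minus shuffled accuracy for the same predictor."""
    rng = random.Random(seed)  # seeded so the shuffled run is reproducible
    shuffled = [shuffle_item(item, rng) for item in items]
    return accuracy(items, predict) - accuracy(shuffled, predict)
```

Under these assumptions, a predictor that always picks the first option scores perfectly on items whose correct answer is listed first and, in expectation, only at chance after shuffling, whereas a predictor that keys on option content shows no gap. A nonzero gap indicates position dependence but does not by itself show that the items were seen during training.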

References

However, due to the proprietary nature of the model, we can not verify conclusively whether gpt-4o-mini or gpt-4o was exposed to GPQA Diamond questions (in their default, unshuffled state) during training.