Determining LLM training-data exposure to evaluation datasets

Ascertain whether GPT-3.5-turbo, PaLM2 (chat-bison-001), and Falcon7b-instruct were exposed, during pretraining, fine-tuning, or reinforcement learning from human feedback, to the HOT toxicity dataset, the SST-5 sentiment dataset, the RumorEval dataset, or the GVFC news frame dataset used in this study, so that the models can be evaluated rigorously without data leakage.

Background

Rigorous evaluation of LLMs is complicated by the possibility that their training data may include commonly used benchmark datasets. If exposure occurred, performance estimates could be inflated or confounded, undermining the validity of empirical comparisons and conclusions.

The authors explicitly state that they cannot determine whether the models they evaluated were trained on or otherwise exposed to the datasets used in the experiments, highlighting a critical uncertainty that affects reproducibility and evaluation integrity.
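One practical, though imperfect, way to probe for such exposure is a guided-completion check: prompt the model with the opening portion of a verbatim test instance and measure how closely its continuation matches the held-out remainder, since near-verbatim reproduction suggests memorization. The sketch below is illustrative only and is not the authors' method; `query_model` is a hypothetical stand-in for whichever API call (GPT-3.5-turbo, chat-bison-001, or Falcon7b-instruct) is being audited, and the split ratio and similarity threshold are arbitrary assumptions.

```python
# Minimal sketch of a guided-completion contamination probe (illustrative assumption,
# not the method used in the paper).
from difflib import SequenceMatcher
from typing import Callable, Iterable


def completion_overlap(query_model: Callable[[str], str],
                       texts: Iterable[str],
                       split_ratio: float = 0.5,
                       threshold: float = 0.8) -> float:
    """Return the fraction of instances whose true continuation the model nearly reproduces."""
    flagged = 0
    total = 0
    for text in texts:
        words = text.split()
        if len(words) < 10:  # skip instances too short to be informative
            continue
        cut = int(len(words) * split_ratio)
        prefix = " ".join(words[:cut])        # shown to the model
        true_tail = " ".join(words[cut:])     # held out for comparison
        generated = query_model(prefix)       # hypothetical call to the model under audit
        similarity = SequenceMatcher(None, true_tail.lower(),
                                     generated.lower()).ratio()
        flagged += similarity >= threshold
        total += 1
    return flagged / total if total else 0.0


if __name__ == "__main__":
    # Trivial stub; replace with a real API call to the model being audited.
    stub = lambda prompt: "placeholder continuation"
    rate = completion_overlap(
        stub,
        ["an example SST-5 style review sentence that is long enough to split"],
    )
    print(f"near-verbatim completion rate: {rate:.2f}")
```

A high completion-overlap rate would only suggest, not prove, exposure, and a low rate does not rule it out, which is why the question remains open for the models and datasets studied here.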

References

Conducting rigorous evaluation of LLMs is challenging because we cannot determine whether these models have been exposed to our chosen datasets during their training phases, particularly popular datasets like SST-5.

Prompt Design Matters for Computational Social Science Tasks but in Unpredictable Ways (2406.11980 - Atreja et al., 17 Jun 2024) in Section 6 (Limitations)