Determine impact of benchmark contamination on LLM performance claims
Determine whether the reported performance gains of the best-performing large language models (LLMs) on widely used NLP benchmarks are genuine, or instead reflect training-data contamination: memorization arising from the inclusion of benchmark data in the training corpus, which cannot be ruled out given the lack of transparency about training datasets. Establish methods to detect and quantify such contamination, and to measure its effect on evaluation outcomes, so that benchmark comparisons and the conclusions drawn from them remain valid.
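As a rough illustration (not from the cited paper), the sketch below shows two common styles of contamination probe: a white-box n-gram-overlap check, usable only when the training corpus is accessible, and a black-box verbatim-completion probe for the opaque-training-data setting the problem statement highlights. All names (`word_ngrams`, `overlap_rate`, `verbatim_completion_rate`, the `generate` callable) are hypothetical, and the 13-gram window is an assumption borrowed from published contamination audits (e.g., GPT-3's).

```python
from typing import Callable, Iterable, Set, Tuple


def word_ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Word-level n-grams of `text`; 13-grams are a common audit window."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def overlap_rate(examples: Iterable[str],
                 corpus_ngrams: Set[Tuple[str, ...]],
                 n: int = 13) -> float:
    """White-box check: fraction of benchmark examples sharing at least
    one n-gram with a known training corpus (an upper bound on how much
    of the benchmark the model may have seen verbatim)."""
    ex = list(examples)
    hits = sum(1 for e in ex if word_ngrams(e, n) & corpus_ngrams)
    return hits / max(len(ex), 1)


def verbatim_completion_rate(examples: Iterable[str],
                             generate: Callable[[str], str],
                             split: float = 0.5) -> float:
    """Black-box probe for opaque training data: prompt the model with the
    first half of each example and count exact continuations, a signal of
    memorization rather than generalization."""
    probed = hits = 0
    for e in examples:
        toks = e.split()
        if len(toks) < 4:  # too short to split into prompt/target
            continue
        cut = max(1, int(len(toks) * split))
        prompt, target = " ".join(toks[:cut]), " ".join(toks[cut:])
        probed += 1
        if generate(prompt).strip().startswith(target):
            hits += 1
    return hits / probed if probed else 0.0
```

Neither probe is conclusive on its own: n-gram overlap misses paraphrased contamination, and a low verbatim-completion rate does not rule out memorization that surfaces only as inflated benchmark accuracy, so such checks are typically combined with held-out or freshly constructed test sets.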
References
The lack of training data transparency associated with some of the best-performing LLMs means that we cannot be certain whether some of the performance gains are due to the memorisation of benchmarks being in the training datasets.
— Forecasting Credit Ratings: A Case Study where Traditional Methods Outperform Generative LLMs
(Drinkall et al., arXiv:2407.17624, 24 Jul 2024), Section 1, Introduction