Determine impact of benchmark contamination on LLM performance claims

Determine whether the reported performance gains of the best-performing large language models on widely used NLP benchmarks stem from training-data contamination, i.e., memorisation of benchmark data included in the training corpus, given the lack of transparency about training datasets. Establish methods to detect and quantify such contamination and its effect on evaluation outcomes, so that comparisons and conclusions remain valid.
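Detection methods of the kind called for here often start from surface-level overlap statistics. Below is a minimal sketch of word-level n-gram overlap between benchmark items and a training corpus; the function name `contamination_rate`, the n-gram length of 8, and the 0.5 flagging threshold are illustrative assumptions, not a method described in the paper.

```python
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_items: Iterable[str],
                       training_corpus: Iterable[str],
                       n: int = 8,
                       threshold: float = 0.5) -> float:
    """Fraction of benchmark items whose n-gram overlap with the
    training corpus meets or exceeds `threshold` (flagged as
    likely contaminated)."""
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in training_corpus:
        corpus_ngrams |= ngrams(doc, n)
    items = list(benchmark_items)
    flagged = 0
    for item in items:
        item_ngrams = ngrams(item, n)
        if not item_ngrams:
            continue
        overlap = len(item_ngrams & corpus_ngrams) / len(item_ngrams)
        if overlap >= threshold:
            flagged += 1
    return flagged / len(items) if items else 0.0
```

In practice such exact-match checks only bound contamination from below: paraphrased or tokenizer-normalized duplicates evade them, which is one reason the problem statement also calls for quantifying the *effect* of contamination on evaluation outcomes rather than just detecting overlap.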

Background

The paper cautions that some high-performing LLMs are trained on undisclosed datasets, which raises the possibility that benchmark tasks may have been seen during training. This undermines the validity of performance comparisons and scientific conclusions, especially in sensitive applications like finance where reliability is critical.

The authors frame this uncertainty as a broader issue in LLM evaluation, citing recent work on data contamination and evaluation malpractices. They emphasize that without transparency about training data, improvements cannot be confidently attributed to genuine generalization rather than memorisation.

References

The lack of training data transparency associated with some of the best-performing LLMs means that we cannot be certain whether some of the performance gains are due to the memorisation of benchmarks being in the training datasets.

Forecasting Credit Ratings: A Case Study where Traditional Methods Outperform Generative LLMs (2407.17624 - Drinkall et al., 24 Jul 2024) in Section 1 Introduction