
Assess LLM2Vec on Benchmarks Free of Pre-Training Contamination

Investigate the performance of LLM2Vec-transformed decoder-only large language models on newly designed evaluation benchmarks that are guaranteed not to overlap with the models’ pre-training corpora, in order to quantify and mitigate potential test set contamination effects.


Background

The authors note that publicly available evaluation datasets may overlap with the pre-training data of models such as LLaMA-2-7B and Mistral-7B, which could confound evaluation results.

Because complete details of the pre-training data are not public, the extent of potential contamination is uncertain. The paper explicitly calls for evaluation on newly designed benchmarks excluded from pre-training corpora.
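
One way such contamination could be quantified, though this is not a procedure described in the paper, is an n-gram overlap check between a candidate benchmark and a sample of the pre-training corpus. The sketch below is a minimal illustration assuming both are available as plain-text strings; the 13-token n-gram length, the whitespace tokenization, and all function names are illustrative choices, not the authors' method.

# Minimal contamination-check sketch (not from the paper). Assumes the candidate
# benchmark and a sample of the pre-training corpus are plain-text strings;
# 13-token n-gram overlap is a common heuristic, used here only for illustration.

from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Set of lowercase, whitespace-tokenized n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark: Iterable[str],
                       corpus_sample: Iterable[str],
                       n: int = 13) -> float:
    """Fraction of benchmark examples sharing at least one n-gram with the corpus sample."""
    examples = list(benchmark)
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in corpus_sample:
        corpus_ngrams |= ngrams(doc, n)
    flagged = sum(1 for ex in examples if ngrams(ex, n) & corpus_ngrams)
    return flagged / len(examples) if examples else 0.0


if __name__ == "__main__":
    # Toy data: the first benchmark example is "contaminated", the second is not.
    benchmark = [
        "The quick brown fox jumps over the lazy dog near the old river bank today.",
        "An entirely novel sentence that is unlikely to appear in any web crawl.",
    ]
    corpus_sample = [
        "... the quick brown fox jumps over the lazy dog near the old river bank today ...",
    ]
    print(f"Estimated contamination rate: {contamination_rate(benchmark, corpus_sample):.0%}")

On the toy data above, the script reports a 50% contamination rate, since one of the two benchmark examples shares a 13-gram with the corpus sample. A newly designed benchmark would ideally show a rate near zero against any available slice of the pre-training data.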

References

"We leave it to future work to investigate the performance of these models on newly designed benchmarks that are not part of their pre-training data."

BehnamGhader et al., "LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders," arXiv:2404.05961, 9 Apr 2024; Appendix, Section "Limitations" (Data contamination from pre-training).