Rethinking Benchmark and Contamination for Language Models with Rephrased Samples (2311.04850v2)

Published 8 Nov 2023 in cs.CL and cs.AI

Abstract: LLMs are increasingly trained on all the data ever produced by humans. Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets. While most data decontamination efforts apply string matching (e.g., n-gram overlap) to remove benchmark data, we show that these methods are insufficient, and simple variations of test data (e.g., paraphrasing, translation) can easily bypass these decontamination measures. Furthermore, we demonstrate that if such variation of test data is not eliminated, a 13B model can easily overfit a test benchmark and achieve drastically high performance, on par with GPT-4. We validate such observations in widely used benchmarks such as MMLU, GSM8k, and HumanEval. To address this growing risk, we propose a stronger LLM-based decontamination method and apply it to widely used pre-training and fine-tuning datasets, revealing significant previously unknown test overlap. For example, in pre-training sets such as RedPajama-Data-1T and StarCoder-Data, we identified that 8-18% of the HumanEval benchmark overlaps. Interestingly, we also find such contamination in synthetic datasets generated by GPT-3.5/4, suggesting a potential risk of unintentional contamination. We urge the community to adopt stronger decontamination approaches when using public benchmarks. Moreover, we call for the community to actively develop fresh one-time exams to evaluate models accurately. Our decontamination tool is publicly available at https://github.com/lm-sys/LLM-decontaminator.

Insights on "Rethinking Benchmark and Contamination for LLMs with Rephrased Samples"

Overview

The paper entitled "Rethinking Benchmark and Contamination for LLMs with Rephrased Samples" presents a critical analysis of benchmark contamination in LLMs and proposes a new method of decontamination. It underscores the inadequacies of existing decontamination approaches and introduces "rephrased samples" as a challenging contamination type. The paper illuminates the ease with which models can overfit test benchmarks when rephrased samples are included in training datasets, thus inflating performance scores.

Key Findings and Contributions

LLMs are commonly trained on extensive datasets that may inadvertently include benchmark test samples. Traditional decontamination strategies, such as n-gram overlap, fall short in identifying cleverly rephrased versions of these test samples, termed "rephrased samples," which preserve the original semantics while evading detection (a minimal illustration of this gap follows the list below). The primary contributions and findings are:

  • Rephrased Sample Problem: The research demonstrates that when test-set variations remain in training data, a 13B-parameter model can achieve performance akin to GPT-4. For instance, Llama-2-13B, fine-tuned on such data, achieves 85.9% on MMLU and 81.1% on HumanEval.
  • LLM Decontaminator: The authors propose a novel LLM-based decontaminator that uses a two-step process: first, embedding similarity search retrieves the training samples most similar to each test case; then a strong LLM judges whether any retrieved sample is a rephrasing of the test case. This method detects contamination more accurately than previous techniques (see the second sketch after this list).
  • Empirical Validation: Applying the decontaminator tool to well-known datasets such as RedPajama-Data-1T and CodeAlpaca revealed significant contamination levels, confirming the inefficacy of older methodologies in these contexts.
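
A minimal sketch (not the paper's tooling) of why plain n-gram overlap misses rephrased samples; the example strings and the 13-gram window below are illustrative assumptions:

```python
# Toy illustration: string-matching decontamination catches verbatim copies but not paraphrases.
def ngrams(text, n=13):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(benchmark_item, training_doc, n=13):
    """Flag contamination if any n-gram from the benchmark item appears verbatim in the training document."""
    return bool(ngrams(benchmark_item, n) & ngrams(training_doc, n))

original = "Write a function that returns the sum of the squares of the first n natural numbers."
rephrased = "Implement a routine computing the total of squared values for integers 1 through n."

print(ngram_overlap(original, original))   # True  -- verbatim copy is detected
print(ngram_overlap(original, rephrased))  # False -- semantically identical rephrasing slips through
```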

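The decontaminator itself can be sketched as the retrieve-then-judge loop described above. This is a hedged, minimal sketch rather than the authors' released code: the embedding model ("all-MiniLM-L6-v2"), the judge prompt wording, the `ask_llm` stub, and the top-k value are all illustrative assumptions.

```python
# Minimal sketch of the two-step LLM decontaminator idea (assumed configuration, not the authors' exact code).
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model could be substituted

def ask_llm(prompt: str) -> str:
    """Placeholder judge: wire this to whichever strong chat LLM you use for the verification step."""
    raise NotImplementedError

def is_contaminated(test_case: str, train_set: list[str], top_k: int = 3) -> bool:
    # Step 1: embedding-similarity search retrieves the most similar training samples.
    test_emb = embedder.encode(test_case, convert_to_tensor=True)
    train_embs = embedder.encode(train_set, convert_to_tensor=True)
    scores = util.cos_sim(test_emb, train_embs)[0]                  # shape: (len(train_set),)
    top = scores.topk(min(top_k, len(train_set))).indices
    candidates = [train_set[int(i)] for i in top]

    # Step 2: a strong LLM judges whether any candidate is a rephrasing of the test case.
    for candidate in candidates:
        verdict = ask_llm(
            "Is one of these two problems a rephrasing or translation of the other, "
            "testing the same content? Answer Yes or No.\n\n"
            f"A: {test_case}\n\nB: {candidate}"
        )
        if verdict.strip().lower().startswith("yes"):
            return True
    return False
```

A benchmark's contamination rate would then be the fraction of its test cases for which `is_contaminated` returns True against the training corpus.
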
Implications

The implications of this research are multifaceted, affecting both the theoretical understanding and practical deployment of LLMs:

  • Benchmark Reliability: By showing that prevalent benchmarks are vulnerable to subtle forms of contamination, the work calls into question many reported gains in model performance.
  • Evaluation Methods: The paper advocates for stronger decontamination techniques and the development of dynamic, one-time-use benchmarks that would provide a more accurate depiction of a model’s real-world efficacy.
  • Synthetic Data Considerations: The paper also highlights the risk of contamination within synthetic datasets generated by other LLMs (e.g., GPT-3.5/4), suggesting that such data should be rigorously validated before being used for training.

Future Directions

Looking ahead, this work opens various avenues for further investigation and development:

  • Threshold Optimization: Tuning parameters such as the similarity threshold (or top-k cutoff) used in the embedding-based search could improve contamination detection (a toy tuning sketch follows this list).
  • Automated Rephrasing Detection: Enhanced algorithms that automatically recognize and counteract rephrasing and translation, possibly using multi-lingual embeddings, could offer robust solutions.
  • Real-Time Benchmarks: Developing adaptive, real-time benchmark systems that are continually updated to prevent any form of memorization could provide a fair testing ground for novel models.
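
As a concrete, purely hypothetical illustration of the threshold-optimization direction, one could sweep candidate cosine-similarity cutoffs against a small hand-labeled set of rephrased/clean pairs and keep the value with the best F1; the data format, embedding model, and grid below are assumptions, not results from the paper.

```python
# Hypothetical threshold sweep for the embedding-similarity stage of contamination detection.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def best_threshold(labeled_pairs):
    """labeled_pairs: list of (test_case, train_sample, is_rephrase) triples with boolean labels."""
    sims = [
        float(util.cos_sim(embedder.encode(a, convert_to_tensor=True),
                           embedder.encode(b, convert_to_tensor=True)))
        for a, b, _ in labeled_pairs
    ]
    labels = [y for _, _, y in labeled_pairs]
    best_t, best_f1 = 0.0, -1.0
    for t in [x / 100 for x in range(50, 100, 5)]:   # candidate cutoffs 0.50 .. 0.95
        preds = [s >= t for s in sims]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```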

Conclusion

In sum, the paper presents a meticulous examination of the benchmark contamination issue within the expansive domain of LLMs, offering significant insights into more reliable assessment techniques. By identifying the shortcomings of current decontamination methods and introducing an innovative LLM-based tool, the authors encourage a paradigm shift towards more rigorous and trustworthy evaluation protocols in AI research.

Authors (5)
  1. Shuo Yang (244 papers)
  2. Wei-Lin Chiang (19 papers)
  3. Lianmin Zheng (34 papers)
  4. Joseph E. Gonzalez (167 papers)
  5. Ion Stoica (177 papers)
Citations (80)