
Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation (2406.13990v2)

Published 20 Jun 2024 in cs.CL

Abstract: The training process of LLMs often involves varying degrees of test data contamination. Although current LLMs achieve increasingly strong results on various benchmarks, their performance in practical applications does not always match their benchmark scores. Benchmark leakage prevents an accurate assessment of an LLM's true capability, yet constructing new benchmarks is costly and labor-intensive, and the new benchmarks still carry the risk of leakage. Therefore, in this paper, we ask: can we reuse these leaked benchmarks for LLM evaluation? We propose Inference-Time Decontamination (ITD), which addresses this issue by detecting and rewriting leaked samples without altering their difficulty. ITD mitigates the performance inflation caused by memorization of leaked benchmarks. Our proof-of-concept experiments demonstrate that ITD reduces inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU; on MMLU, ITD lowers the results of Phi3 and Mistral by 6.7% and 3.6%, respectively. We hope that ITD can provide more truthful evaluation results for LLMs.
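The abstract describes ITD only at a high level: detect samples the model has likely memorized, then rewrite them before evaluation. Below is a minimal, hypothetical Python sketch of what such a detect-then-rewrite pipeline could look like. The half-prompt completion heuristic, the n-gram overlap threshold, and the helper callables `continue_text` and `rewrite_sample` are all illustrative assumptions for this sketch, not the paper's actual method.

```python
# Hypothetical sketch of an ITD-style detect-then-rewrite pipeline.
# The detection heuristic and thresholds are assumptions, not the paper's method.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sample:
    question: str
    answer: str

def ngram_overlap(text_a: str, text_b: str, n: int = 5) -> float:
    """Fraction of text_a's word n-grams that also appear in text_b."""
    def ngrams(s: str) -> set:
        toks = s.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    a, b = ngrams(text_a), ngrams(text_b)
    return len(a & b) / len(a) if a else 0.0

def decontaminate(
    benchmark: List[Sample],
    continue_text: Callable[[str], str],         # model continues a truncated prompt
    rewrite_sample: Callable[[Sample], Sample],  # rewriter that preserves difficulty
    threshold: float = 0.5,                      # illustrative leak-detection cutoff
) -> List[Sample]:
    """Flag likely-leaked samples and replace them with rewrites."""
    cleaned = []
    for s in benchmark:
        # Detection heuristic (assumed): show the model the first half of the
        # question and check whether its continuation reproduces the second half.
        mid = len(s.question) // 2
        head, tail = s.question[:mid], s.question[mid:]
        completion = continue_text(head)
        leaked = ngram_overlap(tail, completion) >= threshold
        cleaned.append(rewrite_sample(s) if leaked else s)
    return cleaned

if __name__ == "__main__":
    # Toy demo with stubs standing in for real models.
    demo = [Sample("If a train leaves the station at 9 a.m. traveling at 60 "
                   "miles per hour, how far has it traveled by noon?", "180 miles")]
    echo = lambda head: demo[0].question[len(head):]  # simulates a memorized sample
    paraphrase = lambda s: Sample("A train departs at 9 a.m. at 60 mph; how many "
                                  "miles does it cover by 12 p.m.?", s.answer)
    print(decontaminate(demo, echo, paraphrase))
```

In a real setting, `continue_text` would wrap greedy decoding from the model under evaluation, and `rewrite_sample` would prompt a separate LLM to paraphrase the question while keeping its answer and difficulty fixed, which is the property ITD requires of its rewrites.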

Authors (8)
  1. Qin Zhu (11 papers)
  2. Qingyuan Cheng (2 papers)
  3. Runyu Peng (4 papers)
  4. Xiaonan Li (48 papers)
  5. Tengxiao Liu (7 papers)
  6. Ru Peng (12 papers)
  7. Xipeng Qiu (257 papers)
  8. Xuanjing Huang (287 papers)
Citations (1)