Publication of LLM Benchmarks with Concealed Ground-Truth Answers
The paper by Ishida et al. introduces an approach for publishing benchmarks that evaluate LLMs while guarding against the data contamination that would distort future assessments. The central concern is that published benchmarks leak into the training sets of later LLMs, undermining their value as evaluation tools. The common workaround of keeping benchmarks private brings its own problems: reliance on a centralized trusted party, administrative burden, and the risk of test-set overfitting through repeated queries.
Proposed Approach
The authors propose to randomize the answers to benchmark questions: for each question, they prepare several logically correct answers and select one of them at random as the official answer in the published benchmark. Because even a perfect model cannot know which of the correct answers was chosen, this deliberately lowers the Bayes accuracy, the best accuracy any model can achieve. The lowered Bayes accuracy then doubles as a contamination detector: a fully capable but uncontaminated model should not exceed it, so a score above it is a strong indication that the benchmark leaked into training data.
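To make the construction concrete, here is a minimal Python sketch of how a randomized answer key and its Bayes accuracy could be produced. The toy question bank, the field names, and the publish_benchmark helper are illustrative assumptions rather than the authors' code; the sketch assumes the official answer is drawn uniformly at random, so a clean model can match a question with k equally valid answers with probability at most 1/k.

```python
import random

# Hypothetical question bank (not from the paper): each entry lists several
# logically correct answers for the same question.
QUESTIONS = [
    {"prompt": "Name a prime number between 10 and 20.",
     "correct_options": ["11", "13", "17", "19"]},
    {"prompt": "Give a root of x^2 - 5x + 6 = 0.",
     "correct_options": ["2", "3"]},
]

def publish_benchmark(questions, seed=0):
    """Pick one correct option per question, uniformly at random, as the
    official answer, and report the resulting Bayes accuracy.

    Only the prompts and the chosen official answers would be released;
    the full option lists (and the seed) can stay with the maintainer.
    """
    rng = random.Random(seed)
    published, per_question_bayes = [], []
    for q in questions:
        official = rng.choice(q["correct_options"])
        published.append({"prompt": q["prompt"], "answer": official})
        # A clean model cannot know which correct option was drawn, so it
        # matches the official answer with probability at most 1/k.
        per_question_bayes.append(1.0 / len(q["correct_options"]))
    bayes_accuracy = sum(per_question_bayes) / len(per_question_bayes)
    return published, bayes_accuracy

if __name__ == "__main__":
    benchmark, bayes_acc = publish_benchmark(QUESTIONS)
    print(f"Published {len(benchmark)} questions; Bayes accuracy = {bayes_acc:.2f}")
```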
Experimental Evidence and Theoretical Foundation
The paper combines theoretical analysis with experiments on benchmark datasets spanning math, science, and logical reasoning to show that the method detects data contamination reliably. The experiments cover a range of LLMs, including models from the Llama and Qwen families, suggesting the approach is robust across different architectures and training pipelines.
The experiments also show that the modified benchmarks continue to track improvements in LLMs over time, even though the true answers are never released. This is pivotal: the benchmark retains its primary role of charting advances in model capability while adding practical contamination detection.
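One simple way to make "exceeding the Bayes accuracy" testable is a one-sided binomial test on the number of questions where a model's answer matches the randomized key. The sketch below is an assumption-laden illustration rather than the paper's exact statistical procedure, and it uses a single success probability for all questions, which is a simplification when questions differ in their number of correct options.

```python
from scipy.stats import binomtest

def flag_contamination(num_matches, num_questions, bayes_accuracy, alpha=0.01):
    """One-sided binomial test: is the model's match rate with the randomized
    answer key significantly above the Bayes accuracy?

    Treats every question as an independent trial whose success probability
    is at most `bayes_accuracy` for an uncontaminated model (a simplification
    when questions have different numbers of correct options).
    """
    result = binomtest(num_matches, num_questions, bayes_accuracy,
                       alternative="greater")
    return result.pvalue < alpha, result.pvalue

# Example: on a 200-question benchmark with Bayes accuracy 0.5, matching the
# randomized key 130 times is very unlikely for a clean model.
flagged, p_value = flag_contamination(130, 200, 0.5)
print(f"contamination flagged: {flagged} (p = {p_value:.2e})")
```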
Practical Implications and Future Directions
The implications are substantial. By mitigating contamination risk, dataset creators, model developers, and AI researchers can preserve benchmarks as gold standards for evaluating LLMs. Because the method does not require disclosing the ground-truth answers, benchmarks can also be shared more widely without compromising their integrity.
For future work, the paper suggests extending the methodology to broader classes of tasks, since it currently applies only to question types for which several logically correct answers can be prepared. Another open direction is how to tune the reduction in Bayes accuracy so that contamination remains detectable without degrading the benchmark's ability to track LLM performance.
By providing a principled framework for publishing benchmarks and detecting contamination, the paper marks a significant step forward for empirical AI evaluation, helping to preserve the high-quality benchmarks needed to drive continued progress in AI.