When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards (2402.01781v2)

Published 1 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLM leaderboards based on benchmark rankings are regularly used to guide practitioners in model selection. Often, the published leaderboard rankings are taken at face value - we show this is a (potentially costly) mistake. Under existing leaderboards, the relative performance of LLMs is highly sensitive to (often minute) details. We show that for popular multiple-choice question benchmarks (e.g., MMLU), minor perturbations to the benchmark, such as changing the order of choices or the method of answer selection, result in changes in rankings up to 8 positions. We explain this phenomenon by conducting systematic experiments over three broad categories of benchmark perturbations and identifying the sources of this behavior. Our analysis results in several best-practice recommendations, including the advantage of a hybrid scoring method for answer selection. Our study highlights the dangers of relying on simple benchmark evaluations and charts the path for more robust evaluation schemes on the existing benchmarks. The code for this paper is available at https://github.com/National-Center-for-AI-Saudi-Arabia/lm-evaluation-harness.

When Benchmarks are Targets: Analyzing the Sensitivity of LLM Leaderboards

The paper by Norah Alzahrani et al. offers a thorough examination of the sensitivity of LLM leaderboards to minor perturbations in multiple-choice question (MCQ) benchmarks. These leaderboards frequently guide practitioners' model selection, yet they are shown to be unstable under small changes, with significant implications for both theoretical understanding and practical application.

The authors conducted a systematic series of experiments on well-known MCQ benchmarks, notably the Massive Multitask Language Understanding (MMLU) benchmark. Their experiments reveal that even trivial modifications, such as altering the order of choices or using different scoring methods, can shift model rankings by as many as eight positions, demonstrating the fragility and potential unreliability of these benchmarks.
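As a concrete illustration of this kind of perturbation, the sketch below shuffles the answer options of a single MCQ item while tracking where the gold answer ends up. The item format and the `shuffle_choices` helper are hypothetical, shown only to make the idea of a choice-order perturbation tangible; they are not the paper's actual evaluation code.

```python
import random

def shuffle_choices(question, choices, answer_idx, seed=0):
    """Return a copy of an MCQ item with its answer options in a new random order.

    The (question, choices, answer_idx) format is an assumption for this sketch;
    real benchmarks such as MMLU store items differently.
    """
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    new_choices = [choices[i] for i in order]
    new_answer_idx = order.index(answer_idx)  # position the gold answer moved to
    return question, new_choices, new_answer_idx

# The gold answer keeps its text ("Paris") but may get a new position and label.
q, c, a = shuffle_choices(
    "What is the capital of France?",
    ["Berlin", "Paris", "Madrid", "Rome"],
    answer_idx=1,
    seed=42,
)
print(c, "gold label:", "ABCD"[a])
```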

Key findings from the paper are as follows:

  1. Leaderboard Instability: The authors identified substantial variability in model rankings under slight perturbations. For instance, randomizing the order of answer choices led to major rank changes, notably with the Yi-6b model shifting from 3rd to 9th place.
  2. Sources of Bias: The paper explores sources of bias such as token and positional biases. LLMs showed a clear preference for specific choice positions and symbols, and these biases affected model performance unpredictably.
  3. Scoring Method Impact: The choice of scoring method (symbol, cloze, or hybrid scoring; see the sketch after this list) was another significant source of instability. Symbol scoring, despite being the most common, led to high selection bias, while cloze scoring reduced bias but yielded the poorest performance scores. Hybrid scoring provided a more balanced evaluation method.
  4. In-Context Knowledge Sensitivity: In-context manipulations like presenting correct or incorrect answers as part of the context led to models either "cheating" by copying the answers or performing poorly when misleading information was included.
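The difference between these answer-selection styles can be sketched as follows. Here `score(prompt, continuation)` is an assumed helper that returns the model's log-likelihood of a continuation given a prompt, and the prompt templates are illustrative rather than the exact formats used in the paper or in lm-evaluation-harness.

```python
def pick_answer(score, question, choices, method="hybrid"):
    """Illustrative comparison of MCQ answer-selection styles.

    `score(prompt, continuation)` is an assumed log-likelihood helper;
    real harnesses implement and normalize this differently.
    """
    labels = "ABCD"[: len(choices)]
    listing = "\n".join(f"{l}. {c}" for l, c in zip(labels, choices))

    if method == "symbol":
        # Symbol scoring: labelled choices appear in the prompt and the
        # model is scored on the label token alone ("A", "B", ...).
        prompt = f"{question}\n{listing}\nAnswer:"
        scores = [score(prompt, f" {l}") for l in labels]
    elif method == "cloze":
        # Cloze scoring: choices are not listed; each answer text is
        # scored directly as a continuation of the question.
        prompt = f"{question}\nAnswer:"
        scores = [score(prompt, f" {c}") for c in choices]
    else:
        # Hybrid scoring: labelled choices appear in the prompt, but the
        # answer text (not the label) is what gets scored.
        prompt = f"{question}\n{listing}\nAnswer:"
        scores = [score(prompt, f" {c}") for c in choices]

    return max(range(len(choices)), key=lambda i: scores[i])
```

In the paper's terminology, the first branch corresponds to symbol scoring, the second to cloze scoring, and the third to the hybrid approach the authors recommend as a more balanced default.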

The implications of these findings are manifold. First, more robust and consistent benchmark designs are needed to ensure reliable evaluation of LLMs; relying on unstable benchmarks can lead to inefficiencies and misallocation of resources, especially given the high costs of training and deploying LLMs. Second, understanding and mitigating biases in LLMs is critical for the validity of their evaluations, since a preference for specific formats or symbols can skew assessments of a model's true capabilities.

Future research should focus on developing benchmarks that are resistant to minor perturbations. This might include integrating a variety of scoring methods, standardizing choice formats, and ensuring the randomization of answer orders in a consistent manner. Additionally, there is a need for transparency in the training datasets used for LLM development to address concerns about potential overfitting to benchmark formats.
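One way to quantify how resistant a leaderboard is to such perturbations is to compare the rankings before and after a perturbation, for example via per-model rank shifts and a Kendall-tau-style agreement score. The sketch below assumes each leaderboard is simply an ordered list of model names; the names are placeholders, not results from the paper.

```python
from itertools import combinations

def rank_sensitivity(baseline, perturbed):
    """Compare two leaderboards (lists of model names, best first).

    Returns each model's position shift and a Kendall-tau-style
    agreement score in [-1, 1] (1 = identical ordering).
    """
    pos_a = {m: i for i, m in enumerate(baseline)}
    pos_b = {m: i for i, m in enumerate(perturbed)}
    shifts = {m: pos_b[m] - pos_a[m] for m in baseline}

    # Fraction of model pairs ordered the same way by both leaderboards,
    # rescaled to [-1, 1] (assumes no ties).
    pairs = list(combinations(baseline, 2))
    concordant = sum(
        1 for x, y in pairs
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0
    )
    tau = 2 * concordant / len(pairs) - 1
    return shifts, tau

shifts, tau = rank_sensitivity(
    ["model-a", "model-b", "model-c", "model-d"],
    ["model-b", "model-a", "model-d", "model-c"],
)
print(shifts, f"tau={tau:.2f}")
```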

In conclusion, this paper underscores the necessity for the AI research community to rethink how LLMs are evaluated and compared. Ensuring the stability and fairness of benchmarks will not only lead to better model selection and resource utilization but also drive the development of more robust AI systems capable of performing reliably in varied and realistic settings. The paper by Alzahrani et al. provides a critical step in this direction, offering valuable insights and practical recommendations for improving the evaluation methodologies in AI research.

Authors (12)
  1. Norah Alzahrani (1 paper)
  2. Hisham Abdullah Alyahya (1 paper)
  3. Yazeed Alnumay (7 papers)
  4. Sultan Alrashed (4 papers)
  5. Shaykhah Alsubaie (1 paper)
  6. Yusef Almushaykeh (1 paper)
  7. Faisal Mirza (1 paper)
  8. Nouf Alotaibi (1 paper)
  9. Nora Altwairesh (1 paper)
  10. Areeb Alowisheq (3 papers)
  11. M Saiful Bari (22 papers)
  12. Haidar Khan (21 papers)