When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
The paper by Norah Alzahrani et al. offers a thorough examination of how sensitive LLM leaderboards are to minor perturbations of multiple-choice question (MCQ) benchmarks. These leaderboards frequently guide practitioners' model selection, yet the authors show that they are unstable under small changes, with significant implications for both theoretical understanding and practical application.
The authors conducted a systematic series of experiments on well-known MCQ benchmarks, most notably Massive Multitask Language Understanding (MMLU). Their experiments reveal that even trivial modifications, such as altering the order of answer choices or switching scoring methods, can cause dramatic shifts in model rankings: a model's rank can move by as many as eight positions, demonstrating the fragility and potential unreliability of these benchmarks.
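To make this kind of perturbation concrete, here is a minimal sketch (not the authors' code) of shuffling the answer order of an MMLU-style item while tracking where the gold answer lands. The helper names, prompt template, and example question are illustrative assumptions.

```python
import random

def shuffle_choices(question, choices, answer_idx, seed=0):
    """Return the item with its answer choices reordered.

    `choices` is the list of option texts and `answer_idx` the index of
    the gold answer; the gold index is re-tracked after shuffling.
    """
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    new_choices = [choices[i] for i in order]
    new_answer_idx = order.index(answer_idx)
    return question, new_choices, new_answer_idx

def build_prompt(question, choices, symbols=("A", "B", "C", "D")):
    """Format an MMLU-style prompt with lettered options."""
    lines = [question] + [f"{s}. {c}" for s, c in zip(symbols, choices)]
    lines.append("Answer:")
    return "\n".join(lines)

# The same item rendered under two different choice orders.
q = "Which gas makes up most of Earth's atmosphere?"
opts = ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"]
for seed in (0, 1):
    _, shuffled, gold = shuffle_choices(q, opts, answer_idx=1, seed=seed)
    print(build_prompt(q, shuffled), "-> gold:", "ABCD"[gold], "\n")
```

A model that is robust to choice order should pick the same underlying answer in both renderings; the paper's finding is that, in practice, rankings built on such items do not behave this way.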
Key findings from the paper are as follows:
- Leaderboard Instability: The authors identified substantial variability in model rankings under slight perturbations. For instance, randomizing the order of answer choices caused major rank changes, with the Yi-6b model dropping from 3rd to 9th place.
- Sources of Bias: The paper identifies token bias (a preference for particular answer symbols) and positional bias (a preference for particular choice positions); both affect model performance unpredictably.
- Scoring Method Impact: The choice of scoring method was another significant source of instability. Symbol scoring (comparing the probabilities of the answer letters), despite being the most common approach, led to high selection bias; cloze scoring (comparing the probabilities of the answer texts without showing the options) reduced bias but produced the lowest scores; hybrid scoring (showing the options but scoring the answer texts) offered a more balanced evaluation. A simplified sketch of the three methods follows this list.
- In-Context Knowledge Sensitivity: In-context manipulations, such as presenting correct or incorrect answers as part of the context, led models either to "cheat" by copying the answers or to perform poorly when misleading information was included.
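The three scoring methods can be sketched as different ways of ranking candidate continuations by log-probability. The snippet below is a simplified illustration, not the paper's implementation: the `LogProbFn` interface, the prompt templates, and the word-count normalization in the cloze variant are assumptions, and real evaluation harnesses differ in such details.

```python
from typing import Callable, Sequence

# Assumed interface: returns the model's log-probability of `continuation`
# given `prompt`. Any harness, or a hand-rolled loop over a causal LM's
# logits, could supply this; here it is a placeholder.
LogProbFn = Callable[[str, str], float]

SYMBOLS = ("A", "B", "C", "D")

def format_options(question: str, choices: Sequence[str]) -> str:
    lines = [question] + [f"{s}. {c}" for s, c in zip(SYMBOLS, choices)]
    return "\n".join(lines) + "\nAnswer:"

def symbol_score(logprob: LogProbFn, question: str, choices: Sequence[str]) -> int:
    """Symbol scoring: show lettered options, compare probabilities of the letters."""
    prompt = format_options(question, choices)
    scores = [logprob(prompt, f" {s}") for s in SYMBOLS[:len(choices)]]
    return max(range(len(choices)), key=scores.__getitem__)

def cloze_score(logprob: LogProbFn, question: str, choices: Sequence[str]) -> int:
    """Cloze scoring: no options shown, compare (length-normalized)
    probabilities of the answer texts themselves."""
    prompt = question + "\nAnswer:"
    scores = [logprob(prompt, " " + c) / max(len(c.split()), 1) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)

def hybrid_score(logprob: LogProbFn, question: str, choices: Sequence[str]) -> int:
    """Hybrid scoring: show lettered options, but compare probabilities
    of the answer texts rather than the letters."""
    prompt = format_options(question, choices)
    scores = [logprob(prompt, " " + c) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)
```

In all three cases the prediction is the argmax over candidate continuations; what differs is what appears in the prompt and what is scored, which is exactly where the symbol and positional biases described above can enter.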
The implications of these findings are manifold. First, more robust and consistent benchmark designs are needed to ensure reliable evaluation of LLMs; relying on unstable benchmarks can lead to inefficiency and misallocation of resources, especially given the high cost of training and deploying LLMs. Second, understanding and mitigating biases in LLMs is critical for the validity of any evaluation: a preference for specific formats or symbols can skew the assessment of a model's true capabilities.
Future research should focus on developing benchmarks that are resistant to minor perturbations. This might include combining several scoring methods, standardizing choice formats, and randomizing answer order in a consistent, reproducible way, for example by reporting accuracy across several seeded choice orders, as sketched below. There is also a need for transparency about the training data used in LLM development, to address concerns about overfitting to benchmark formats.
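As one illustration of consistent randomization, a leaderboard could report accuracy as a mean and spread over several seeded answer orders rather than as a single fixed-order number. The sketch below assumes a `predict(question, choices)` callable that returns the index of the model's chosen option and a dataset of `(question, choices, answer_idx)` tuples; both are placeholders, not part of the paper.

```python
import random
import statistics

def accuracy_under_choice_orders(predict, dataset, seeds=(0, 1, 2, 3, 4)):
    """Report the mean and spread of MCQ accuracy over several seeded
    answer-choice orders, instead of a single fixed-order number."""
    accuracies = []
    for seed in seeds:
        rng = random.Random(seed)
        correct = 0
        for question, choices, answer_idx in dataset:
            order = list(range(len(choices)))
            rng.shuffle(order)
            shuffled = [choices[i] for i in order]
            # The gold answer moves to the slot where its original index landed.
            if predict(question, shuffled) == order.index(answer_idx):
                correct += 1
        accuracies.append(correct / len(dataset))
    return statistics.mean(accuracies), statistics.pstdev(accuracies)
```

A large spread across seeds would be a direct, reportable signal of the instability the paper documents.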
In conclusion, the paper underscores the need for the AI research community to rethink how LLMs are evaluated and compared. Ensuring the stability and fairness of benchmarks will not only improve model selection and resource utilization but also drive the development of more robust AI systems that perform reliably in varied, realistic settings. The work of Alzahrani et al. is a critical step in this direction, offering valuable insights and practical recommendations for improving evaluation methodology in AI research.